jadexq / ELLA

https://jadexq.github.io/ELLA/

Questions after running ELLA #3

Open wgmao opened 1 month ago

wgmao commented 1 month ago

Thank you for developing such a wonderful tool. The tutorial is concise and easy to follow, and I have been trying ELLA on my own dataset. I have a few questions after some preliminary runs.

Thank you!

jadexq commented 1 month ago

Thank you very much for your interest in ELLA!

Thank you!

wgmao commented 1 month ago

Thank you for your prompt response!

RE: sc_total. I tried to make sense of the sc_total column, as I didn’t find an explanation here (https://jadexq.github.io/ELLA/inputs.html). My guess is that the demo data is a small proportion of the complete dataset, so the discrepancy is expected. I am not sure which choice would be better at this point, because the data I have (Xenium) has at most hundreds of transcripts per cell.

RE: .pv_fdr_tl(). Thank you for the clarification!

RE: Sanity check. Is there a way to monitor the loss curve? I feel that would be a very useful utility function, but I couldn’t find instructions on the website. I apologize that I can’t share the data at the moment, but I am happy to follow up with any detailed information you may need.

One follow-up question: I tried to apply the visualization section of the mini_demo to my data. I had no problem plotting the estimated expression intensities, but plotting the cells and genes raised an error from the following line:

 cb_x_, cb_y_ = alpha_shape_.exterior.xy
AttributeError: 'LineString' object has no attribute 'exterior'

Do you have any insight into what may be wrong with the input data? For example, should I sort the boundary coordinates in some ways to make the plotting functions work?

jadexq commented 1 month ago

Thank you for your response! I hope to have a quick follow-up first (and I'll follow up more soon). It is totally understandable that you could not share your data. Instead, if you could share, for a gene of focus, (1) how many cells (approximately) you'd like to work on, and (2) usually how many expression counts this gene has per one cell, that would be very helpful. I can create some synthesized data to test it out and create a demo with Sanity checks and visualizations. If you couldn't share (1) or (2), that's also okay, I can create synthesized data based on some public 10x data. Thank you very much!

pakiessling commented 1 month ago

Hi @jadexq,

I am observing something similar in my MERFISH data. I am testing 200 cells and 550 genes for a start.

Here is a histogram of my p-values: [image]

After FDR correction, all my p-values are 1, and even the individual raw values for the kernels are extremely high.

This is a little bit surprising to me, as just by eye I think I can see some patterns:

Gene 1: [image: gene1]

Gene 2: [image: gene2]

Do these p-values look like what you would expect?

I could also supply my data if that would help (I would have to redact gene names)
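On the FDR observation above, a generic Benjamini-Hochberg sketch may help separate the two effects. It is not ELLA's own procedure (the internals of .pv_fdr_tl() are not shown in this thread), and all p-values below are hypothetical. It only illustrates that when every raw p-value is already large, the adjusted values stay near 1, which would point to the raw p-values rather than the correction step.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values that are all large, as described in this thread.
raw_large = [0.62, 0.71, 0.80, 0.88, 0.93, 0.97]
adj_large = benjamini_hochberg(raw_large)

# Hypothetical raw p-values with a few small ones, for comparison.
raw_mixed = [0.0001, 0.0004, 0.30, 0.55, 0.80, 0.97]
adj_mixed = benjamini_hochberg(raw_mixed)

print(adj_large)
print(adj_mixed)
```

In the second list the two small p-values remain small after adjustment, so a correction of this kind alone would not erase a strong signal.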

jadexq commented 1 month ago

Hi @pakiessling,

Thank you very much for sharing the plots and for echoing the issue! No, the p-values do not look like what I'd expect. Given such strong patterns, I would expect very small p-values. It would be very helpful if you could share your data with me, so that I can try it out myself (and potentially create a demo with synthesized or similar public data; it is totally fine to redact gene names etc.). Thank you very much!

pakiessling commented 1 month ago

I have uploaded the data here

https://rwth-aachen.sciebo.de/s/SGFXSErjDszw1py

let me know if that works

jadexq commented 1 month ago

Thank you very much @pakiessling ! That worked perfectly!

I uploaded a notebook here. (Just let me know if you are okay with me putting it here showing some plots of your redacted data, or I can remove it.) I included a plot for checking the loss/convergence, and a plot for cells corresponding to genes with subcellular patterns. (I probably should integrate these into ELLA as well.)

The main problem that I fixed is that the analytical solution for the homogeneous Poisson process model (the null model) was not working well. I changed the default of the ELLA parameter hpp_solution from analytical to 'numerical'. (You can also specify this when instantiating ELLA with hpp_solution='numerical'.) This may slow ELLA down a little, but the difference should not be noticeable.

jadexq commented 1 month ago

Hi @wgmao

sc_total

Thank you for catching this! I updated the tutorial page by adding the explanation for sc_total.

Sanity check

This notebook includes a plot of the loss function. The values of the loss function at each iteration are stored in loss_nhpp_dict[cell type: str][gene index: int][beta kernel index: int]. I am sorry for the inconvenience at the moment! I'll include a utility function in ELLA, hopefully soon.
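Until such a utility exists, the stored loss values can be summarized by hand. The following is a minimal sketch, not part of ELLA: summarize_loss is a hypothetical helper, the nested container is assumed to be a dict of dicts holding a list of per-iteration losses, and the numbers are synthetic. It counts upward jumps and measures how flat the end of each trace is.

```python
def summarize_loss(losses, tail=10):
    """Summarize one loss trace: upward jumps and flatness of the tail."""
    losses = [float(v) for v in losses]
    if not losses:
        raise ValueError("empty loss trace")
    # Upward steps: iterations where the loss went up instead of down.
    jumps = [b - a for a, b in zip(losses, losses[1:]) if b > a]
    # Relative change over the last `tail` iterations (small means flat).
    start = losses[-tail] if len(losses) >= tail else losses[0]
    scale = max(abs(start), 1e-12)
    return {
        "n_iter": len(losses),
        "n_jumps": len(jumps),
        "max_jump": max(jumps, default=0.0),
        "tail_rel_change": abs(losses[-1] - start) / scale,
    }


# Synthetic stand-in for loss_nhpp_dict[cell type][gene index][kernel index].
loss_nhpp_dict = {
    "type_a": {
        0: {
            0: [100.0 * 0.9 ** k + 50.0 for k in range(200)],  # smooth decay
            1: [80.0, 60.0, 95.0, 55.0, 90.0, 52.0, 88.0, 51.0],  # oscillating
        }
    }
}

report = {
    (cell_type, gene, kernel): summarize_loss(trace)
    for cell_type, genes in loss_nhpp_dict.items()
    for gene, kernels in genes.items()
    for kernel, trace in kernels.items()
}

for key, summary in report.items():
    print(key, summary)
```

A trace with many or large jumps, or with a tail that is still changing noticeably, is a candidate for a closer look at the optimizer settings.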

An error plotting the cells and genes

From the error message, it seems that the shape/boundary of the cell is a LineString without any area inside it, instead of a polygon. This is not expected for 10x data.
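One plausible cause (an assumption, not confirmed in this thread) is a boundary with too few distinct points, or with points that all lie on one line. The sketch below is a plain-Python check, independent of ELLA, that flags such cells before plotting. The helper names and the coordinates are hypothetical, and the vertices of each cell are assumed to be listed in boundary order.

```python
from collections import defaultdict


def shoelace_area(points):
    """Absolute area enclosed by an ordered list of (x, y) vertices."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def find_degenerate_cells(rows, min_points=3, min_area=1e-9):
    """Return {cell_id: reason} for boundaries that cannot form a polygon."""
    by_cell = defaultdict(list)
    for cell_id, x, y in rows:
        by_cell[cell_id].append((float(x), float(y)))
    problems = {}
    for cell_id, pts in by_cell.items():
        distinct = list(dict.fromkeys(pts))  # drop repeats, keep order
        if len(distinct) < min_points:
            problems[cell_id] = f"only {len(distinct)} distinct points"
        elif shoelace_area(distinct) <= min_area:
            problems[cell_id] = "points enclose no area (collinear)"
    return problems


# Hypothetical boundary rows: (cell_id, x, y).
rows = [
    (1, 0.0, 0.0), (1, 4.0, 0.0), (1, 4.0, 3.0), (1, 0.0, 3.0),  # valid
    (2, 10.0, 10.0), (2, 12.0, 12.0),                            # two points
    (3, 0.0, 0.0), (3, 1.0, 1.0), (3, 2.0, 2.0),                 # collinear
]
problems = find_degenerate_cells(rows)
print(problems)
```

Cells reported here could be dropped or re-segmented before calling the plotting code.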

I am working on a public 10x dataset to create a new demo for ELLA. The cell segmentation I got from outs/cell_boundaries.csv.gz looks like:

         vertex_x  vertex_y
cell_id                    
1        849.7875  322.3625
1        844.2625  323.2125
1        841.5000  324.4875
1        843.2000  327.2500
1        844.9000  328.7375
...

I was able to use code like the following to plot the cell boundaries in the 10x data I am working on. Alternatively, the code at the end of the notebook can be used for most datasets.

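import matplotlib.pyplot as plt
from shapely.geometry import Polygon

# Build one Polygon per cell from its boundary vertices.
cell_shape_cl_test = cseg_df_test.groupby('cell_id').apply(lambda group: Polygon(list(zip(group['vertex_x'], group['vertex_y']))))
c = xxx  # placeholder: the cell_id to plot
cell_shape = cell_shape_cl_test.loc[c]
x, y = cell_shape.exterior.xy  # outline coordinates of the polygon
fig, ax = plt.subplots()
ax.plot(x, y, color='gray', lw=1)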

That said, sorting the boundary points by angle around a central point is sometimes suggested. Happy to discuss more on this!
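The angle-sorting idea can be sketched as follows. This is an illustration only, with a hypothetical helper name and made-up points, and it is appropriate only when every boundary point is visible from the centroid (roughly convex cells). For strongly concave cells it can reorder the outline incorrectly.

```python
import math


def sort_by_angle(points):
    """Order (x, y) points counterclockwise around their centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


# Hypothetical input: corners of a unit square in a crossing (bowtie) order.
shuffled = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
ordered = sort_by_angle(shuffled)
print(ordered)
```

The reordered points trace the square without crossing, so a polygon built from them has a well-defined exterior.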

pakiessling commented 1 month ago

Thanks @jadexq. So, to summarize: moving forward, I should use the newest commit, make sure all my boundaries have at least 100 points, and adjust the learning rate. I am curious about adam_learning_rate. Do you think the values you chose in the notebook are superior? Should it be adjusted based on the loss curve (e.g., lowered if there are jumps)?