Questions after running ELLA

wgmao commented 1 month ago

Thank you for developing such a wonderful tool. The tutorial is concise and easy to follow, and I have been trying ELLA on my own dataset. I have a few questions after some preliminary runs.

Does sc_total (one column in expr) represent the total number of transcripts in the assigned cell? If so, all transcripts from the same cell should have exactly the same value. In the mini demo, however, I found that this number is much larger than the number of transcripts listed for the cell.
.pv_fdr_tl() prints out all corrected p values. Is there a way to extract the raw p values before correction?
For my own dataset (Xenium), the corrected p values are all 1 no matter how I tune adam_learning_rate (1e-3, 1e-2) and max_iter (from 10 to 5000). It is possible that the data itself doesn't encode any significant spatial pattern. But is there a sanity check I can do, or any parameters I can investigate, to make sure there is no obvious mistake?

Thank you!

jadexq commented 1 month ago

Thank you very much for your interest in ELLA!

RE sc_total That's correct, sc_total is the total number of transcripts in the cell, and it should be the same number for the same cell. For the mini demo, we only included 5 genes in the analysis, so the total number of transcripts of these 5 genes is much less than the total number of all transcripts. This is a good point -- in this case, would it make more sense to use the total number of the transcripts included or the total count of all transcripts measured? We'd be happy to hear comments on this!
RE .pv_fdr_tl() Yes, the P values before FDR correction are stored in .pv_cauchy_tl. (pv_cauchy_tl holds the Cauchy-combined P values, one per gene, without FDR correction; pv_raw_tl holds the P values before Cauchy combination, by default 22 per gene, one for each of the 22 default Beta kernels.)
RE Sanity check That's a great suggestion! Sanity checks are not currently provided, but we'll integrate some into the ELLA pipeline soon. There are a few ways to check whether the algorithm is behaving, for example plotting the loss function and checking that it has actually converged. On the data side, visualization can help confirm that the input data and the ELLA preprocessing are working correctly. For now, if it is okay to share some of your data, I'd be very happy to run it and see if I can help identify any potential causes.
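
To make the relationship between the per-kernel, per-gene, and FDR-corrected P values concrete, here is a minimal sketch. It uses the standard equal-weight Cauchy combination rule and the Benjamini-Hochberg procedure; whether ELLA uses exactly these weights and this FDR method is an assumption, and the function names, gene names, and P values are all made up for illustration.

```python
import math

def cauchy_combine(pvals):
    """Combine p-values with the equal-weight Cauchy combination rule."""
    # Clip away exact 0 and 1 so tan() stays finite.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in pvals]
    t = sum(math.tan((0.5 - p) * math.pi) for p in clipped) / len(clipped)
    return 0.5 - math.atan(t) / math.pi

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-kernel p-values for two genes (22 kernels each),
# standing in for what the thread calls pv_raw_tl.
raw = {
    "gene_a": [0.5] * 22,
    "gene_b": [0.001] * 2 + [0.8] * 20,
}
# One combined p-value per gene (the role of pv_cauchy_tl).
combined = {g: cauchy_combine(p) for g, p in raw.items()}
# FDR-adjusted values across genes (the role of pv_fdr_tl, assuming BH).
adjusted = dict(zip(combined, bh_adjust(list(combined.values()))))
```

A gene with a couple of very small per-kernel P values can still end up with a small combined P value, which is why inspecting the raw per-kernel values is a useful first check when every corrected value is 1.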

Thank you!

wgmao commented 1 month ago

Thank you for your prompt response!

RE sc_total I tried to make sense of the sc_total column because I didn't find an explanation here (https://jadexq.github.io/ELLA/inputs.html). My guess is that the demo data is a small subset of the complete dataset, so the discrepancy is expected. I have no clue which choice would be better at this point, because the data (Xenium) I have only has at most hundreds of transcripts per cell.
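
Since sc_total should be identical for every row of the same cell, one cheap input check is to look for cells that carry more than one value. The sketch below works on plain dictionaries; only the column name sc_total comes from this thread, while the "cell" and "gene" keys, the function name, and the rows are hypothetical.

```python
from collections import defaultdict

def check_sc_total(rows, cell_key="cell", total_key="sc_total"):
    """Return {cell: sorted distinct sc_total values} for cells with more than one value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[cell_key]].add(row[total_key])
    return {cell: sorted(vals) for cell, vals in seen.items() if len(vals) > 1}

# Synthetic rows standing in for `expr`; every name except sc_total is hypothetical.
rows = [
    {"cell": "c1", "gene": "g1", "sc_total": 240},
    {"cell": "c1", "gene": "g2", "sc_total": 240},
    {"cell": "c2", "gene": "g1", "sc_total": 180},
    {"cell": "c2", "gene": "g3", "sc_total": 175},  # inconsistent on purpose
]
problems = check_sc_total(rows)
```

An empty result means every cell has a single sc_total value. If expr is a pandas DataFrame, grouping by the cell column and calling nunique() on sc_total gives the same information.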

RE .pv_fdr_tl() Thank you for the clarification!

RE Sanity check Is there a way to monitor the loss curve? I feel that would be a very useful utility function, but I couldn't find instructions on the website. I apologize that I can't share the data at the moment, but I am happy to follow up with any detailed information you may need.

One follow-up question: I tried to apply the visualization section of the mini_demo to my data. I have no problem plotting the estimated expression intensities, but there was an error when plotting the cells and genes. It was raised from the following line:

 cb_x_, cb_y_ = alpha_shape_.exterior.xy
AttributeError: 'LineString' object has no attribute 'exterior'

Do you have any insight into what may be wrong with the input data? For example, should I sort the boundary coordinates in some ways to make the plotting functions work?

jadexq commented 1 month ago

Thank you for your response! I'd like to follow up quickly first (and I'll follow up more soon). It is totally understandable that you cannot share your data. Instead, if you could share, for a gene of focus, (1) approximately how many cells you'd like to work on, and (2) how many expression counts this gene usually has per cell, that would be very helpful. I can create some synthesized data to test it out and create a demo with sanity checks and visualizations. If you can't share (1) or (2), that's also okay; I can create synthesized data based on some public 10x data. Thank you very much!

pakiessling commented 1 month ago

Hi @jadexq,

I am observing something similar in my Merfish data. Testing 200 cells and 550 genes for a start.

Here is a histogram of my p-values: [image]

After FDR all my p-values are 1, but even the individual raw values for the kernels are extremely high.

This is a little bit surprising to me, as just by eye I think I can see some patterns:

Gene 1

Gene 2

Do these p-values look like what you would expect?

I could also supply my data if that would help (I would have to redact gene names)

jadexq commented 1 month ago

Hi @pakiessling,

Thank you very much for sharing the plots and for echoing the issue! No, the p-values do not look like what I would expect. Given such strong patterns, I would expect very small p-values. It would be very helpful if you could share your data with me, so that I can try it out myself (and potentially create a demo with synthesized or similar public data; it is totally fine to redact gene names etc.). Thank you very much!

pakiessling commented 1 month ago

I have uploaded the data here

https://rwth-aachen.sciebo.de/s/SGFXSErjDszw1py

let me know if that works

jadexq commented 1 month ago

Thank you very much @pakiessling ! That worked perfectly!

I uploaded a notebook here. (Just let me know if you are okay with me putting it here showing some plots of your redacted data, or I can remove it.) I included a plot for checking the loss/convergence, and a plot for cells corresponding to genes with subcellular patterns. (I probably should integrate these into ELLA as well.)

The main problem that I fixed is that the analytical solution for the homogeneous Poisson process model (the null model) was not working well. I changed the ELLA default parameter hpp_solution from 'analytical' to 'numerical'. (You can also specify this when instantiating ELLA with hpp_solution='numerical'.) This may slow ELLA down a little, but the difference should not be noticeable.

jadexq commented 1 month ago

Hi @wgmao

sc_total

Thank you for catching this! I updated the tutorial page by adding the explanation for sc_total.

Sanity check

This notebook includes a plot of the loss function. The values of the loss function at each iteration are stored in loss_nhpp_dict[cell type: str][gene index: int][beta kernel index: int]. I am sorry for the inconvenience at this moment! I'll include a utility function in ELLA, hopefully soon.
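
As a rough sketch of how the stored loss values could be screened before plotting, the helpers below walk a nested container shaped like loss_nhpp_dict and flag curves that have not flattened out or that contain upward jumps. Everything here is hypothetical: the helper names, the thresholds, the synthetic curves, and the assumption that each innermost entry is a plain sequence of per-iteration loss values.

```python
def _items(container):
    """Iterate (key, value) pairs of a dict, or (index, value) pairs of a sequence."""
    return container.items() if isinstance(container, dict) else enumerate(container)

def has_converged(losses, window=50, rel_tol=1e-4):
    """True if the loss varied by less than rel_tol (relative) over the last `window` iterations."""
    if len(losses) < window:
        return False
    tail = losses[-window:]
    return max(tail) - min(tail) <= rel_tol * max(abs(tail[-1]), 1e-12)

def count_jumps(losses, rel_jump=0.05):
    """Count iterations where the loss rose by more than rel_jump relative to the previous value."""
    return sum(
        1
        for prev, cur in zip(losses, losses[1:])
        if cur - prev > rel_jump * max(abs(prev), 1e-12)
    )

def summarize(loss_dict):
    """Yield (cell_type, gene_idx, kernel_idx, converged, n_jumps) for every stored curve."""
    for cell_type, genes in _items(loss_dict):
        for gene_idx, kernels in _items(genes):
            for kernel_idx, losses in _items(kernels):
                losses = [float(v) for v in losses]
                yield cell_type, gene_idx, kernel_idx, has_converged(losses), count_jumps(losses)

def plot_loss(losses, ax=None):
    """Plot one loss curve; matplotlib is imported lazily because only this helper needs it."""
    import matplotlib.pyplot as plt
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(range(len(losses)), losses)
    ax.set_xlabel("iteration")
    ax.set_ylabel("loss")
    return ax

# Synthetic stand-in for loss_nhpp_dict[cell type][gene index][beta kernel index].
demo = {
    "typeA": {
        0: {
            0: [10 + 100 * 0.9 ** i for i in range(300)],       # smooth decay
            1: [10 + (5 if i % 2 else 0) for i in range(300)],  # oscillating
        }
    }
}
report = list(summarize(demo))
```

Curves flagged as not converged, or with many jumps, are the ones worth plotting with plot_loss and revisiting with a different max_iter or adam_learning_rate.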

An error plotting the cells and genes

The error message suggests that the shape/boundary of the cell is a LineString with no area inside it, rather than a polygon. This is not expected for 10x data.
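
One plausible (but unconfirmed) way to end up with a shape that has no area is a cell whose boundary has fewer than three distinct vertices, or whose vertices all lie on one line. The sketch below screens a boundary table for such cells before plotting. The helper names are hypothetical; cell 1 reuses the vertices shown in this thread, while cells 2 and 3 are made up.

```python
from collections import defaultdict

def is_degenerate(vertices, tol=1e-9):
    """True if the vertices cannot enclose any area (too few distinct points, or all collinear)."""
    # Drop exact duplicates while keeping the original order.
    pts = list(dict.fromkeys((float(x), float(y)) for x, y in vertices))
    if len(pts) < 3:
        return True
    (x0, y0), (x1, y1) = pts[0], pts[1]
    # Cross product of (p1 - p0) with (p - p0) is zero when p lies on the line through p0 and p1.
    return all(
        abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) <= tol
        for x, y in pts[2:]
    )

def degenerate_cells(rows):
    """Group (cell_id, vertex_x, vertex_y) rows by cell and return the ids of degenerate cells."""
    by_cell = defaultdict(list)
    for cell_id, x, y in rows:
        by_cell[cell_id].append((x, y))
    return [cell_id for cell_id, verts in by_cell.items() if is_degenerate(verts)]

rows = [
    (1, 849.7875, 322.3625), (1, 844.2625, 323.2125), (1, 841.5000, 324.4875),
    (1, 843.2000, 327.2500), (1, 844.9000, 328.7375),
    (2, 10.0, 10.0), (2, 12.0, 11.0),             # only two vertices
    (3, 0.0, 0.0), (3, 1.0, 1.0), (3, 2.0, 2.0),  # collinear vertices
]
bad = degenerate_cells(rows)
```

Cells returned by degenerate_cells can be skipped or inspected individually before calling any plotting code that expects a polygon with an exterior.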

I am working on a public 10x dataset to create a new demo for ELLA. The cell segmentation I got from outs/cell_boundaries.csv.gz looks like:

         vertex_x  vertex_y
cell_id                    
1        849.7875  322.3625
1        844.2625  323.2125
1        841.5000  324.4875
1        843.2000  327.2500
1        844.9000  328.7375
...

I was able to use code like the following to plot the cell boundaries in the 10x data I am working on. The code at the end of the notebook should also work for most datasets.

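import matplotlib.pyplot as plt
from shapely.geometry import Polygon

# Build one shapely Polygon per cell from its boundary vertices (in file order).
cell_shape_cl_test = cseg_df_test.groupby('cell_id').apply(lambda group: Polygon(list(zip(group['vertex_x'], group['vertex_y']))))
c = xxx  # cell_id of the cell to plot
cell_shape = cell_shape_cl_test.loc[c]
x, y = cell_shape.exterior.xy
fig, ax = plt.subplots()
ax.plot(x, y, color='gray', lw=1)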

That said, sorting the vertices by angle around a central point is sometimes suggested when the boundary points are not already in order. Happy to discuss more on this!
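
For reference, the angle-sorting idea can be sketched as follows. This is a generic illustration, not part of ELLA: the function names and the unit-square example are made up, and the approach only gives a sensible outline when the cell is roughly star-shaped around its centroid (strongly concave boundaries can be reordered incorrectly).

```python
import math

def sort_by_angle(vertices):
    """Order vertices counter-clockwise by their angle around the vertex centroid."""
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    return sorted(vertices, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

def shoelace_area(vertices):
    """Area enclosed by the vertices when connected in the given order (shoelace formula)."""
    n = len(vertices)
    twice = sum(
        vertices[i][0] * vertices[(i + 1) % n][1] - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice) / 2

# Corners of a unit square listed in a self-crossing order.
scrambled = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
ordered = sort_by_angle(scrambled)
```

Comparing shoelace_area before and after sorting is a quick way to see whether the original vertex order was self-crossing. If the boundary file already lists vertices in order along the outline, as in the table above, no sorting should be needed.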

pakiessling commented 1 month ago

Thanks @jadexq. So to summarize, moving forward I should use the newest commit, make sure all my boundaries have at least 100 points, and adjust the learning rate. I'm curious about adam_learning_rate. Do you think the values you chose in the notebook are superior? Should this be adjusted based on the loss curve (e.g., lowering it if there are jumps)?

jadexq / ELLA

Questions after running ELLA #3