Open wgmao opened 1 month ago
Thank you very much for your interest in ELLA!
sc_total is the total number of transcripts in the cell, and it should be the same number for the same cell. For the mini demo, we only included 5 genes in the analysis, so the total number of transcripts of these 5 genes is much less than the total number of all transcripts. This is a good point. In this case, would it make more sense to use the total number of the transcripts included or the total count of all transcripts measured? We'd be happy to hear comments on this!
.pv_fdr_tl()
Yes, the P values before FDR correction are stored in .pv_cauchy_tl. (pv_cauchy_tl are the Cauchy combined P values, one for each gene, without FDR correction; pv_raw_tl are the P values before Cauchy combination, by default 22 for each gene, corresponding to the 22 default Beta kernels.)
Thank you!
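To make the relationship between the three sets of P values concrete, here is a minimal sketch of the standard equal-weight Cauchy combination followed by a Benjamini-Hochberg adjustment. It assumes equal kernel weights and BH correction; ELLA's own implementation may differ, and the gene names and P values below are made up for illustration.

```python
import math

def cauchy_combine(pvals):
    """Combine p-values with the equal-weight Cauchy combination test."""
    eps = 1e-15  # clip so tan() stays finite at p = 0 or p = 1
    clipped = [min(max(p, eps), 1.0 - eps) for p in pvals]
    t = sum(math.tan((0.5 - p) * math.pi) for p in clipped) / len(clipped)
    return 0.5 - math.atan(t) / math.pi

def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values: 3 genes x 22 kernels (not real ELLA output).
raw = {
    "gene_a": [0.5] * 22,
    "gene_b": [1e-6] + [0.8] * 21,
    "gene_c": [0.9] * 22,
}
combined = {gene: cauchy_combine(p) for gene, p in raw.items()}  # one value per gene
fdr = benjamini_hochberg(list(combined.values()))                # after correction
```

In this toy input a single very small kernel P value is enough to make the combined value for gene_b small, which is the behavior one would expect if at least one kernel fits a real pattern.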
Thank you for your prompt response!
RE sc_total
I tried to make sense of the sc_total column as I didn't find explanations here (https://jadexq.github.io/ELLA/inputs.html). My guess is that the demo data is a small proportion of the complete dataset, so the discrepancy is expected. I have no clue which choice would be better at this point, because the data (Xenium) I have only has at most hundreds of transcripts per cell.
RE .pv_fdr_tl()
Thank you for the clarification!
RE Sanity check
Is there a way to monitor the loss curve? I feel that would be a very useful utility function, but I couldn't find instructions on the website. I apologize that I can't share the data at the moment, but I am happy to follow up with any detailed information you may need.
One follow-up question: I tried to apply the visualization section of the mini_demo to my data. I have no problem plotting the estimated expression intensities, but there was an error plotting the cells and genes, raised from the following line:
cb_x_, cb_y_ = alpha_shape_.exterior.xy
AttributeError: 'LineString' object has no attribute 'exterior'
Do you have any insight into what may be wrong with the input data? For example, should I sort the boundary coordinates in some ways to make the plotting functions work?
Thank you for your response! I hope to have a quick follow-up first (and I'll follow up more soon). It is totally understandable that you could not share your data. Instead, if you could share, for a gene of focus, (1) how many cells (approximately) you'd like to work on, and (2) usually how many expression counts this gene has per one cell, that would be very helpful. I can create some synthesized data to test it out and create a demo with Sanity checks and visualizations. If you couldn't share (1) or (2), that's also okay, I can create synthesized data based on some public 10x data. Thank you very much!
Hi @jadexq,
I am observing something similar in my MERFISH data. I am testing 200 cells and 550 genes for a start.
Here is a histogram of my p-values
After FDR all my p-values are 1, but even the individual raw values for the kernels are extremely high.
This is a little bit surprising to me, as just by eye I think that I can see some patterns:
Gene 1
Gene 2
Do these p-values look like what you would expect?
I could also supply my data if that would help (I would have to redact gene names)
Hi @pakiessling,
Thank you very much for sharing the plots and for echoing on the issue! No, the p-values do not look like I'd expected. Given such strong patterns, I would expect very small p-values. It would be very helpful if you could share your data with me! So that I can try it out by myself (and potentially create a demo with synthesized or similar public data; totally fine to redact gene names etc.). Thank you very much!
I have uploaded the data here
https://rwth-aachen.sciebo.de/s/SGFXSErjDszw1py
let me know if that works
Thank you very much @pakiessling ! That worked perfectly!
I uploaded a notebook here. (Just let me know if you are okay with me putting it here showing some plots of your redacted data, or I can remove it.) I included a plot for checking the loss/convergence, and a plot for cells corresponding to genes with subcellular patterns. (I probably should integrate these into ELLA as well.)
The main problem that I fixed is that the analytical solution for the homogeneous Poisson process model (the null model) was not working well. I updated the ELLA default parameter hpp_solution from 'analytical' to 'numerical'. (You can also specify this when instantiating ELLA with hpp_solution='numerical'.) This may slow ELLA down a little bit, but it should not be noticeable.
Hi @wgmao
sc_total
Thank you for catching this! I updated the tutorial page by adding the explanation for sc_total.
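For anyone who wants to check their own input against the two readings of sc_total discussed in this thread, the following is a rough sketch in plain Python. It verifies that sc_total is constant within each cell and reports the count of the included transcripts next to it. The column names "cell" and "umi" and the example rows are assumptions for illustration, not taken from the ELLA documentation.

```python
from collections import defaultdict

def check_sc_total(rows, cell_key="cell", count_key="umi", total_key="sc_total"):
    """Per-cell diagnostics for sc_total.

    rows: iterable of dicts, one per record of the expression table.
    cell_key and count_key are assumed column names.
    """
    totals = defaultdict(set)    # distinct sc_total values seen per cell
    included = defaultdict(int)  # summed counts of the records present in rows
    for r in rows:
        totals[r[cell_key]].add(r[total_key])
        included[r[cell_key]] += r[count_key]
    return {
        cell: {
            "consistent": len(values) == 1,   # True if one sc_total per cell
            "sc_total": sorted(values),
            "included_count": included[cell],
        }
        for cell, values in totals.items()
    }

# Hypothetical records: cell c2 carries two different sc_total values.
rows = [
    {"cell": "c1", "gene": "g1", "umi": 3, "sc_total": 120},
    {"cell": "c1", "gene": "g2", "umi": 2, "sc_total": 120},
    {"cell": "c2", "gene": "g1", "umi": 4, "sc_total": 80},
    {"cell": "c2", "gene": "g2", "umi": 1, "sc_total": 75},
]
report = check_sc_total(rows)
```

A large gap between included_count and sc_total is expected when only a few genes are analyzed, as in the mini demo; an inconsistent cell points to a problem in how the table was built.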
Sanity check
This notebook includes a plot of the loss function. The values of the loss function at each iteration are stored in loss_nhpp_dict[cell type: str][gene index: int][beta kernel index: int]. I am sorry for the inconvenience at this moment! I'll include a utility function in ELLA, hopefully soon.
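Until such a utility exists, a small helper along these lines can summarize one loss trajectory without plotting. It is only a sketch: it assumes each innermost entry of loss_nhpp_dict is a sequence of per-iteration loss values, and the function name, the thresholds, and the toy dictionary are hypothetical.

```python
def summarize_loss(losses, tail=50, tol=1e-4):
    """Summarize one loss trajectory (per-iteration loss values)."""
    losses = [float(v) for v in losses]
    if not losses:
        raise ValueError("empty loss trajectory")
    # Iterations where the loss went up; many of these may hint at a
    # learning rate that is too large.
    n_increases = sum(1 for a, b in zip(losses, losses[1:]) if b > a)
    # Relative change over the last `tail` iterations; a value near zero
    # suggests the optimizer has reached a plateau.
    window = losses[-tail:]
    start, end = window[0], window[-1]
    rel_change = abs(end - start) / max(abs(start), 1e-12)
    return {
        "n_iter": len(losses),
        "final": end,
        "n_increases": n_increases,
        "rel_change_tail": rel_change,
        "plateaued": rel_change < tol,
    }

# Toy stand-in with the same nesting as loss_nhpp_dict:
# [cell type: str][gene index: int][beta kernel index: int]
toy_loss_dict = {"typeA": {0: {0: [1.0 + 10.0 / (i + 1) ** 2 for i in range(500)]}}}
summary = summarize_loss(toy_loss_dict["typeA"][0][0])
```

Plotting the same list against the iteration index gives the loss curve itself; the summary is just a quick way to scan many genes and kernels for runs that stopped before flattening out.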
An error plotting the cells and genes
From the error message, it seems that the shape/boundary of the cell is a LineString without any area inside it, instead of a polygon. This is not expected for 10x data.
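One way to look for such cells before plotting is to flag boundaries that cannot enclose an area. The sketch below uses only the standard library and the shoelace formula; whether this is what happened in the data above is not known, and the function names and toy boundaries are hypothetical.

```python
def polygon_area(points):
    """Unsigned shoelace area of a closed ring given as (x, y) tuples."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the ring
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def find_degenerate_cells(boundaries, min_area=1e-9):
    """Return ids of cells whose boundary has no usable area.

    boundaries: dict mapping cell id -> list of (x, y) vertices.
    Flags cells with fewer than 3 distinct vertices or (near-)zero area.
    A badly ordered, self-crossing ring can also come out with zero area.
    """
    bad = []
    for cell_id, pts in boundaries.items():
        if len(set(pts)) < 3 or polygon_area(pts) <= min_area:
            bad.append(cell_id)
    return bad

# Hypothetical boundaries for illustration.
boundaries = {
    "square": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "collinear": [(0, 0), (1, 1), (2, 2)],
    "two_points": [(0, 0), (1, 0)],
}
degenerate = find_degenerate_cells(boundaries)
```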
I am working on a public 10x dataset to create a new demo for ELLA. The cell segmentation I got from outs/cell_boundaries.csv.gz looks like:
         vertex_x  vertex_y
cell_id
1        849.7875  322.3625
1        844.2625  323.2125
1        841.5000  324.4875
1        843.2000  327.2500
1        844.9000  328.7375
...
I was able to use code like the following to plot the cell boundaries in the 10x data I am working on (the code at the end of the notebook should also work for most datasets):
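import matplotlib.pyplot as plt
from shapely.geometry import Polygon
# cseg_df_test: boundary table read from outs/cell_boundaries.csv.gz (cell_id, vertex_x, vertex_y)
cell_shape_cl_test = cseg_df_test.groupby('cell_id').apply(lambda group: Polygon(zip(group['vertex_x'], group['vertex_y'])))
c = xxx  # placeholder: the cell_id of the cell to plot
cell_shape = cell_shape_cl_test.loc[c]
x, y = cell_shape.exterior.xy
fig, ax = plt.subplots()
ax.plot(x, y, color='gray', lw=1)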
That said, sorting the boundary vertices by angle around a central point is sometimes suggested. Happy to discuss more on this!
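For reference, sorting by angle around a central point can be sketched as follows. This uses the vertex centroid as the central point, which is an assumption, and it only gives a valid ring when every vertex is visible from that point (for example, convex cells); strongly concave cells can be reordered incorrectly.

```python
import math

def sort_by_angle(points):
    """Order (x, y) vertices counterclockwise around their centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # atan2 returns the angle of each vertex relative to the centroid.
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

# A unit square whose vertices arrive in a crossing order.
scrambled = [(0, 0), (1, 1), (1, 0), (0, 1)]
ordered = sort_by_angle(scrambled)
```

The ordered list can then be passed to Polygon in place of the raw vertex order.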
Thanks @jadexq,
So to summarize, moving forward I should use the newest commit, make sure all my boundaries have at least 100 points, and adjust the learning rate.
I am curious about adam_learning_rate. Do you think the values you chose in the notebook are superior? Should this be adjusted based on the loss curve (e.g., lowering it if there are jumps)?
Thank you for developing such a wonderful tool. The tutorial is concise and easy to follow, and I have been trying ELLA on my own dataset. I have a few questions after some preliminary runs.
Does sc_total (one column in expr) represent the total number of transcripts in the assigned cell? If that's the case, all transcripts from the same cell should have exactly the same number. I found the number in the mini demo is much larger than the total transcript number.
.pv_fdr_tl() prints out all corrected p values. Is there a way to extract the raw p values before correction?
I tried different values of adam_learning_rate (1e-3, 1e-2) and max_iter (from 10 to 5000). It could be possible that the data itself doesn't encode any significant spatial pattern. But is there a sanity check I can do, or any parameters I can investigate, to make sure there is no obvious mistake?
Thank you!