kharchenkolab / Baysor

Bayesian Segmentation of Spatial Transcriptomics Data
https://kharchenkolab.github.io/Baysor/
MIT License

Different results with the same parameters. #137

Open giacuong171 opened 1 week ago

giacuong171 commented 1 week ago

Hi Baysor team,

I'm currently experiencing an issue where Baysor generates different baysor_count tables even though I'm using the same parameters. Has anyone else encountered this problem? I'm using Baysor v0.6.2.

giacuong171 commented 1 week ago

For example, here is part of the segmentation.csv from the first run:

```
x,y,gene,empty,cell,molecule_id,prior_segmentation,confidence,compartment,nuclei_probs,assignment_confidence,is_noise
9717.0,11350.0,Snca,,CR67317a852-11920,2951698,0,0.99832,Cyto,0.13242840648552556,0.42,false
9726.0,11299.0,Snca,,CR67317a852-9821,2951699,0,0.99462,Nuclei,0.5933353653354774,0.78,false
9739.0,11301.0,Snca,,CR67317a852-9821,2951700,0,0.99835,Cyto,0.03185690107778871,0.8,false
10438.0,12060.0,Snca,,CR67317a852-11851,2951701,7513,0.99971,Cyto,0.045458917465199034,0.74,false
```

And here are the same molecules from the second run:

```
x,y,gene,empty,cell,molecule_id,prior_segmentation,confidence,compartment,nuclei_probs,assignment_confidence,is_noise
9717.0,11350.0,Snca,,,2951698,0,0.99832,Cyto,0.13257818690254874,0.12,true
9726.0,11299.0,Snca,,,2951699,0,0.99462,Cyto,0.5026373748293727,0.54,true
9739.0,11301.0,Snca,,CR9da53fb2f-9844,2951700,0,0.99835,Cyto,0.031866318479271016,0.46,false
10438.0,12060.0,Snca,,CR9da53fb2f-10751,2951701,7513,0.99971,Cyto,0.04547240272891517,0.74,false
```

giacuong171 commented 6 days ago

I have tried the new version of Baysor, but the issue still occurs. The following images show the difference between the first run and the second run.

[Screenshot from 2024-09-23 10-56-21]
[Screenshot from 2024-09-23 10-57-45]

VPetukhov commented 4 days ago

Hi @giacuong171 , could you please provide more statistics on the output? For example, the Rand index between the cell assignments of the two runs?

Baysor output is stochastic, so some differences are expected. Usually the differences are in small cells, which should be filtered out (i.e., forcefully assigned to the background) anyway. Larger cells should have stable transcript assignments.
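For illustration, here is a minimal sketch (not a Baysor utility) that counts molecules per cell in segmentation.csv and flags the small cells; the `min_molecules` threshold is a placeholder, not a Baysor default:

```python
import pandas as pd

df = pd.read_csv("segmentation.csv")

# Molecules assigned to each cell; noise rows have an empty `cell` field,
# which pandas reads as NaN and dropna() removes.
cell_sizes = df["cell"].dropna().value_counts()

min_molecules = 10  # placeholder threshold, not a Baysor default
small_cells = cell_sizes[cell_sizes < min_molecules]
print(f"{len(small_cells)} of {len(cell_sizes)} cells have fewer than {min_molecules} molecules")
```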

Also, looking at assignment_confidence helps: in the cases where the assignment changed, the confidence was rather low.

giacuong171 commented 4 days ago

Hi @VPetukhov, thanks for your response. The differences between the two runs are quite small, affecting only around 0.007% of the total molecules. Could you explain how to calculate the Rand index?

Additionally, the two images above show that the assignment_confidence values make it hard to decide which data points should be filtered. For example, the molecules in lines 2917842, 2924302, and 2924429 are classified as non-noise with high assignment_confidence in the first run, but as noise in the second run, despite also having high assignment_confidence there.

VPetukhov commented 4 days ago

If you're using Python, sklearn provides a function for the Rand index; see the sklearn documentation for its description. To run it, you'd need to transform all labels to integers, replacing NaNs with 0.
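A minimal sketch of that computation, assuming `sklearn.metrics.rand_score` is the function in question and that both runs list the molecules in the same order (the file names are placeholders):

```python
import pandas as pd
from sklearn.metrics import rand_score

def cell_labels(path: str) -> pd.Series:
    # Map each cell ID to an integer code; NaN (unassigned molecules) becomes 0.
    cells = pd.read_csv(path)["cell"]
    return cells.astype("category").cat.codes + 1  # cat.codes is -1 for NaN

# Both files must contain the same molecules in the same order.
labels1 = cell_labels("segmentation_run1.csv")
labels2 = cell_labels("segmentation_run2.csv")
print(rand_score(labels1, labels2))
```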

As for assignment_confidence, it's indeed not perfect. But thresholding it at something like 0.8 gives more stable results and reduces contamination in cells.
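For example, a minimal sketch of such a filter (using the 0.8 cutoff mentioned above; this is not an official Baysor post-processing step):

```python
import pandas as pd

df = pd.read_csv("segmentation.csv")

# Unassign molecules with low assignment_confidence so they are treated
# as background when building the count table.
df.loc[df["assignment_confidence"] < 0.8, "cell"] = pd.NA
```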

giacuong171 commented 4 days ago

I obtained a Rand index of 0.894. I also have a question about the cell IDs: each run produces a different prefix code in the cell IDs. Do the numbers in the suffix remain the same, and in the same order, across different runs?

VPetukhov commented 2 days ago

0.89 is pretty good. The prefix is the run ID, and the suffixes do not match between runs.