neurorestore / Libra

MIT License

How to get relatively high-confidence results when using imbalanced data #33

Open thecatgonewithwid opened 2 years ago

thecatgonewithwid commented 2 years ago

Hi,

Many thanks to you and your team for your great contributions to single-cell DE analysis and for making this wonderful package!

I am using Libra to run DE analysis on my own single-cell sequencing dataset. However, I have a few questions about how the two data scenarios below affect the statistical power to detect true DE genes with the pseudobulk methods.

Type one: imbalanced cell numbers, where the number of cells of a given cell type varies dramatically between biological replicates.

For example:

data like this

| Biological replicate | Cell type count | Label |
| --- | --- | --- |
| SampleA | 2000 | control |
| SampleB | 3000 | control |
| SampleC | 4000 | control |
| SampleD | 1000 | case |
| SampleE | 2000 | case |
| SampleF | 1500 | case |

Question: Can I choose a cell number, for instance 1000 or even smaller, as the new cell count for every biological replicate, and then resample each replicate down to that number to produce a balanced dataset for pseudobulk analysis?
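The resampling idea in this question could be sketched roughly as follows. This is an illustration in Python with NumPy, not part of Libra; the function name `downsample_cells` and its parameters are hypothetical, and the cell counts are the ones from the table above.

```python
import numpy as np

def downsample_cells(sample_ids, n_per_sample, seed=0):
    """Return sorted cell indices, keeping at most n_per_sample cells per replicate.

    Hypothetical helper for illustration only; it is not a Libra function.
    """
    rng = np.random.default_rng(seed)
    sample_ids = np.asarray(sample_ids)
    keep = []
    for s in np.unique(sample_ids):
        idx = np.flatnonzero(sample_ids == s)   # cells belonging to replicate s
        k = min(n_per_sample, idx.size)         # never ask for more cells than exist
        keep.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(keep))

# Number of cells of the cell type in each replicate (from the table above)
sizes = {"SampleA": 2000, "SampleB": 3000, "SampleC": 4000,
         "SampleD": 1000, "SampleE": 2000, "SampleF": 1500}
sample_ids = np.repeat(list(sizes), list(sizes.values()))

kept = downsample_cells(sample_ids, n_per_sample=1000)
labels, n_cells = np.unique(sample_ids[kept], return_counts=True)
balanced = dict(zip(labels, n_cells))           # cells retained per replicate
```

The retained indices would then be used to subset the count matrix and metadata before aggregating to pseudobulk. Sampling without replacement discards cells from the larger replicates, so the target number trades balance against the amount of data kept.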

Type two: DE analysis between different cell types

Question: As I understand it, pseudobulk methods are better than single-cell methods for DE analysis within a given cell type. Are they also a good choice for DE analysis between different cell types (that is, for finding important marker genes)?

Please forgive my poor English and awkward question format. I hope to hear from you!

Thanks, Yufeng

jordansquair commented 2 years ago

Balancing your data is an interesting question. We never benchmarked it, but the premise behind modelling the counts should prevent this from being an issue. With regard to marker genes, the same principles should hold in theory, so you could use pseudobulk methods for this. However, it might make your results difficult to compare with existing atlases of your tissue.
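One way to see why unequal cell numbers need not be a problem for count-based pseudobulk analysis is a toy example in which two replicates share the same per-cell expression profile but differ fourfold in cell number. The sketch below (Python with NumPy) sums counts per replicate and scales by library size. It illustrates library-size scaling only, under simulated Poisson data; it is not Libra's implementation, and the names `pseudobulk` and `cpm` are hypothetical.

```python
import numpy as np

def pseudobulk(counts, sample_ids):
    """Sum a cells x genes count matrix into one row per replicate."""
    sample_ids = np.asarray(sample_ids)
    samples = np.unique(sample_ids)
    summed = np.vstack([counts[sample_ids == s].sum(axis=0) for s in samples])
    return samples, summed

def cpm(summed):
    """Scale each replicate's summed counts to counts per million."""
    lib_size = summed.sum(axis=1, keepdims=True)
    return summed / lib_size * 1e6

rng = np.random.default_rng(0)
rates = np.array([50.0, 10.0, 5.0, 30.0])             # assumed mean counts per cell, 4 genes
sample_ids = np.array(["small"] * 1000 + ["big"] * 4000)
counts = rng.poisson(rates, size=(5000, 4))           # same profile in every cell

samples, summed = pseudobulk(counts, sample_ids)      # rows ordered "big", "small"
norm = cpm(summed)
```

Here the raw pseudobulk totals of the two replicates differ by roughly the ratio of their cell numbers, while the library-size-scaled profiles are nearly identical. Replicates with fewer cells still give noisier estimates, which is the part that has not been benchmarked according to the reply above.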