omnideconv / deconvBench

Comparison of 2nd generation deconvolution methods implemented in omnideconv

Step-by-step benchmarking plan #18

Closed FFinotello closed 9 months ago

FFinotello commented 2 years ago

Hi @mlist,

I opened this issue to coordinate the next steps in our benchmarking study.

I'd say the first important decision to make concerns the single-cell data we use (for signature building and pseudo-bulk simulation). I would propose three datasets:

Looking forward to your thoughts.

Meanwhile, I tag @grst @alex-d13 and @LorenzoMerotto who can provide valuable input.

Ciao, Francesca

grst commented 2 years ago

I agree that this is a good selection. Depending on the use case, you could also consider the HLCA instead of my LuCA. Another option might be to use one atlas for signature generation and the disjoint datasets from the other for validation.

Differences

HLCA

LuCA

The healthy controls from LuCA are almost all also part of the HLCA.
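Because of this overlap, picking validation datasets from the other atlas means first removing every study that already went into signature building. A minimal sketch of that check, assuming each atlas exposes a list of study identifiers (the names below are hypothetical placeholders, not the real atlas contents):

```python
def disjoint_datasets(signature_datasets, candidate_datasets):
    """Return the candidate datasets that were NOT used for signature building."""
    return sorted(set(candidate_datasets) - set(signature_datasets))


# Hypothetical study identifiers, for illustration only.
signature_atlas_studies = ["study_A", "study_B", "study_C"]
validation_atlas_studies = ["study_B", "study_C", "study_D", "study_E"]

validation_only = disjoint_datasets(signature_atlas_studies, validation_atlas_studies)
print(validation_only)
```

Matching on study identifiers only works if both atlases name their studies consistently; otherwise the comparison would have to be done on sample or cell identifiers.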

FFinotello commented 2 years ago

Hello @grst, thank you so much for your input!

I think that the HLCA also covers a few diseases and has additional technologies in the extended data. @LorenzoMerotto can say more about that.

But you are right that the "neutrophil" aspect is valuable for our analyses (e.g. simulation with mRNA bias)! Which datasets/studies in LuCA have the highest absolute number (not percentage) of neutrophils? And on which technologies were they based? For our benchmarking, I would focus on 10x and Smart-seq2 only.
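One way to answer the neutrophil question is to count cells per study and technology from the per-cell metadata. The sketch below assumes a pandas table with `dataset`, `platform` and `cell_type` columns and a `Neutrophils` label; all of these names and the toy values are assumptions, not the actual LuCA annotation.

```python
import pandas as pd

# Hypothetical per-cell metadata (e.g. the kind of table stored alongside
# a single-cell atlas). Column names and values are assumed.
obs = pd.DataFrame(
    {
        "dataset": ["study_A", "study_A", "study_A", "study_B", "study_B", "study_C"],
        "platform": ["10x", "10x", "10x", "Smart-seq2", "Smart-seq2", "Drop-seq"],
        "cell_type": ["Neutrophils", "Neutrophils", "T cell", "Neutrophils", "B cell", "Neutrophils"],
    }
)


def neutrophil_counts(obs, platforms=("10x", "Smart-seq2"), label="Neutrophils"):
    """Absolute neutrophil count per (dataset, platform), largest first."""
    keep = obs["platform"].isin(platforms) & (obs["cell_type"] == label)
    counts = obs.loc[keep].groupby(["dataset", "platform"]).size()
    return counts.sort_values(ascending=False).rename("n_neutrophils").reset_index()


ranking = neutrophil_counts(obs)
print(ranking)
```

Ranking by absolute counts rather than fractions matches the question asked here, since the number of available cells is what limits pseudo-bulk simulation.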

Meanwhile... https://twitter.com/ScienceMagazine/status/1524812952712491013 :)

grst commented 2 years ago

Indeed, I was referring to the core atlas.

It seems it's the week of atlases :) Great stuff!

LorenzoMerotto commented 2 years ago

The HLCA contains data from many types of diseases in the extended section, but it seems that they are not 'officially' annotated, meaning that we could re-annotate them ourselves using the core (with all its limitations). As for the technologies included, as @grst said, the majority of the data is 10x, with only a few datasets in the extended version sequenced with Dropseq/Seqwell.

LorenzoMerotto commented 2 years ago

@grst I'm leaving this question here since I don't want to open a new issue. In the LuCA atlas there are both tumor and non-tumor samples. However, even in datasets where all the patients are healthy (non-cancer label), such as the Travaglini dataset, some cells are labeled as 'Tumor cells'. How should we treat that annotation?

grst commented 2 years ago

These would be cells that cluster (for whatever reason) with the tumor cells from other datasets. For the deconvolution benchmark I would treat them as artifacts and exclude them.
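The exclusion suggested above could look roughly like the following. It is only a sketch: the `condition` and `cell_type` column names are assumptions about how the metadata is stored, the toy rows are invented, and only the 'Tumor cells' and 'non-cancer' labels come from this thread.

```python
import pandas as pd

# Hypothetical per-cell metadata; column names and rows are assumed.
obs = pd.DataFrame(
    {
        "dataset": ["healthy_study", "healthy_study", "tumor_study", "tumor_study"],
        "condition": ["non-cancer", "non-cancer", "cancer", "cancer"],
        "cell_type": ["Tumor cells", "T cell", "Tumor cells", "B cell"],
    }
)


def drop_tumor_labels_in_healthy(obs, tumor_label="Tumor cells", healthy_label="non-cancer"):
    """Remove cells annotated as tumor that come from non-cancer samples."""
    artifact = (obs["cell_type"] == tumor_label) & (obs["condition"] == healthy_label)
    return obs.loc[~artifact].copy()


cleaned = drop_tumor_labels_in_healthy(obs)
print(cleaned)
```

Tumor-labeled cells from cancer samples are kept; only the ones from non-cancer samples are dropped.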