single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

Discriminative variants with GTbarcode #92

Open flde opened 8 months ago

flde commented 8 months ago

Dear all,

Many thanks for the great tool and the help on GitHub. I ran Vireo and the results look very promising (single-cell RNA-seq | mode 1a with genome 1k).

I am now wondering which variants actually support the discrimination between donor0 and donor1. I ran GTbarcode but got only one result, which seems too few. However, I can't find any help or documentation about the GTbarcode function or the vireo VCF output file, which would be helpful too.

Do you have recommendations about that? Or would it be possible to parse the vireo vcf with other tools to rank variants?
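
To make the second question concrete, here is a rough sketch of what parsing the donor VCF by hand could look like. It assumes only the generic VCF layout (tab-separated records, a FORMAT column containing GT, one column per donor); the function names `discriminative_variants` and `read_vcf` are made up for illustration and are not part of vireo. It filters for variants whose called genotypes differ between donors rather than ranking them.

```python
import gzip


def discriminative_variants(lines):
    """Yield (chrom, pos, genotypes) for records where donors disagree on GT.

    Hypothetical helper, not a vireo function. `lines` is any iterable of
    VCF text lines (header lines included).
    """
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        genotypes = []
        for sample in fields[9:]:
            parts = sample.split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            # Ignore phasing and allele order: 1|0 and 0/1 count as the same.
            alleles = gt.replace("|", "/").split("/")
            genotypes.append("/".join(sorted(alleles)))
        # Only compare donors that actually have a called genotype.
        called = [g for g in genotypes if "." not in g]
        if len(set(called)) > 1:
            yield fields[0], int(fields[1]), genotypes


def read_vcf(path):
    """Run the filter over a plain or gzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from discriminative_variants(handle)
```

Something like `list(read_vcf("GT_donors.vireo.vcf.gz"))` would then list the candidate variants, provided the file follows the layout assumed above.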

Many thanks for your help! Florian

huangyh09 commented 8 months ago

Hi Florian,

Very good question. In principle, GTbarcode offers a solution for this, but it aims to minimize the number of selected variants, keeping only as many as are needed to discriminate the donors. Understandably, more variants may be wanted. One simple way is to change --randSeed, so that each run gives you different variants with the same information gain.

Alternatively, one can output more (or all) variants for each equivalent group by changing this line (currently it samples only one for output): https://github.com/single-cell-genetics/vireo/blob/master/vireoSNP/utils/variant_select.py#L53

Yuanhua

flde commented 8 months ago

Dear Yuanhua,

Many thanks for your help! I set the following line to return idx instead of idx_use and adjusted the file output writer. I now get a list of variants, and I think their entropy equals the max entropy.

However, can you please help me with a few concepts? When you start the while loop of variant_select, the entropy is computed per variant. You break the loop if the max entropy of one iteration is smaller than the max entropy of the previous iteration. But since the entropy is computed for each variant individually, that loop will always break after one iteration, right?

A bit off topic, but I also used the vartrix format as vireo input, and that does not yield a GT_donors.vireo.vcf.gz file that I could use for GTbarcode. Is there a script or tool to create such a file manually from the output? That would be great.

Many thanks and best wishes, Florian

huangyh09 commented 8 months ago

Hi Florian, Nice fix!

For the "while loop": yes, the entropy is computed for each variant, but on top of the already selected variants, as in this line. So, for each iteration, the entropy is guaranteed not to decrease, and once adding more variants can't increase the entropy compared to the previous iteration, the while loop stops.
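
As a rough illustration of that idea (not the actual variant_select code), the loop can be sketched as below. The function names `barcode_entropy` and `greedy_select`, the toy genotype matrix layout (variants in rows, donors in columns) and the tie-breaking by random choice are all assumptions made for this example. The point is that each candidate is scored together with the variants already chosen, so the score can only stay equal or go up, and the candidates tied at the maximum form the equivalent group from which one is sampled.

```python
import numpy as np


def barcode_entropy(gt_rows):
    """Shannon entropy (bits) of the donor groups induced by the stacked genotypes.

    Each column of `gt_rows` is one donor's barcode over the given variants.
    """
    _, counts = np.unique(gt_rows, axis=1, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def greedy_select(gt, seed=0):
    """Greedily pick variants until the barcode entropy stops increasing.

    Hypothetical sketch: `gt` is a (variants x donors) integer matrix.
    Returns the selected row indices and the final entropy.
    """
    rng = np.random.default_rng(seed)
    gt = np.asarray(gt)
    selected = []
    prev_best = 0.0
    while True:
        # Score every candidate on top of the variants selected so far.
        scores = np.array([
            barcode_entropy(gt[selected + [i], :]) for i in range(gt.shape[0])
        ])
        best = scores.max()
        if best <= prev_best + 1e-12:
            break  # no candidate adds information: stop
        # All candidates tied at the maximum are equivalent; sample one.
        ties = np.flatnonzero(np.isclose(scores, best))
        selected.append(int(rng.choice(ties)))
        prev_best = best
    return selected, prev_best
```

With four donors and the rows `[0,0,1,1]`, `[0,1,0,1]`, `[0,0,0,0]`, the first pass finds two tied variants that each split the donors in half, and the second pass adds the other one to separate all four donors; the third pass cannot improve on that and the loop ends. Changing `seed` changes which tied variant is picked, which mirrors the effect of changing --randSeed described earlier.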

Good to hear that you are using the vartrix format as input. Did you input 3 or 4 files? If using 4 files (the last one being SNPs.vcf.gz), then it can output the GT_donors.vireo.vcf.gz.

Yuanhua

flde commented 8 months ago

Hello Yuanhua, many thanks!

For completeness, I forked the repository and documented all changes.

To be honest, I still don't understand the while loop. You wrote that each iteration will add more variants. But as I understand it, you actually remove all variants with entropy < max_entropy.

I did not use the SNPs.vcf.gz file. And I think I can't recover the GT_donors.vireo.vcf.gz, because the GT_prob and ID_prob information from the gt_results can't be recovered from the vireo output files?

Best wishes, Florian