single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference
https://vireoSNP.readthedocs.io
Apache License 2.0

VCF files: separated or merged #105

Open hsymoon opened 1 month ago

hsymoon commented 1 month ago

Hi developers, thanks for developing this helpful tool. I ran into two questions while using vireo. I have 10x scRNA-seq data from experiments at different time points (batch1 for D1, batch2 for D2, ...). I used cellsnp-lite to call common SNPs for each batch, followed by vireo to demultiplex. We also have a donor.vcf.gz.

Q1: Which of the following two approaches gives more reasonable results?

1) Separated:
CELL_FILE: the cellSNP.cells.vcf.gz from cellsnp-lite for each batch (e.g., batch1.cellSNP.cells.vcf.gz)
DONOR_FILE: donors.sub_Batch1.vcf.gz, generated with:
    bcftools view donor.vcf.gz -R batch1.cellSNP.cells.vcf.gz -Oz -o donors.sub_Batch1.vcf.gz
Command:
    ~/miniconda3/bin/vireo -c batch1.cellSNP.cells.vcf.gz -d donors.sub_Batch1.vcf.gz -o ${re} -N $n --randSeed 2

2) Merged:
CELL_FILE: all.cellSNP.cells.vcf.gz, generated by using "bcftools merge" to merge the cellSNP.cells.vcf.gz files from cellsnp-lite for each batch
DONOR_FILE: donors.sub_All.vcf.gz, generated with:
    bcftools view donor.vcf.gz -R all.cellSNP.cells.vcf.gz -Oz -o donors.sub_All.vcf.gz
Command:
    ~/miniconda3/bin/vireo -c all.cellSNP.cells.vcf.gz -d donors.sub_All.vcf.gz -o ${re} -N $n --randSeed 2

When I tried both, cells in batch1 were assigned to different donors in the separated and merged runs, even though --randSeed was set to the same value. Could you tell me which approach gives more reasonable results, and why? Many thanks.

Q2: Mode 4 in vireo applies when genotypes are available but not confident (or only available for a subset of SNPs). The command is: vireo -c $CELL_DATA -d $DONOR_GT_FILE -o $OUT_DIR --forceLearnGT. Could you give some examples for this mode? Sorry for all the questions.

Thank you very much.

huangyh09 commented 1 month ago

Hi,

Thanks for sharing your experience. For Q1, I would expect "separated" and "merged" to give very similar results if the configuration (coverage, n_cell per donor, balance of donors, etc.) is within a reasonable range. In that case, I would say "separated" is good enough, as the number of SNPs is often sufficient. However, if the number of cells for each donor (or for some minor donor) is very limited (e.g., <100 cells), merging multiple time points may help by increasing the cell numbers for each donor. Note that batches merged after separate SNP calling may use different sets of SNPs, so for the "merged" option I would run cellsnp-lite on all batches together, followed by vireo.
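To check whether a batch falls into the "<100 cells" case, one could count cells per donor from a first vireo run. The sketch below is only an illustration: the function name is hypothetical, and it assumes the donor labels have already been read into a list and that doublets and unassigned cells carry the labels "doublet" and "unassigned".

```python
from collections import Counter


def donors_below_threshold(donor_labels, min_cells=100,
                           skip=("doublet", "unassigned")):
    """Return {donor: n_cells} for donors with fewer than min_cells cells.

    The labels in `skip` are assumed to mark cells that were not assigned
    to a single donor; adjust them to match your own output.
    """
    counts = Counter(label for label in donor_labels if label not in skip)
    return {donor: n for donor, n in counts.items() if n < min_cells}


# Toy labels for one batch (made up for illustration, not real results)
labels = ["donorA"] * 350 + ["donorB"] * 40 + ["doublet"] * 10
print(donors_below_threshold(labels))
```

If a donor shows up here, that batch is a candidate for the "merged" route described above.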

Alternatively, for the problematic batch, you can simply run vireo without a reference genotype and see whether the result aligns better with the "separated" or the "merged" assignment.
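One way to make that comparison concrete is a contingency table of donor labels over the shared cell barcodes. This is a minimal sketch, not part of vireo: the helper names are hypothetical, and the loader assumes a tab-separated assignment table with columns named "cell" and "donor_id" (change the arguments if your files differ). A table is used rather than a plain match rate because a run without reference genotypes produces arbitrary donor labels.

```python
import csv
from collections import Counter


def load_assignments(path, cell_col="cell", donor_col="donor_id"):
    """Read a tab-separated assignment table into {barcode: donor}.

    The column names are assumptions about the file layout.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return {row[cell_col]: row[donor_col] for row in reader}


def contingency(run_a, run_b):
    """Count (donor in run_a, donor in run_b) pairs over shared barcodes."""
    shared = run_a.keys() & run_b.keys()
    return Counter((run_a[cell], run_b[cell]) for cell in shared)


# Toy assignments (made up for illustration, not real results)
with_ref = {"AAAC-1": "donorA", "AAAG-1": "donorA",
            "AACT-1": "donorB", "AAGT-1": "donorB"}
no_ref = {"AAAC-1": "donor1", "AAAG-1": "donor1",
          "AACT-1": "donor0", "AAGT-1": "donor1"}

for (a, b), n in sorted(contingency(with_ref, no_ref).items()):
    print(a, b, n)
```

If each row of the table is dominated by a single column, the two runs agree up to a relabeling of donors. Running this once against the "separated" assignment and once against the "merged" one shows which of them the reference-free run supports.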

For Q2, this is a less commonly used option. It is similar to mode 1 (without genotype), but it uses the donor genotype only as a prior, which can then be updated during estimation. If you feel your genotype has high noise (e.g., from very shallow bulk RNA-seq), you may consider trying it.
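To illustrate what "genotype as a prior that can be updated" means in general, here is a toy Bayesian update for one SNP in one donor. It is not vireo's actual model or code: the function name, the three genotype states, and the expected alternative-allele fractions (0.01, 0.5, 0.99) are all assumptions chosen for illustration.

```python
def genotype_posterior(prior, alt_count, depth,
                       alt_fraction=(0.01, 0.5, 0.99)):
    """Update a prior over genotypes (0/0, 0/1, 1/1) with allele counts.

    Uses a binomial likelihood; the binomial coefficient is the same for
    all genotypes, so it cancels in the normalization and is omitted.
    """
    ref_count = depth - alt_count
    weights = [p * (theta ** alt_count) * ((1 - theta) ** ref_count)
               for p, theta in zip(prior, alt_fraction)]
    total = sum(weights)
    return [w / total for w in weights]


# A noisy reference says "heterozygous", but the pooled reads from cells
# assigned to this donor are almost all alternative alleles (toy numbers).
noisy_prior = [0.05, 0.90, 0.05]
posterior = genotype_posterior(noisy_prior, alt_count=18, depth=20)
print([round(p, 3) for p in posterior])
```

In this toy case the read evidence outweighs the prior, and the homozygous-alternative genotype becomes the most probable. With a fixed genotype, the heterozygous call would be used as given.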

Yuanhua

hsymoon commented 1 month ago

Thanks very much for your valuable reply. It helps me a lot.