single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0

scRNA-seq from different donors as a genotype-vcf input for vireo #100

Open mariafiruleva opened 10 months ago

mariafiruleva commented 10 months ago

Hi!

First of all, thank you for the great tool.

I have single-cell RNA sequencing (Cell Ranger) data for, e.g., 3 donors (=> 3 BAM files, one per donor), as well as pooled scRNA-seq data for the same donors (=> 1 BAM file containing all 3 donors).

I want to call variants for the non-pooled scRNA-seq 3 bam files and then use them as donor-wise vcf inputs for vireo in order to demultiplex the pooled one. What is the best approach to do that?

Thank you very much!

Best, Mariia

hxj5 commented 10 months ago

Hi, thanks for the question.

You may combine the donor-wise VCF files with bcftools merge and then pass the merged VCF to vireo -d $DONOR_GT_FILE. See vireo issue 13 and issue 33 for detailed discussion and its manual for full parameters.
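The two steps above can be sketched as a pair of command builders. This is only an illustration: the file and directory names are hypothetical, and it assumes the per-donor VCFs are bgzip-compressed, indexed, and carry distinct sample names (bcftools merge rejects duplicate sample names unless told otherwise). Only `vireo -d` and `--genoTag` come from this thread; `-c` and `-o` are the cell-data and output-directory options from the vireo manual.

```python
import shlex
from typing import List

# Genotype tags that vireo accepts for --genoTag (PL is the default).
VALID_GENO_TAGS = {"GT", "GP", "PL"}


def bcftools_merge_cmd(donor_vcfs: List[str], merged_vcf: str) -> List[str]:
    """Build a `bcftools merge` command combining per-donor VCFs."""
    if len(donor_vcfs) < 2:
        raise ValueError("bcftools merge needs at least two input VCFs")
    # -Oz writes a bgzip-compressed VCF; -o names the output file.
    return ["bcftools", "merge", "-Oz", "-o", merged_vcf, *donor_vcfs]


def vireo_cmd(cell_data: str, donor_vcf: str, out_dir: str,
              geno_tag: str = "PL") -> List[str]:
    """Build a vireo command that demultiplexes with known donor genotypes."""
    if geno_tag not in VALID_GENO_TAGS:
        raise ValueError(f"unsupported genotype tag: {geno_tag}")
    return ["vireo", "-c", cell_data, "-d", donor_vcf,
            "--genoTag", geno_tag, "-o", out_dir]


# Hypothetical paths, for illustration only.
merge = bcftools_merge_cmd(
    ["donor1.vcf.gz", "donor2.vcf.gz", "donor3.vcf.gz"],
    "donors.merged.vcf.gz",
)
demux = vireo_cmd("pooled_cellsnp_out", "donors.merged.vcf.gz", "vireo_out")
print(shlex.join(merge))
print(shlex.join(demux))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.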

mariafiruleva commented 10 months ago

Thanks a lot for your feedback!

As far as I understand, in issue 33 cellsnp-lite was used on bulk RNA-seq data, which is not my case.

I ran cellsnp-lite in mode 1a (scRNA-seq data with input barcodes, BAM files, and --genotype). The genotype information (GT) is only available in the cellSNP.cells.vcf.gz file, at the single-cell level. I need this information at the donor level in order to use it for demultiplexing.

My question is: should I use a different mode, specify additional parameters for mode 1a, or manually extract the GT information from cellSNP.cells.vcf.gz?

hxj5 commented 10 months ago

EDIT: To genotype 10x scRNA-seq data in a pseudo-bulk manner with cellsnp-lite mode 1b (or mode 2b), it is recommended to subset the BAM file first, by extracting the alignment records with valid cell barcodes only. Here the valid cell barcodes are typically the cell barcodes stored in the cellranger output folder filtered_gene_bc_matrices, which are the cells with high-quality sequencing data.
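The subsetting step can be sketched as a filter over SAM text. This is a minimal illustration, not the cellsnp-lite authors' procedure: it assumes the cell barcode is stored in the `CB:Z:` tag (the Cell Ranger convention) and that the barcode list has one barcode per line, written exactly as it appears in the BAM. The function names are hypothetical.

```python
from typing import Iterable, Iterator, Set


def load_barcodes(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty barcodes, one per line."""
    return {line.strip() for line in lines if line.strip()}


def filter_sam_by_barcode(sam_lines: Iterable[str], valid: Set[str],
                          tag: str = "CB") -> Iterator[str]:
    """Yield header lines plus alignments whose barcode tag is in `valid`."""
    prefix = f"{tag}:Z:"
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at the 12th SAM column.
        for field in fields[11:]:
            if field.startswith(prefix):
                if field[len(prefix):] in valid:
                    yield line
                break
        # Records without the tag are dropped.


def filter_stream(barcode_path: str, sam_in, sam_out) -> None:
    """Filter a SAM stream using the barcodes listed in `barcode_path`."""
    with open(barcode_path) as handle:
        valid = load_barcodes(handle)
    for line in filter_sam_by_barcode(sam_in, valid):
        sam_out.write(line)
```

One way to use it is to stream `samtools view -h in.bam` into `filter_stream(path, sys.stdin, sys.stdout)` and convert the result back to BAM with `samtools view -b`; the filtered BAM can then be genotyped in pseudo-bulk mode.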

We may update cellsnp-lite to enable genotyping specific cells from a 10x scRNA-seq BAM file in a pseudo-bulk manner without the need to subset (e.g., by simply adding "GT" and/or "PL" fields to the cellSNP.base.vcf file, or by adding a --bulk option to explicitly tell cellsnp-lite to genotype in a pseudo-bulk manner when -b is specified). (20230824)


original answer:

Thanks for the clarification.

You may try using cellsnp-lite to genotype each donor in a pseudo-bulk manner (e.g., with cellsnp-lite mode 1b & --genotype). The output cellSNP.cells.vcf.gz should contain GT and PL tags (note that GT, GP, PL are all valid values for vireo --genoTag while PL is the default).
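A pseudo-bulk run per donor might be assembled as follows. This is a sketch under assumptions: the flags mirror the mode 1b example in the cellsnp-lite manual (`-s`, `-I`, `-O`, `-R`, `-p`, `--cellTAG None`, `--UMItag`, `--gzip`, `--genotype`), while the paths, the candidate-SNP VCF, and the helper name are hypothetical.

```python
import shlex
from typing import List


def cellsnp_mode1b_cmd(bam: str, sample_id: str, out_dir: str,
                       region_vcf: str, umi_tag: str = "None",
                       nproc: int = 4) -> List[str]:
    """Build a cellsnp-lite mode 1b (pseudo-bulk) genotyping command."""
    return [
        "cellsnp-lite",
        "-s", bam,             # one BAM per donor
        "-I", sample_id,       # sample ID written to the output VCF
        "-O", out_dir,
        "-R", region_vcf,      # candidate SNPs to genotype
        "-p", str(nproc),
        "--cellTAG", "None",   # no cell barcodes: treat the BAM as one sample
        "--UMItag", umi_tag,   # "None" counts reads; a UMI tag counts UMIs
        "--gzip",
        "--genotype",          # emit genotype fields such as GT and PL
    ]


# Hypothetical inputs: one command per donor BAM.
donor_bams = {"donor1": "donor1.bam", "donor2": "donor2.bam"}
commands = [
    cellsnp_mode1b_cmd(bam, name, f"cellsnp_{name}", "candidate_snps.vcf.gz")
    for name, bam in donor_bams.items()
]
for cmd in commands:
    print(shlex.join(cmd))
```

Giving each donor its own sample ID keeps the sample names distinct, which matters if the per-donor VCFs are merged afterwards.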

mariafiruleva commented 10 months ago

Thanks a lot again!

I'm wondering whether I should specify --UMItag in that case (mode 1b + --genotype). Do you have any concerns about that?

hxj5 commented 10 months ago

Specifying --UMItag in mode 1b still works; it should then count UMIs instead of reads.

mariafiruleva commented 10 months ago

I ran the analysis in two ways: one without providing GT, and the other with mode 1b and --genotype (as you suggested).

The results are highly similar: more than 95% of the cells assigned to specific donors based on GT match the cells assigned to anonymously labelled donors without GT, and the unassigned cells and doublets also overlap heavily between the two runs. I was also happy to see that cells identified as unassigned in both runs were low-quality cells, based on their mitochondrial content, number of genes, and number of counts.
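A comparison like this can be sketched in a few lines. This is an illustration rather than the script used above: it assumes each run is available as a mapping from cell barcode to donor label (for instance read from the `cell` and `donor_id` columns of vireo's donor_ids.tsv), and it pairs each anonymous label with the known donor it shares the most cells with, which is a simple greedy match and not guaranteed to be one-to-one.

```python
from collections import Counter, defaultdict
from typing import Dict

# Labels that do not refer to a single donor.
NON_SINGLET = {"doublet", "unassigned"}


def match_donors(with_gt: Dict[str, str],
                 without_gt: Dict[str, str]) -> Dict[str, str]:
    """Map each anonymous donor label to its best-overlapping known donor."""
    counts = defaultdict(Counter)
    for cell, anon in without_gt.items():
        known = with_gt.get(cell)
        if known is None or anon in NON_SINGLET or known in NON_SINGLET:
            continue
        counts[anon][known] += 1
    return {anon: c.most_common(1)[0][0] for anon, c in counts.items()}


def concordance(with_gt: Dict[str, str], without_gt: Dict[str, str]) -> float:
    """Fraction of shared cells given the same label after donor matching."""
    mapping = match_donors(with_gt, without_gt)
    shared = [cell for cell in with_gt if cell in without_gt]
    if not shared:
        return 0.0
    agree = sum(
        1 for cell in shared
        # doublet/unassigned labels are compared as-is.
        if mapping.get(without_gt[cell], without_gt[cell]) == with_gt[cell]
    )
    return agree / len(shared)


# Toy example with made-up cells and labels.
run_with_gt = {"c1": "D1", "c2": "D1", "c3": "D2",
               "c4": "doublet", "c5": "unassigned"}
run_without_gt = {"c1": "donor0", "c2": "donor0", "c3": "donor1",
                  "c4": "doublet", "c5": "donor1"}
print(match_donors(run_with_gt, run_without_gt))
print(concordance(run_with_gt, run_without_gt))
```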

Thanks a lot for your help!