single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0

Filter SNPs by minMAF after running cellsnp-lite #90

Closed (flde closed this issue 1 year ago)

flde commented 1 year ago

Hello all,

I have clinical samples of mixed host/donor cells. I found that minMAF may need to be optimized for each patient. Thanks to your help, I now understand that minMAF filters the SNP list based on the MAF derived from the single-cell reads (#77).

To generate SNP lists with different minMAF values, I would need to run cellsnp-lite repeatedly, which takes a lot of time and resources. Do I understand correctly that I could instead filter the cellsnp-lite result vcf.gz file with bcftools at different minMAF thresholds before passing it to vireo?

Many thanks again for your help!

Best wishes, Florian

hxj5 commented 1 year ago

Hi Florian,

Thanks for the question. For a typical donor deconvolution task, such as scRNA-seq data of mixed cells from multiple patients, you only need to perform SNP calling and genotyping once before passing the genotypes to vireo. Details on genotyping can be found in the vireo manual. Please correct me if I have misunderstood your question.

Best, Xianjie

flde commented 1 year ago

Hi @hxj5,

My apologies. What I want to test is different minMAF thresholds, e.g. 10%, 5%, and 1%, but I want to avoid running cellsnp once per threshold. So, could I run cellsnp once with minMAF 1% and then filter the resulting cellSNP.base.vcf.gz at 10% and 5% to achieve the same result?

To filter the result file, I would run bcftools view -q 0.05:minor cellSNP.base.vcf.gz and pass the output to vireo.

Best wishes, Florian

hxj5 commented 1 year ago

EDIT: Post-filtering SNPs on the minor allele frequency computed from the REF and ALT alleles in the VCF file can give a different result from filtering SNPs with --minMAF on the cellsnp command line. The difference affects a small subset of SNPs whose major allele (the one with the highest read/UMI count) or minor allele (the second highest) is neither the REF nor the ALT allele but one of the OTH alleles (in mode 1). See the detailed discussion in #93 (20230525).
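To make the caveat concrete, here is a minimal Python sketch with made-up read counts for a single SNP. It assumes that the --minMAF check compares the second most frequent allele against the total count over all four bases; that rule is inferred from the description above, not copied from the cellsnp source, and both function names are hypothetical.

```python
def maf_ref_alt(counts, ref, alt):
    """MAF from REF and ALT counts only, which is all a VCF post-filter can see."""
    ad, rd = counts[alt], counts[ref]
    dp = ad + rd
    return min(ad, rd) / dp if dp else 0.0


def maf_top_two(counts):
    """MAF from the second most frequent allele over all counted bases.

    ASSUMPTION: this mirrors the major/minor allele rule described in the
    thread; the exact denominator used by cellsnp is not confirmed here.
    """
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    return ordered[1] / total if total else 0.0


# Hypothetical SNP: REF=A, ALT=G, but most non-REF reads support T (an OTH allele).
counts = {"A": 90, "C": 0, "G": 2, "T": 8}
threshold = 0.05

post_filter_maf = maf_ref_alt(counts, ref="A", alt="G")  # 2 / 92
run_time_maf = maf_top_two(counts)                       # 8 / 100

kept_by_post_filter = post_filter_maf >= threshold
kept_by_min_maf = run_time_maf >= threshold
```

With these counts the two rules disagree at a 5% threshold: the REF/ALT-based value falls below it, while the top-two value is above it.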


original answer:

Yes, you could post-filter SNPs in the way you mentioned (i.e., run cellsnp with minMAF 1% and then post-filter at 5% and 10%). Thanks for the clarification.

I have little experience with bcftools view -q. Basically, to filter the SNPs output by cellsnp against a minMAF threshold, you could test min(AD/DP, (DP-AD)/DP) < minMAF_threshold for each SNP in the cellSNP.base.vcf file, drop the SNPs for which it holds, and update the three matrices accordingly.
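The test described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the csp_utils.R script or any cellsnp API: it assumes each VCF record carries AD and DP in its INFO column (e.g. AD=2;DP=100;OTH=0), that the rows of the three matrices follow the VCF record order, and that each matrix is held as 1-based (snp, cell, count) triples as in a Matrix Market file. The names parse_info, snps_passing_min_maf and subset_rows are hypothetical.

```python
def parse_info(info):
    """Turn 'AD=3;DP=40;OTH=1' into {'AD': '3', 'DP': '40', 'OTH': '1'}."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def snps_passing_min_maf(vcf_lines, min_maf):
    """Return 1-based record indices whose min(AD/DP, (DP-AD)/DP) >= min_maf."""
    keep = []
    index = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        index += 1
        info = parse_info(line.rstrip("\n").split("\t")[7])
        ad, dp = int(info["AD"]), int(info["DP"])
        if dp == 0:
            continue  # no REF/ALT coverage, so no MAF can be computed
        if min(ad / dp, (dp - ad) / dp) >= min_maf:
            keep.append(index)
    return keep


def subset_rows(entries, keep):
    """Keep only the listed SNP rows and renumber them 1..len(keep)."""
    new_index = {old: new for new, old in enumerate(keep, start=1)}
    return [(new_index[r], c, v) for r, c, v in entries if r in new_index]


# Toy data: three SNPs, two cells.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\tPASS\tAD=2;DP=100;OTH=0",
    "1\t200\t.\tC\tT\t.\tPASS\tAD=30;DP=100;OTH=1",
    "1\t300\t.\tG\tA\t.\tPASS\tAD=94;DP=100;OTH=0",
]
ad_entries = [(1, 1, 2), (2, 1, 10), (2, 2, 20), (3, 2, 94)]

keep = snps_passing_min_maf(vcf_lines, min_maf=0.05)
filtered_ad = subset_rows(ad_entries, keep)
```

The same keep list would be applied to the DP and OTH matrices and to the VCF records, so that all outputs stay aligned before they are passed to vireo.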

flde commented 1 year ago

@hxj5,

Many thanks! One last follow-up: is there a tool you would recommend for updating the matrices? I found https://rdrr.io/github/davismcc/cardelino/man/load_cellSNP_vcf.html, but maybe you have a hint.

Thank you so much for your time!

Best wishes, Florian

hxj5 commented 1 year ago

Hi Florian,

I have uploaded a demo R script csp_utils.R to the scripts/utils dir (6f487d3). You may download the script and then call the update_cellsnp_matrices function to update the three sparse matrices. The usage of the function should be straightforward, although it has not been thoroughly tested.

Best, Xianjie

flde commented 1 year ago

Hi Xianjie,

That is so very kind of you! I think having such an option is very helpful for optimizing the cellsnp output downstream.

In my case, I have host-only samples and mixed host/donor samples. I run cellsnp + vireo on the pooled data and then split the results by sample again, so I can use the host-only samples to estimate sensitivity and specificity.

The challenge is that the ratio of host to donor cells varies, and some host/donor pairs are more closely related genetically than others. I think in such cases one could optimize minMAF a bit. Having a tool to split the SNPs by minMAF is a great enhancement from my point of view.

Again, many thanks and all the best, Florian

flde commented 1 year ago

Hello Xianjie,

I figured out how to load and manipulate the cellsnp matrices. Currently I am running cellsnp on the filtered CellRanger output. However, in the end I will only use cells that pass my QC pipeline, which includes doublet removal etc.

I could now filter the cellsnp matrices for the cell IDs that pass QC and recompute them before running vireo. Would that be a good idea, or is there a flaw in this approach?

Many thanks for your help, Florian

hxj5 commented 1 year ago

Hi Florian,

It should be fine, provided that the number of cells filtered out is small (so that there is little impact on the SNP calling and genotyping).

Best, Xianjie
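The cell-level filtering discussed above could look roughly like the following Python sketch. It is an illustration only: it assumes a matrix is held as 1-based (snp, cell, count) triples with columns in the same order as the barcode list, and the names subset_cells and per_snp_totals are hypothetical, not part of cellsnp-lite or vireo.

```python
def subset_cells(entries, barcodes, passing):
    """Keep only columns whose barcode passed QC and renumber them from 1."""
    keep_cols = [i for i, bc in enumerate(barcodes, start=1) if bc in passing]
    new_index = {old: new for new, old in enumerate(keep_cols, start=1)}
    kept_barcodes = [barcodes[i - 1] for i in keep_cols]
    kept_entries = [(r, new_index[c], v) for r, c, v in entries if c in new_index]
    return kept_entries, kept_barcodes


def per_snp_totals(entries, n_snps):
    """Recompute the per-SNP count summed over the remaining cells."""
    totals = [0] * n_snps
    for row, _col, value in entries:
        totals[row - 1] += value
    return totals


# Toy data: two SNPs, three cells, of which the second one fails QC.
barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
passing = {"AAAC-1", "AAAT-1"}
dp_entries = [(1, 1, 5), (1, 2, 7), (2, 2, 3), (2, 3, 4)]

kept_dp, kept_barcodes = subset_cells(dp_entries, barcodes, passing)
dp_totals = per_snp_totals(kept_dp, n_snps=2)
```

Comparing the recomputed totals with the original per-SNP totals gives a quick check on how much the removed cells contributed, which is the condition mentioned in the reply above.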