single-cell-genetics / cellsnp-lite

Efficient genotyping bi-allelic SNPs on single cells
https://cellsnp-lite.readthedocs.io
Apache License 2.0
124 stars 11 forks source link

OTH ratio 1 #93

Open flde opened 1 year ago

flde commented 1 year ago

Hello cellsnp-lite team,

I run cellsnp-lite with default parameters in mode 1 and get SNPs with OTH ratio 1 OTH_ratio <- Matrix::rowSums(OTH) / Matrix::rowSums(DP+OTH). Does that make sense? Highly appreciate your input.

download

Best wishes, Florian

hxj5 commented 1 year ago

Hi, thanks for the feedback. It could happen that some SNPs have OTH ratio 1 in mode 1, since the REF and ALT alleles are not inferred from data but specified in the input VCF by the user. Sometimes the two specified alleles do not "express", either truely happened as there are other alleles expressed on that SNP, or artifacts due to sequencing / alignment errors etc, especially for those SNPs with low coverage.

flde commented 1 year ago

Hi @hxj5,

Many thanks for your help.

The AD, DP, and OTH matrix can be used to compute the REF, ALT, and OTHER frequency. But the annotation of the matrix is based on the reference VCF file and not derived from the single cell data. For example, a SNP which is ALT in the reference could be the REF in my own data.

Therefore, to filter by minMAF either the REF or ALT frequency needs to be greater than the threshold. But you ignore the OTHER frequency, is that correct?

Also, if AD and DP is based on the reference VCF file and not the single cell data wouldn't that be a problem for Vireo?

Best wishes, Florian

hxj5 commented 1 year ago

Thanks for the good question.

  1. Cellsnp was designed for bi-allelic SNPs. In its mode 1, REF and ALT alleles are specified by user while in mode 2, REF and ALT are inferred from data as the alleles with highest and second highest read(UMI) counts. Therefore, in mode 1, the REF or ALT in the reference VCF could be different from the major or minor allele inferred from data. For example, the ALT in VCF could be REF in the data, as you mentioned.

  2. In cellsnp cmdline (for both mode 1 and 2), MAF is always caculated as the fraction read(UMI)_count_of_minor_allele / read(UMI)_count_of_all_alleles, where the minor allele is the allele with second highest read(UMI) count inferred from data. Sorry for the misleading (updated in issue #77).

  3. Therefore, in mode 1, post-filtering SNPs based on the minimum allele frequency of the REF and ALT alleles in VCF file could be different from filtering SNPs with --minMAF in the cellsnp cmdline, for a small subset of SNPs whose major allele (with highest read/UMI count) or minor allele (second highest) is neither REF or ALT allele but one of the OTH alleles (updated in issue #90).

  4. The number of SNPs whose major or minor allele is one of the OTH alleles is expected to be quite small (in mode 1), given the input reference VCF is reliable (e.g., with common SNPs compiled from 1000 genome project), hence should have limited influence on downstream donor deconvolution.

flde commented 1 year ago

Dear @hxj5,

I was on vaccation last week, sorry for the delay. Many thanks for the detailed answer that is very helpful! I managed to write a script that sorts and filters the cellsnp output into vartrix format. Here the REF and ALT matrix is based on the single cell data. I will do some tests and close update/close the thread afterwards.

Again, many thanks for your time!

Best wishes, Florian

flde commented 1 year ago

Dear @hxj5 and @huangyh09,

Many thanks for helping me with this issue and related #90 #62. I wrote a script to convert the cellsnp output to vatrix format and filter the vatrix matrix by minMAF. Hence, filter the matrix by the ALT frequency according to the single cell data.

For the plots below I show the cellsnp output with minMAF=0 and varied the minMAF for the vatrix output. For each patient we have samples from different time points which are either only host or mixed host/donr samples. I pooled those per patient and run cellsnp mode 1 with minMAF=0. Vireo was run with -M 150. The Vireo output can then be finally by time point to test if only donor samples were annotated correctly (not shown). But below I show only the results for the pooled samples per patient.

Some quick information on the patients: Patient_1: Only very few donor cells expected. Patient_2: Only donor cells Patient_5: Siblings Patient_7: Close relatives Patient_9: Siblings

Vireo is capable of handling samples where there is only one genotype present (patient 2) and seems quite sensible if one genotype is overrepresented (patient 1). In that case smaller minMAF give better results. In case the patient is unrelated (patient 3 and 4) the results are independent of minMAF. In case they are closely related the analysis yields mostly unassigned cells and the change of minMAF either has minor effects (patient_5 and patient_7) or very strong effects (patient 9).

From the second figure I conclude that for closely related patients I could try to relax the min_prob filter to reduce the unassigned cells #24.

download download

I hope that is interesting for you and I want to say many thanks again with all your help!

Best wishes, Florian