Genotyping with information of recipient

single-cell-genetics / vireo

Demultiplexing pooled scRNA-seq data with or without genotype reference

https://vireoSNP.readthedocs.io

Apache License 2.0


Genotyping with information of recipient #62

Closed flde closed 2 years ago

flde commented 2 years ago

Hi all, many thanks for the user-friendly implementation of vireo!

I hope you can give me some advice on the following project. I have single-cell RNA-seq data from donor/recipient pairs over four time points. The first two time points contain recipient cells only; the last two time points contain mixed donor/recipient cells, which I would like to distinguish.

My strategy would be to run cellsnp-lite on the four BAM files first (= four time points). Next, I can extract the significant recipient SNPs from the first two time points with bcftools

bcftools view recepient.vcf.gz -R cellSNP.cells.vcf.gz -Oz -o sub.vcf.gz

and hand that list to vireo via the -d parameter, while the -c cell data comes from the third and fourth time points (= mixed time points)

vireo -c $CELL_DATA -d $RECEPIENT_GT_FILE -o $OUT_DIR -N $n_recepient

Does that make sense, or am I missing something? I highly appreciate your help!

Best, Florian

huangyh09 commented 2 years ago

Hi Florian,

Thanks for your questions and for describing your experimental setup. It looks like you only have genotypes from the recipient but not the donor, so using $RECEPIENT_GT_FILE may not be the best option.

The option you mentioned, pooling all four samples, demultiplexing them with Vireo, and then using the recipient cells as a QC matrix, is actually a very good one.

Alternatively, you could run it on the two mixed samples and check the genotype differences against the recipient (e.g., with this donor matching tutorial; I assume you genotyped with cellsnp-lite in a pseudo-bulk manner).

Yuanhua

flde commented 2 years ago

Hi @huangyh09,

Many thanks for your quick response. I will try the pooling approach first so that I get familiar with the pipeline.

Regarding the second approach, I did not configure cellsnp-lite as pseudo-bulk for the recipient samples. But that makes a lot of sense since it increases the coverage per gene, right? I don't fully understand the tutorial yet. Would you compare the genotype differences across recipients and then use only high-confidence regions to deconvolute the recipient and donor samples, respectively?

Best wishes, Florian

flde commented 2 years ago

Hi @huangyh09,

I pooled the BAM files per patient over all four samples (baseline, day0, day14, day100) and ran cellsnp-lite + vireo as suggested. In the first two samples we only have recipient cells, while the later samples are a mixture of recipient and donor. The results look really good for patients 766, 768, 784, and 785. For patient 764 we only have recipient cells, which makes the analysis obsolete.

However, for patients 763 and 783 the results do not look as good. We know that the cell transplant for patient 763 failed, so we do not expect many donor cells at day14.

I went through the tutorials looking for options to increase the sensitivity of the approach. Do you have any recommendation for what I could try to recover those samples? I have a maximum of 48 threads on the cluster in case it's computationally heavy.

The performance of the tool on the good samples (many cells / good recipient-to-donor balance) is really nice, however. Many thanks!
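The per-sample assignment rates discussed here can be tallied from Vireo's output. The following is a minimal sketch, assuming the `donor_ids.tsv` file with its `cell` and `donor_id` columns; the function name `assignment_rates`, the toy barcodes, and the convention that the sample name follows a `-` in the barcode are hypothetical and only serve the illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def assignment_rates(donor_ids_tsv, sample_of_cell):
    """Return {sample: {donor_id: fraction}} from the text of a donor_ids.tsv file.

    sample_of_cell maps a cell barcode to its sample (time point); how the
    sample is encoded in the barcode is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    reader = csv.DictReader(io.StringIO(donor_ids_tsv), delimiter="\t")
    for row in reader:
        counts[sample_of_cell(row["cell"])][row["donor_id"]] += 1

    rates = {}
    for sample, counter in counts.items():
        total = sum(counter.values())
        rates[sample] = {label: n / total for label, n in counter.items()}
    return rates


# Toy input: made-up barcodes with a hypothetical "-<sample>" suffix.
example_tsv = (
    "cell\tdonor_id\n"
    "AAAC-baseline\tdonor0\n"
    "AAAG-baseline\tdonor0\n"
    "AAAT-day100\tdonor1\n"
    "AACA-day100\tunassigned\n"
)
rates = assignment_rates(example_tsv, lambda barcode: barcode.split("-", 1)[1])
```

In a recipient-only sample such as baseline, a single donor label should dominate; a sizeable share of the other label, of doublets, or of unassigned cells there hints at a demultiplexing problem.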


huangyh09 commented 2 years ago

Hi Florian, Thanks for sharing your results - very cool. I agree that 766 and 768 work nicely. For 763 and 764, as there are few or no cells from the donor, indeed there is not much to do. Of note, Vireo still splits the recipient into two groups, unnecessarily.

The 784 and 785 look reasonable; is the high unassigned rate caused by low coverage? For 784, I'm a bit concerned about the high doublet rate in the baseline sample. For 783, the model may have converged poorly or only to a local optimum. For all three of these samples, I would recommend: 1) increasing the number of initializations to avoid potential local optima. As your computing resources seem OK, you can try -M 200. 2) pooling only day14 and day100 and running vireo again, to see if the unassigned rate and doublet rate go down, especially in 783 (day100) and 784 (day14). If this works nicely, you can compare the genotype of the baseline to the demultiplexed donors; there should be a clear difference.
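The baseline-versus-donor comparison in point 2 could be sketched as follows. This is only an illustration, assuming genotypes have already been reduced to alternate-allele counts (0, 1, 2) per SNP, with `None` for sites without a confident call; the function name `genotype_mismatch_rate` and all genotype values are made up.

```python
def genotype_mismatch_rate(reference, candidate):
    """Fraction of sites, among those called in both, where genotypes differ.

    Genotypes are coded as alternate-allele counts (0, 1, 2); None marks a
    site without a confident call and is skipped.
    """
    shared = [(r, c) for r, c in zip(reference, candidate)
              if r is not None and c is not None]
    if not shared:
        raise ValueError("no sites genotyped in both samples")
    return sum(r != c for r, c in shared) / len(shared)


# Made-up genotypes at six SNPs, for illustration only.
baseline = [0, 1, 2, 1, None, 0]
demultiplexed = {
    "donor0": [0, 1, 2, 1, 0, 0],     # matches baseline wherever both are called
    "donor1": [1, 1, 0, 2, 1, None],  # differs at three of the four shared sites
}
rates = {name: genotype_mismatch_rate(baseline, gt)
         for name, gt in demultiplexed.items()}

# The demultiplexed group closest to the baseline is the likely recipient.
recipient_like = min(rates, key=rates.get)
```

Because Vireo's donor labels are arbitrary, the group with the low mismatch rate against the recipient-only baseline is the one to treat as the recipient; the other group should show the clear difference.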

Hope this helps. Yuanhua

flde commented 2 years ago

Hi @huangyh09,

I talked with my collaborators and they told me that for some patients the donor/recipient pairs are close relatives. I think 783 is a parent-child combination and 784 are siblings. In the case of 763 the transplant probably failed, so it's highly likely that there is only one genotype.

I repeated cellsnp-lite with an allele frequency of 0.05 instead of 0.1. In addition, I set -M 500 since it only takes 20 min. For plotting, I filtered the cells for >1500 UMI | < 15 MT% | removed doublets based on HTO.

I am actually stunned by how well that worked! I am even tempted to use the results without doublet detection, but I'm not sure if they are robust, e.g. there is some misclassification on the baseline with 783.

Many thanks for your dedication and input, Yuanhua. I think I can close the issue now, but if you have any further advice, it's highly appreciated. All the best!
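The plotting filter can be written down as a small sketch, reading the three criteria as jointly required. The `Cell` record, its field names, and the example values are hypothetical; in practice these values would come from the per-cell metadata of the dataset.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    barcode: str
    n_umi: int         # total UMI count of the cell
    pct_mt: float      # percentage of mitochondrial reads
    hto_doublet: bool  # flagged as a doublet by hashtag (HTO) demultiplexing


def passes_qc(cell, min_umi=1500, max_pct_mt=15.0):
    """Keep cells with >1500 UMI, <15% MT, and no HTO doublet call."""
    return (cell.n_umi > min_umi
            and cell.pct_mt < max_pct_mt
            and not cell.hto_doublet)


# Made-up cells, one failing each criterion in turn.
cells = [
    Cell("AAAC", n_umi=3200, pct_mt=4.1, hto_doublet=False),   # passes
    Cell("AAAG", n_umi=900, pct_mt=3.0, hto_doublet=False),    # too few UMIs
    Cell("AAAT", n_umi=5000, pct_mt=22.5, hto_doublet=False),  # too much MT
    Cell("AACA", n_umi=4100, pct_mt=5.5, hto_doublet=True),    # HTO doublet
]
kept = [cell.barcode for cell in cells if passes_qc(cell)]
```

Applying such a filter before plotting, rather than before running vireo, leaves the demultiplexing itself unchanged and only affects which cells are shown.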
