mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
42 stars 7 forks source link

Problems with 3DP1 and 2DL4 #24

Open alicekiki opened 9 months ago

alicekiki commented 9 months ago

Hi~ Thank you for this wonderful work, i would like to know about an issue i came across. KIR3DL3, KIR3DP1, KIR2DL4, and KIR3DL2 are four frame kir genes which in principle every individual should have, however in my analysis there are many individuals dont have 3DP1 or 2DL4 in genotyping results, and i dont find genotype ID for some samples from(http://www.allelefrequencies.net/), i would like to ask if it is normal or if i miss any parameter during my analysis?

mourisl commented 9 months ago

What is your data type, is it WGS, WES or RNA-seq? What was your running command? A recent preprint also noticed this missing of 3DP1 and 2DL4 in about 6% haplotypes (https://www.biorxiv.org/content/10.1101/2023.11.12.566753v1). How many samples do you see missing 3DP1 or 2DL4? Thank you.

alicekiki commented 9 months ago

Thank you for ur reply. My data type is WGS, the running command is: ~/T1K/run-t1k -b "$b.cram" -c ~/T1K/kiridx/kiridx_dna_coord.fa -f ~/T1K/kiridx/kiridx_dnaseq.fa --preset kir-wgs -t 8 --abnormalUnmapFlag -o kir/"kir$b" There are 621 samples, which 320 of them don't have 3DP1 in the genotyping results and 3 of them don't have 2DL4.

mourisl commented 9 months ago

Is your alignment file based on hg38 or CHM13? Can you show me the genotype results from one of the samples? Thank you.

alicekiki commented 9 months ago

My data comes from biobank, According to the UK Biobank FAQ, the CRAM files downloaded from the biobank WGS data are already aligned to the GRCh38 reference genome using the OQFE protocol, so i didn't do extra alignment. Here is one of the samples which dont have 3DP1: ( t1k_genotype.tsv+t1k_allele.tsv) Thank you~

image image
mourisl commented 9 months ago

Based on the abundance values, I think your WGS data has coverage around 20x. This coverage might be a bit low, as we can see alleles from KIR3DS1 is not distinguishable. Maybe a prefilter of removing lowly covered samples in your data set will give better overall genotype results.

alicekiki commented 9 months ago

Thank you for ur advice, i will check it. 😊

alicekiki commented 8 months ago

Hi~ I have checked coverage for those samples which don't show 3DP1, for most samples the coverage is around 30x , here is their coverage plot: image Take one sample which has 55x covarage, here is its genotype result:

image

I would like to know is it the coverage issue or is there any other explanation for this phenomenon? Thank you for your time😊~

mourisl commented 8 months ago

Indeed. Which version of T1K did you use? I recently fixed several issues in the kiridx_dna file, could you please regenerate the kir_dna_seq.fa using t1k-build and give this sample a try? The allele prediction might be wrong, but the gene presence is usually right. For those samples, do you have WES data for some of them?

alicekiki commented 8 months ago

Hi~Thanks for ur reply. I did the analysis in August, the initial version i think. I don't have WES data for those samples. I just used the newset release 1.0.5 and do it again for this sample, here is the genotype result:

image
mourisl commented 8 months ago

Thank you for sharing the results. I think this particular sample may not have 3DP1..

One another thing you can check is whether your CRAM contains unmapped reads, and what are alternative contigs names in it. Thank you.

alicekiki commented 8 months ago

Hi~ I checked GRCh38 and it indeed contains alternative contigs and there are unmapped reads. I would like to know if you have any suggestion on how to deal with this matter? I really appreacite your advice. Thank you for ur time😄~

mourisl commented 8 months ago

Could you please show me the names of these alterantive contigs? T1K also uses reads mapped to alternative contigs and unmapped reads for genotyping, but some reference has very strange alternative contig names, and T1K may miss those.

alicekiki commented 8 months ago

Here contains the alt contigs for the reference. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_assembly_report.txt I also viewed the alignments with IGV, and i looked into the 3DP1 genomic location, it doesnt seem to have alt-aware issue. image Here is an alternative location of 3DP1 and it indeed has mapping quality issue. I would like to know ur inspects on this matter. Thank you~ image

mourisl commented 8 months ago

Many KIR gene regions are homologous to each other, so there will be many multimapped reads, and their MAPQs will be 0. What is the read length of your data?

Could you please try to run T1K by replacing the option "--kir-wgs" with " -s 0.8 --relaxIntronAlign"? This will allow more mismatches in the alignment. It may lower the accuracy of allele calling, but at least we can check whether 3DP1 can be recovered.

alicekiki commented 8 months ago

It is 151 bp. I used the above command and ran the sample above, it showed 3DP1 but a very low quality score.

image image

I ran with another sample which lacks of 3DP1, it still doesn't have 3DP1.

image
mourisl commented 8 months ago

Is your data single-end or paired-end 151bp? For example, for the intermediate output from T1K, do you see sample_candidate.fq or sample_candidate_1.fq and sample_candidate_2.fq?

mourisl commented 8 months ago

It's very hard to explain. One potential error could come when extract candidate reads from the CRAM file, such as the reference download is incomplete and the candidate reads are only partial.

You may consider try to convert the CRAM file to fastq file, i.e. https://github.com/mourisl/T1K_manuscript_evaluation/tree/master/HPRC_process , and then try T1K.

Another way is try another genotyper, such as PING. But they don't support CRAM file, so I guess you can try to use candidate{1,2}.fq as the input.

alicekiki commented 8 months ago

Thank you for the advice, i will try it 😊~

alicekiki commented 8 months ago

Hi, I would like to update the issue I encountered with T1K. I found that converting the cram file to fastq file resolved the problem of missing 3DP1 in the genotyping results. This suggests that there is something in the cram format that interferes with the T1K method. I hope this information is useful for you and other users of T1K. Thank you for your advice and your wonderful work. Happy holidays 😊~

AndyCycle commented 8 months ago

这里是陈衍潮,已收到你的邮件

mourisl commented 8 months ago

Thank you for the update! I will look into this issue, and it may be also relevant to BAM input. Happy holidays!

Bondada20 commented 3 weeks ago

Hi @mourisl! I've managed to convince people in my lab to use T1K to infer HLA types. However, someone wants to see results from Optitype, as it's what they've heard of. My question is, is it fair to use the T1K-generated _candidate_1/2.fq file as input for Optitype? Optitype itself suggests using RazerS3. Would it be a fair comparison between T1K and Optitype using the T1K-generated _candidate_1/2.fq?

AndyCycle commented 3 weeks ago

这里是陈衍潮,已收到你的邮件

mourisl commented 3 weeks ago

@Bondada20 Depends, the candidate reads are selected using very relaxed criteria. So these two files include HLA reads and many other reads that may distantly look like HLA reads. So you may still need to use RazerS3 to do Optitype's mapping to find the exact set of HLA reads that Optitype may desire.

Bondada20 commented 3 weeks ago

@mourisl Thanks! If I understood you correctly, I can use RazerS3 on the T1K-generated _candidate_1/2.fq? I am asking because I don't think Optitypes works with cram files and going from cram to bam will generate files that are too large and probably time-consuming.

mourisl commented 3 weeks ago

Yes, I think RazerS3 on the candidate_1/2.fq should be fine.