Open alicekiki opened 9 months ago
What is your data type, is it WGS, WES or RNA-seq? What was your running command? A recent preprint also noticed this missing of 3DP1 and 2DL4 in about 6% haplotypes (https://www.biorxiv.org/content/10.1101/2023.11.12.566753v1). How many samples do you see missing 3DP1 or 2DL4? Thank you.
Thank you for ur reply. My data type is WGS, the running command is: ~/T1K/run-t1k -b "$b.cram" -c ~/T1K/kiridx/kiridx_dna_coord.fa -f ~/T1K/kiridx/kiridx_dnaseq.fa --preset kir-wgs -t 8 --abnormalUnmapFlag -o kir/"kir$b" There are 621 samples, which 320 of them don't have 3DP1 in the genotyping results and 3 of them don't have 2DL4.
Is your alignment file based on hg38 or CHM13? Can you show me the genotype results from one of the samples? Thank you.
My data comes from biobank, According to the UK Biobank FAQ, the CRAM files downloaded from the biobank WGS data are already aligned to the GRCh38 reference genome using the OQFE protocol, so i didn't do extra alignment. Here is one of the samples which dont have 3DP1: ( t1k_genotype.tsv+t1k_allele.tsv) Thank you~
Based on the abundance values, I think your WGS data has coverage around 20x. This coverage might be a bit low, as we can see alleles from KIR3DS1 is not distinguishable. Maybe a prefilter of removing lowly covered samples in your data set will give better overall genotype results.
Thank you for ur advice, i will check it. 😊
Hi~ I have checked coverage for those samples which don't show 3DP1, for most samples the coverage is around 30x , here is their coverage plot: Take one sample which has 55x covarage, here is its genotype result:
I would like to know is it the coverage issue or is there any other explanation for this phenomenon? Thank you for your time😊~
Indeed. Which version of T1K did you use? I recently fixed several issues in the kiridx_dna file, could you please regenerate the kir_dna_seq.fa using t1k-build and give this sample a try? The allele prediction might be wrong, but the gene presence is usually right. For those samples, do you have WES data for some of them?
Hi~Thanks for ur reply. I did the analysis in August, the initial version i think. I don't have WES data for those samples. I just used the newset release 1.0.5 and do it again for this sample, here is the genotype result:
Thank you for sharing the results. I think this particular sample may not have 3DP1..
One another thing you can check is whether your CRAM contains unmapped reads, and what are alternative contigs names in it. Thank you.
Hi~ I checked GRCh38 and it indeed contains alternative contigs and there are unmapped reads. I would like to know if you have any suggestion on how to deal with this matter? I really appreacite your advice. Thank you for ur time😄~
Could you please show me the names of these alterantive contigs? T1K also uses reads mapped to alternative contigs and unmapped reads for genotyping, but some reference has very strange alternative contig names, and T1K may miss those.
Here contains the alt contigs for the reference. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_assembly_report.txt I also viewed the alignments with IGV, and i looked into the 3DP1 genomic location, it doesnt seem to have alt-aware issue. Here is an alternative location of 3DP1 and it indeed has mapping quality issue. I would like to know ur inspects on this matter. Thank you~
Many KIR gene regions are homologous to each other, so there will be many multimapped reads, and their MAPQs will be 0. What is the read length of your data?
Could you please try to run T1K by replacing the option "--kir-wgs" with " -s 0.8 --relaxIntronAlign"? This will allow more mismatches in the alignment. It may lower the accuracy of allele calling, but at least we can check whether 3DP1 can be recovered.
It is 151 bp. I used the above command and ran the sample above, it showed 3DP1 but a very low quality score.
I ran with another sample which lacks of 3DP1, it still doesn't have 3DP1.
Is your data single-end or paired-end 151bp? For example, for the intermediate output from T1K, do you see sample_candidate.fq or sample_candidate_1.fq and sample_candidate_2.fq?
It's very hard to explain. One potential error could come when extract candidate reads from the CRAM file, such as the reference download is incomplete and the candidate reads are only partial.
You may consider try to convert the CRAM file to fastq file, i.e. https://github.com/mourisl/T1K_manuscript_evaluation/tree/master/HPRC_process , and then try T1K.
Another way is try another genotyper, such as PING. But they don't support CRAM file, so I guess you can try to use candidate{1,2}.fq as the input.
Thank you for the advice, i will try it 😊~
Hi, I would like to update the issue I encountered with T1K. I found that converting the cram file to fastq file resolved the problem of missing 3DP1 in the genotyping results. This suggests that there is something in the cram format that interferes with the T1K method. I hope this information is useful for you and other users of T1K. Thank you for your advice and your wonderful work. Happy holidays 😊~
这里是陈衍潮,已收到你的邮件
Thank you for the update! I will look into this issue, and it may be also relevant to BAM input. Happy holidays!
Hi @mourisl! I've managed to convince people in my lab to use T1K to infer HLA types. However, someone wants to see results from Optitype, as it's what they've heard of. My question is, is it fair to use the T1K-generated _candidate_1/2.fq file as input for Optitype? Optitype itself suggests using RazerS3. Would it be a fair comparison between T1K and Optitype using the T1K-generated _candidate_1/2.fq?
这里是陈衍潮,已收到你的邮件
@Bondada20 Depends, the candidate reads are selected using very relaxed criteria. So these two files include HLA reads and many other reads that may distantly look like HLA reads. So you may still need to use RazerS3 to do Optitype's mapping to find the exact set of HLA reads that Optitype may desire.
@mourisl Thanks! If I understood you correctly, I can use RazerS3 on the T1K-generated _candidate_1/2.fq? I am asking because I don't think Optitypes works with cram files and going from cram to bam will generate files that are too large and probably time-consuming.
Yes, I think RazerS3 on the candidate_1/2.fq should be fine.
Hi~ Thank you for this wonderful work, i would like to know about an issue i came across. KIR3DL3, KIR3DP1, KIR2DL4, and KIR3DL2 are four frame kir genes which in principle every individual should have, however in my analysis there are many individuals dont have 3DP1 or 2DL4 in genotyping results, and i dont find genotype ID for some samples from(http://www.allelefrequencies.net/), i would like to ask if it is normal or if i miss any parameter during my analysis?