mourisl / T1K

T1K is a versatile method for genotyping highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS, or WES data.
MIT License

T1K for PGx #23

Open nbiesot opened 9 months ago

nbiesot commented 9 months ago

Hi,

I am trying to use T1K for PGx, following the step-by-step plan described in the vcf_database. Unfortunately, I am not getting the expected results for my samples. For example, for the CYP2D6 gene I get *4/*86 as output, where I expect *1/*4.

This is the case for both the reference file I created for CYP2D6 according to the step-by-step plan and the reference files in the cyp2d6_idx folder on Git.

What could be possible reasons for not getting the expected outputs?

(The data I am using is from the Genetic Testing Reference Material Coordination Program (GeT-RM). These reference materials contain mutations of clinical importance that have been confirmed by multiple volunteer laboratories using different testing platforms, including for the CYP2D6 gene.)

mourisl commented 9 months ago

Do you mean you did not get CYP2D6*1 series in the output? Could you please share the .dat generated from the procedure? Thank you.

nbiesot commented 9 months ago

Yes, indeed. cyp2d6.txt

(I couldn't upload the .dat file, it was not supported)

mourisl commented 9 months ago

The txt file looks fine, and I can generate reference FASTA files containing CYP2D6*1 or CYP2D6*1.XXX. So are *4/*86 and *1/*4 the genotyping results?

One possible reason is that CYP2D6 is highly homologous to CYP2D7, and you may need to put in some CYP2D7 gene sequences in the reference.

nbiesot commented 9 months ago

Thank you for looking into the file! CYP2D6 is not the only gene I have looked at; I have also examined CYP2C9, CYP2C19, CYP3A5, and CYP4F2. For these genes as well, I do not get the expected output for the 16 samples I tested. If the .dat file looks good, is there another possibility for why I am not getting the expected output for these other genes?

mourisl commented 9 months ago

Can you show me your running commands and your genotype.tsv file? Is your data RNA-seq or from another sequencing platform?

nbiesot commented 9 months ago

The WGS files are available at: https://www.ebi.ac.uk/ena/browser/view/ERR1955327

The command I am using is:

run-t1k -f T1K/vcf_database/cyp2d6_idx/cyp2d6_dna_seq.fa -1 ERR1955327_1.fastq.gz -2 ERR1955327_2.fastq.gz --od ERR1955327/cyp2d6 --alleleDigitUnits 1 --alleleDelimiter . -t 16

The output that results from this is: T1K_ERR1955327_1_genotype.ods

Thank you very much for your effort.

mourisl commented 9 months ago

I would recommend concatenating the dna_seq.fa files from all the CYP genes into one combined FASTA file. This may help resolve reads that align to multiple CYP genes. Another parameter to tune is the "-s" option: the default of 0.8 might be too lenient. You may consider trying values like 0.9 and 0.97.
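
For anyone following along, the suggestion above could be sketched roughly as follows. This is only an illustration: `combine_fasta` is a hypothetical helper, not part of T1K, and every path other than the cyp2d6_idx one quoted earlier in this thread is an assumed placeholder. The helper concatenates the per-gene FASTA files and makes sure a file without a final newline cannot fuse with the next header.

```shell
#!/usr/bin/env bash
# Sketch only: merge per-gene reference FASTA files into one combined reference.
# combine_fasta is a hypothetical helper, not a T1K tool.

# combine_fasta OUT IN1 [IN2 ...]
# Writes all existing, non-empty inputs into OUT, in the order given.
combine_fasta() {
  local out="$1"; shift
  : > "$out"                      # create or truncate the output file
  local f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "skipping missing or empty file: $f" >&2
      continue
    fi
    # awk re-emits every line with a trailing newline, so an input that
    # lacks a final newline cannot run into the next file's header.
    awk '{ print }' "$f" >> "$out"
  done
}

# Usage (paths for genes other than CYP2D6 are assumed placeholders):
#   combine_fasta cyp_combined_dna_seq.fa \
#     T1K/vcf_database/cyp2d6_idx/cyp2d6_dna_seq.fa \
#     path/to/cyp2c9_dna_seq.fa \
#     path/to/cyp2c19_dna_seq.fa
#
# Then rerun with stricter -s values, keeping the other options unchanged:
#   for s in 0.9 0.97; do
#     run-t1k -f cyp_combined_dna_seq.fa \
#       -1 ERR1955327_1.fastq.gz -2 ERR1955327_2.fastq.gz \
#       --od "ERR1955327/cyp_s${s}" \
#       --alleleDigitUnits 1 --alleleDelimiter . -s "$s" -t 16
#   done
```

Writing each run to its own output directory makes it easier to compare the genotype calls across the different "-s" thresholds.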