mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
42 stars 6 forks source link

Problems with KIR2DL5A/B #8

Open fernandogs97BR opened 1 year ago

fernandogs97BR commented 1 year ago

Thank you for creating this valuable tool. We've found it to be the best for analyzing our data compared to other tools, but we've noticed a discrepancy in genotyping the KIR2DL5A/B genes. While other tools like KPI consider it as one gene, your tool identifies them as separate genes, leading to a mismatch. Is it possible to modify the code to report if at least one of the KIR2DL5 genes is present or not?

mourisl commented 1 year ago

KIR2DL5A and KIR2DL5B are homologous, but should still be two different genes, so we report them separately. If you want to put them together, one way is to change the gene names KIR2DL5A and KIR2DL5B to KIR2DL5 in the "kiridx_{rna,dna}_seq.fa" reference file. I think IPD-KIR also notice that they are hard to distinguish, so their allele numberings are separate. Therefore, you only need to change the gene name.

mourisl commented 1 year ago

As a related issue, for the mismatched gene, is its quality score 0?

fernandogs97BR commented 1 year ago

Thank you for your prompt response. I appreciate your suggestion to use the kiridx_dna_seq.fa file, if fact I've already tried that without good results. However, I am still experiencing issues with predicting the KIR2DL5 genotype correctly. It appears that the mismatched gene quality score is consistently greater than 30 in all cases.

mourisl commented 1 year ago

Could you please be more specific about the error after modifying kiridx_dna_seq.fa file? Is it that T1K still predict the presence of KIR2DL5 while KPI says the gene is missing? Could you please paste the output of the t1k_genotype.tsv and KPI's output here? Thank you.

mourisl commented 1 year ago

Hi @fernandogs97BR, I just want to follow up on this issue. Could you please share more information on the new results? Thank you.

fernandogs97BR commented 1 year ago

Hello Li Song, sorry for the late reply, all the problems we are having with KIR2DL5 genes are false positives results. Here for example, KIR2DL5B gene should be negative (tested with MLPA) and it appears to be positive with your pipeline. Even so, KPI reports a negative result for KIR2DL5A/B. As you can see in your pipeline KIR2Dl5B gene has a quantification of 155 with a quality of 45. It exceeds the quality cutoff 30 to consider a positive gene (according to our criteria when making a comparison with about 50 samples). [image: imagen.png] Thank you in advance, Fernando.

El mar, 28 mar 2023 a las 21:19, Li Song @.***>) escribió:

Hi @fernandogs97BR https://github.com/fernandogs97BR, I just want to follow up on this issue. Could you please share more information on the new results? Thank you.

— Reply to this email directly, view it on GitHub https://github.com/mourisl/T1K/issues/8#issuecomment-1487474094, or unsubscribe https://github.com/notifications/unsubscribe-auth/A5OCRKHBW3RGCLOSHHVTD6LW6M2VTANCNFSM6AAAAAAWCIUFAA . You are receiving this because you were mentioned.Message ID: @.***>

-- Fernando Gálvez, Bioinformatician at R&D Department, Journey Genomics, S.L.U. Parque Científico y Empresarial de la UMH Edificio Quorum III Avenida de la Universidad s/n 03202 Elche (Alicante) Spain

Tlf: +34 966 261 268 Fax: +34 965 459 422 www.journeygenomics.com


*AVISO LEGAL: Este mensaje y sus archivos adjuntos van dirigidos exclusivamente a su destinatario, pudiendo contener información confidencial sometida a secreto profesional. No está permitida su comunicación, reproducción o distribución sin la autorización expresa de JOURNEY GENOMICS S.L.U. Si usted no es el destinatario final, por favor elimínelo e infórmenos por esta vía.PROTECCIÓN DE DATOS: De conformidad con lo dispuesto en el Reglamento (UE) 2016/679 de 27 de abril de 2016 (GDPR), le informamos que los datos personales y dirección de correo electrónico, recabados del propio interesado o de fuentes públicas, serán tratados bajo la responsabilidad de JOURNEY GENOMICS S.L.U. para el envío de comunicaciones sobre nuestros productos y servicios y se conservarán mientras exista un interés mutuo para ello. Los datos no serán comunicados a terceros, salvo obligación legal. Le informamos que puede ejercer los derechos de acceso, rectificación, portabilidad y supresión de sus datos y los de limitación y oposición a su tratamiento dirigiéndose a AVENIDA DE LA UNIVERSIDAD - EDIF QUORUM III, S/N 03202 ELCHE-ELX (ALICANTE). Email: @. @.>. Si considera que el tratamiento no se ajusta a la normativa vigente, podrá presentar una reclamación ante la autoridad de control en www.agpd.es http://www.agpd.es/. LEGAL NOTICE: This message and any attachments are intended exclusively for their destination and may contain confidential information subject to professional secrecy. Is not permitted their communication, reproduction or distribution without the express written permission of JOURNEY GENOMICS S.L.U. If you are not the intended recipient, please delete and inform us by this route.DATA PROTECTION: According to the Regulation (EU) 2016/679 of 27 April 2016 (GDPR), we inform you that the personal information and email address, obtained from the data subject or of public sources, will be part of a file of JOURNEY GENOMICS S.L.U., with the finality of it send communications about our products and services and will be preserved while there is mutual interest to do so. The data will not be communicated to third parties, except for legal obligation. If you wish, you may exercise your rights of access, rectification, portability and delete data and limitation and opposition to their treatment by contacting the address AVENIDA DE LA UNIVERSIDAD - EDIF QUORUM III, S/N 03202 ELCHE-ELX (ALICANTE). Email: @. @.> . If you consider that the treatment does not comply with current regulations, you can submit a claim to the control authority at www.agpd.es http://www.agpd.es/.*

mourisl commented 1 year ago

Hi @fernandogs97BR , thank you for sharing more information. But GitHub could not show the image through the email reply...

fernandogs97BR commented 1 year ago

Sorry, here is are the results: KIR2DL1 | 1 | KIR2DL1003 | 437,132 | 60 | . | 0 | −1 KIR2DL2 | 1 | KIR2DL2003 | 437,27 | 60 | . | 0 | −1 KIR2DL3 | 1 | KIR2DL3001 | 358,796 | 60 | . | 0 | −1 KIR2DL4 | 2 | KIR2DL4008 | 514,073 | 60 | KIR2DL4001 | 255,612 | 60 KIR2DL5A | 0 | . | 0 | −1 | . | 0 | −1 KIR2DL5B | 1 | KIR2DL5B004 | 109,284 | 50 | . | 0 | −1 KIR2DP1 | 1 | KIR2DP1002 | 530,15 | 60 | . | 0 | −1 KIR2DS1 | 1 | KIR2DS1006 | 2,83862 | 0 | . | 0 | −1 KIR2DS2 | 2 | KIR2DS2001 | 296,592 | 60 | KIR2DS2006 | 74,7484 | 2 KIR2DS3 | 2 | KIR2DS3001 | 14,9911 | 0 | KIR2DS3010 | 3,59657 | 0 KIR2DS4 | 2 | KIR2DS4003 | 397,027 | 60 | KIR2DS4001 | 348,445 | 60 KIR2DS5 | 1 | KIR2DS5002 | 0,80732 | 0 | . | 0 | −1 KIR3DL1 | 2 | KIR3DL1002 | 392,655 | 60 | KIR3DL1001 | 366,179 | 60 KIR3DL2 | 2 | KIR3DL2001 | 428,362 | 60 | KIR3DL2002 | 383,342 | 60 KIR3DL3 | 2 | KIR3DL3019 | 409,845 | 60 | KIR3DL3008 | 333,898 | 60 KIR3DP1 | 2 | KIR3DP1003 | 300,752 | 60 | KIR3DP1001 | 116,751 | 43 KIR3DS1 | 1 | KIR3DS1078 | 13,1085 | 0 | . | 0 | −1

mourisl commented 1 year ago

It seems there are a lot of reads covering each gene (109,284 FPK for KIR2DL5B?), is it KIR-targeted amplified data? Could you please try a higher value for the option "--crossGeneRate", maybe 0.08?

Just curious, what is the KPI output look like on this dataset? Thank you.

fernandogs97BR commented 1 year ago

I apologize for the delay in responding. I have conducted some tests, but with little success. I have already incorporated the parameter "-s 1" to improve the results (although the KIR2DL5A/B genes still come out incorrectly in many cases) and using the suggestion you mentioned produces the same result. Only by increasing the parameter to 0.4 do the KIR2DL5A/B genes appear negative, but then genes that previously appeared as positive become negative. I am using targeted KIR data sequenced with IonTorrent. This is an example of a both negative KIR2DL5A/B genes "id haplotypes 3DL3 2DS2 2DL2 2DL3 2DP1 2DL1 3DP1 2DL4 3DL1 3DS1 2DL5 2DS3 2DS5 2DS4 2DS1 3DL2 IonXpress_020_prediction.txt uninterpretable Y Y N N Y Y N Y Y N N N N Y N Y"

mourisl commented 1 year ago

Could you please repaste the output from T1K again, some of the abundance numbers look strange such as "KIR2DS5 | 1 | KIR2DS5002 | 0,80732 | 0 | . | 0 | −1"? I think another way is to pick a hard threshold.

fernandogs97BR commented 1 year ago

Since the last example, I have tested many different things... so I do not have that exact example but here you have an example of a KIR2DL5B negative gene sample. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Tables

KIR2DL1 | 2 | KIR2DL1*037 | 732,443 | 60 | KIR2DL1*002 | 712,462 | 60 -- | -- | -- | -- | -- | -- | -- | -- KIR2DL2 | 0 | . | 0 | −1 | . | 0 | −1 KIR2DL3 | 2 | KIR2DL3*002 | 855,96 | 60 | KIR2DL3*001 | 696,856 | 60 KIR2DL4 | 2 | KIR2DL4*008 | 1220,3 | 60 | KIR2DL4*011 | 1137,62 | 60 KIR2DL5A | 0 | . | 0 | −1 | . | 0 | −1 KIR2DL5B | 1 | KIR2DL5B*002 | 109,237 | 32 | . | 0 | −1 KIR2DP1 | 2 | KIR2DP1*002 | 841,24 | 60 | KIR2DP1*003 | 657,861 | 60 KIR2DS1 | 0 | . | 0 | −1 | . | 0 | −1 KIR2DS2 | 1 | KIR2DS2*006 | 1,48262 | 0 | . | 0 | −1 KIR2DS3 | 1 | KIR2DS3*001 | 73,7934 | 0 | . | 0 | −1 KIR2DS4 | 2 | KIR2DS4*003 | 979,654 | 60 | KIR2DS4*010 | 653,884 | 60 KIR2DS5 | 2 | KIR2DS5*034 | 1,12183 | 0 | KIR2DS5*007 | 0,54135 | 0 KIR3DL1 | 2 | KIR3DL1*001 | 968,198 | 60 | KIR3DL1*005 | 744,407 | 60 KIR3DL2 | 2 | KIR3DL2*010 | 840,313 | 60 | KIR3DL2*005 | 321,341 | 60 KIR3DL3 | 2 | KIR3DL3*001 | 967,347 | 60 | KIR3DL3*002 | 945,338 | 60 KIR3DP1 | 2 | KIR3DP1*003,KIR3DP1*049 | 480,499 | 60 | KIR3DP1*042 | 216,534 | 60 KIR3DS1 | 1 | KIR3DS1*013 | 1,31985 | 0 | . | 0 | −1
mourisl commented 1 year ago

KIR2DL5B has a lot of read coverage, so it might be real? Since the targeted sequencing has very deep coverage, I think you can also use the max quality score 60 as the threshold.

fernandogs97BR commented 1 year ago

I have tried adjusting the threshold and the results have improved, but there are still some mismatches. I also have a question about genotyping the ABO gene. If I create a reference file for the ABO gene, similar to the KIR or HLA reference files, by adding the different known mutations to the reference ABO sequence, would it be possible to use this pipeline for genotyping the ABO alleles?"

mourisl commented 1 year ago

Yes. You can also use options like "--alleleDigitUnits", "--alleleDelimiter" to create your own nomenclature and genotyping level. Is the sequence from DNA or RNA? If it is DNA, you may need to specify the exon regions in the EMBL-ENA file. We have a tutorial on creating T1K references from VCF files: https://github.com/mourisl/T1K/tree/master/vcf_database, hope this could be helpful.