mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
52 stars 8 forks source link

AddGeneCoord.pl fails to populate HLA-HFE coordinates #30

Closed dmiller15 closed 5 months ago

dmiller15 commented 5 months ago

I've been testing out the software, and I noticed a discrepancy in HLA-HFE between using a BAM and FASTQ input. Where the FASTQ input would report high abundance and quality, the BAM input would report nothing. I looked through all the read assignments from the FASTQ results, and each one maps to the HFE gene on chr6: https://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000010704;r=6:26087226-26098343.

In your manuscript HLA files (https://github.com/mourisl/T1K_manuscript_evaluation/blob/master/hlaidx_3_44_0.tar.gz) as well as ones I made following the documentation directions, HLA-HFE receives no mapping in the coordinate files:

>HLA-HFE*001:01:01 chr19 -1 -1 +
>HLA-HFE*001:01:02 chr19 -1 -1 +
>HLA-HFE*001:01:03 chr19 -1 -1 +

The reason for this lack of mapping is that the GENCODE and Ensembl GTFs just refer to this gene as HFE. AddGeneCoord.pl was looks for exactly HLA-HFE, comes up with no matches, and leaves the unmapped default.

When I manually alter the coordinates file to have the HLA-HFE mapping match the Ensembl HFE coordinates, the BAM and FASTQ runs agree on the abundance/quality.

To summarize:

I don't think this is an issue for any other HLA contig. The only others that remain unmapped are HLA-DRB3, HLA-DRB4, and HLA-Y. As far as I can tell none of these has a corresponding mapping in the genome.

mourisl commented 5 months ago

Thank you very much for finding this problem. I'll check whether there are annotations containing HLA-H/F/E, or create our own list of gene coordinates and use that as the input for AddGeneCoord.pl.

mourisl commented 5 months ago

Thank you for finding this issue. I just added an option "--gtf-gene-name-mapping" to the AddGeneCoord.pl script. Its default value is "HFE:HLA-HFE" and we can use comma-split string to represent other gene name mappings. This will internally map the gene name in the GTF to the name specified by the user. Hope this can help resolve this issue.

dmiller15 commented 5 months ago

Thanks for the quick response. I am now seeing coordinates for HLA-HFE.