marbl / CHM13

The complete sequence of a human genome

Missing parent gene entry for EPHA2 #92

Open CodeCheong opened 7 months ago

CodeCheong commented 7 months ago

Dear T2T-CHM13 team,

I have been looking through the UCSC GENCODE v35 GFF3 annotation file and found that the gene EPHA2 lacks a parent gene entry. The file contains transcript, exon, CDS, stop_codon and intron entries for this gene, but no gene-level record. Is there any way to obtain its gene-level quantification using this GFF3 file specifically? I ask because I have already mapped my samples using this version of the annotation. Thanks for the help.

diekhans commented 6 months ago

This is a bug in the GFF3 file generation. There is a gene record with the wrong gene_name value. In the EPHA2 case, the gene consists of a mix of transcripts that were projected from human and ones predicted by StringTie long-read transcript models. The StringTie transcripts are assigned names like MSTRG.59. These names were not changed to match the other transcripts in the locus, and the gene record now has the name MSTRG.59.
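To see which loci are affected, one could compare the gene_name on each gene record with the gene_name values on its child transcripts. The following is only a minimal sketch, not the tool used to build the annotation. It assumes that transcripts point to their gene through a `Parent` attribute matching the gene's `ID`, and that both record types carry a `gene_name` attribute; the function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 column-9 string such as 'ID=g1;gene_name=X' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_name_mismatches(lines):
    """Return {gene ID: (gene record name, other names seen on its transcripts)}.

    Assumption: transcript records reference their gene via Parent=<gene ID>.
    """
    gene_names = {}                  # gene ID -> gene_name on the gene record
    child_names = defaultdict(set)   # gene ID -> gene_name values on transcripts
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene" and "ID" in attrs:
            gene_names[attrs["ID"]] = attrs.get("gene_name")
        elif cols[2] == "transcript" and "Parent" in attrs and "gene_name" in attrs:
            child_names[attrs["Parent"]].add(attrs["gene_name"])
    mismatches = {}
    for gene_id, name in gene_names.items():
        others = child_names.get(gene_id, set()) - {name}
        if others:
            mismatches[gene_id] = (name, sorted(others))
    return mismatches
```

For a locus like the one described, a gene record named MSTRG.59 with a child transcript named EPHA2 would be reported with EPHA2 listed as the differing transcript name.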

There are several cases of this in the GFF3, which will take some time to fix.

I am attaching a report of the mismatched gene names, which you might be able to use to write a script that edits the GFF3: gene_annotation_probs.tsv.gz
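A script of that kind might look like the sketch below. The column layout of the attached report is not described in this thread, so the sketch does not read it; instead it takes a hypothetical `corrected_names` dict (gene ID to intended gene name) that you would build from the report yourself. It also assumes the gene record carries `ID` and `gene_name` attributes, and it only rewrites gene records, leaving all other lines unchanged.

```python
import re


def rename_genes(lines, corrected_names):
    """Yield GFF3 lines, replacing gene_name on gene records listed in corrected_names.

    corrected_names: hypothetical mapping {gene ID: corrected gene_name}.
    """
    for line in lines:
        ending = "\n" if line.endswith("\n") else ""
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
            if match and match.group(1) in corrected_names:
                new_name = corrected_names[match.group(1)]
                # A function replacement avoids escape handling in the new name.
                cols[8] = re.sub(
                    r"(^|;)gene_name=[^;]*",
                    lambda m: f"{m.group(1)}gene_name={new_name}",
                    cols[8],
                )
                line = "\t".join(cols) + ending
        yield line
```

Usage would be along the lines of opening the input with `gzip.open(path, "rt")`, passing the file object as `lines`, and writing each yielded line to a new file, so the original annotation stays untouched.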

diekhans commented 6 months ago

Here is an updated version that corrects the gene names. Please give it a try. We will update the web site soon.

https://hgwdev.gi.ucsc.edu/~markd/t2t/cat-update/catLiftOffGenesV1.1.gff3.gz