marbl / CHM13

The complete sequence of a human genome

Problems in gff #52

Open mpovidlov1 opened 2 years ago

mpovidlov1 commented 2 years ago

I was looking at the gene annotation files, in particular http://courtyard.gi.ucsc.edu/~mhauknes/T2T/t2t_Y/annotation_set/CHM13.v2.0.gff3. The file appears to contain multiple problems, mostly exons that touch each other, i.e. with an intron of size 0 between them. I can send examples.

mpovidlov1 commented 2 years ago

@snurk ?

skoren commented 2 years ago

The annotations come from liftoff/CAT, so this is more a question for @mhaukness-ucsc or @diekhans. Are these similar to the issues raised in #31 and #37?

mpovidlov1 commented 2 years ago

Thanks. The other issues mention other problems with earlier versions of the annotation files. Mine is quite specific. The records define exons like this (start end):

100 200
201 300

which means that the intron between them is of size 0.
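To make the arithmetic explicit, here is a minimal sketch, assuming GFF3 coordinates (1-based, both ends inclusive). The function name `zero_length_introns` is made up for illustration and is not part of liftoff, CAT, or any other tool mentioned in this thread.

```python
def zero_length_introns(exons):
    """Return pairs of consecutive exons separated by an intron of size 0.

    `exons` is a list of (start, end) tuples in GFF3 coordinates
    (1-based, both ends inclusive). Hypothetical helper for illustration.
    """
    ordered = sorted(exons)
    hits = []
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        # With inclusive coordinates the intron spans e1 + 1 .. s2 - 1,
        # so its length is s2 - e1 - 1.
        intron_size = s2 - e1 - 1
        if intron_size == 0:
            hits.append(((s1, e1), (s2, e2)))
    return hits


print(zero_length_introns([(100, 200), (201, 300)]))
# prints [((100, 200), (201, 300))]
```

For the exons above the intron length is 201 - 200 - 1 = 0, so the pair is reported.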

mhaukness-ucsc commented 2 years ago

Hi @mpovidlov1, could you please provide some examples? I think this is likely a result of errors present in the original GENCODE annotations, but I'll look into it.

mpovidlov1 commented 2 years ago

Here is an example of the first problematic gene, which starts on line 12:

[problems.txt](https://github.com/marbl/CHM13/files/8455851/problems.txt)
111903 112896 transcript -
111903 112498 exon -
111940 112498 CDS -
111940 111942 stop_codon -
112499 112896 exon -
112499 112877 CDS -
112875 112877 start_codon -

I have a list of more than 200 problematic genes, referenced by line number (attached).
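For anyone who wants to reproduce such a list, the following is a rough sketch of one way to scan GFF3 text for touching exons. It is not the script used to build the attachment. It assumes standard tab-separated 9-column GFF3 records in which each `exon` feature carries a `Parent` attribute. The function name `find_abutting_exons`, the sequence name `chrN`, and the IDs `tx1`, `e1`, `e2` are placeholders, since the excerpt above shows only coordinates, feature type, and strand.

```python
from collections import defaultdict


def find_abutting_exons(lines):
    """Map (seqid, parent) to the exon pairs that leave an intron of size 0.

    Assumes 9-column tab-separated GFF3 and a Parent attribute on exons.
    A multi-valued Parent (comma-separated) is treated as one string here.
    """
    exons = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parent = attrs.get("Parent")
        if parent is None:
            continue
        exons[(cols[0], parent)].append((int(cols[3]), int(cols[4])))

    problems = {}
    for key, spans in exons.items():
        spans.sort()
        # Consecutive exons touch when the next start is the previous end + 1.
        pairs = [(a, b) for a, b in zip(spans, spans[1:]) if b[0] == a[1] + 1]
        if pairs:
            problems[key] = pairs
    return problems


# Coordinates taken from the example above; other columns are placeholders.
sample = [
    "chrN\t.\texon\t111903\t112498\t.\t-\t.\tID=e1;Parent=tx1",
    "chrN\t.\texon\t112499\t112896\t.\t-\t.\tID=e2;Parent=tx1",
]
print(find_abutting_exons(sample))
```

To run it on a real file, pass an open file handle instead of `sample`, since the function only needs an iterable of lines.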

diekhans commented 1 year ago

Issue moved to CAT repo: https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/285