Open mpovidlov1 opened 2 years ago
@snurk ?
The annotations come from liftoff/CAT, so this is more a question for @mhaukness-ucsc or @diekhans. Are these similar to the issues raised in #31 and #37?
Thanks. The other issues mention different problems with earlier versions of the annotation files. Mine is quite specific: the records define consecutive exons like this (start end): 100 200, then 201 300, which means that the intron between them has size 0.
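To make the arithmetic explicit, here is a minimal sketch, assuming 1-based inclusive coordinates as used in GFF3. The function name `intron_lengths` is made up for illustration and is not part of liftoff or CAT.

```python
def intron_lengths(exons):
    """Return the length of each intron between consecutive exons.

    `exons` is a list of (start, end) pairs in 1-based inclusive
    coordinates (the GFF3 convention), all from the same transcript.
    """
    ordered = sorted(exons)
    # With inclusive coordinates, the bases strictly between two exons
    # number next_start - prev_end - 1.
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


print(intron_lengths([(100, 200), (201, 300)]))  # [0]
```

An exon ending at 200 followed by one starting at 201 leaves no bases in between, so the two records describe one contiguous block rather than two exons separated by an intron.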
Hi @mpovidlov1, could you please provide some examples? I think this is likely a result of errors present in the original GENCODE annotations, but I'll look into it.
Here is an example of the first problematic gene, starts on line 12:

| start | end | type | strand |
| --- | --- | --- | --- |
| 111903 | 112896 | transcript | - |
| 111903 | 112498 | exon | - |
| 111940 | 112498 | CDS | - |
| 111940 | 111942 | stop_codon | - |
| 112499 | 112896 | exon | - |
| 112499 | 112877 | CDS | - |
| 112875 | 112877 | start_codon | - |

I have a list of more than 200 problematic genes referenced by line number (attached): [problems.txt](https://github.com/marbl/CHM13/files/8455851/problems.txt)
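For anyone who wants to reproduce such a list, here is a rough sketch of a scan that reports pairs of abutting exons by line number. It assumes a standard nine-column, tab-separated GFF3 file in which each exon carries a `Parent=` attribute. I have not checked that against the file in question, and `find_zero_length_introns` is a hypothetical helper name.

```python
import re
from collections import defaultdict


def find_zero_length_introns(gff3_lines):
    """Return (parent_id, line_a, line_b) for abutting exons of one parent.

    Assumes tab-separated GFF3 with 1-based inclusive coordinates and a
    Parent= attribute on every exon record (an assumption about the file).
    """
    exons = defaultdict(list)  # parent ID -> [(start, end, line number)]
    for line_no, line in enumerate(gff3_lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        match = re.search(r"(?:^|;)Parent=([^;]+)", cols[8])
        if match is None:
            continue
        # An exon may list several parents separated by commas.
        for parent in match.group(1).split(","):
            exons[parent].append((int(cols[3]), int(cols[4]), line_no))

    hits = []
    for parent, records in exons.items():
        records.sort()
        for (_, prev_end, prev_line), (next_start, _, next_line) in zip(
            records, records[1:]
        ):
            if next_start == prev_end + 1:  # no bases between the exons
                hits.append((parent, prev_line, next_line))
    return hits


# Coordinates from the example above; "chrN", "CAT" and "tx1" are placeholders.
sample = [
    "##gff3-version 3",
    "chrN\tCAT\ttranscript\t111903\t112896\t.\t-\t.\tID=tx1",
    "chrN\tCAT\texon\t111903\t112498\t.\t-\t.\tParent=tx1",
    "chrN\tCAT\texon\t112499\t112896\t.\t-\t.\tParent=tx1",
]
print(find_zero_length_introns(sample))  # [('tx1', 3, 4)]
```

To run it on a real file, pass an open file handle, for example `find_zero_length_introns(open(path))`, where `path` points to a local copy of the GFF3.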
Issue moved to CAT repo: https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/285
I was looking at the gene annotation files, in particular http://courtyard.gi.ucsc.edu/~mhauknes/T2T/t2t_Y/annotation_set/CHM13.v2.0.gff3. The file appears to contain multiple problems, mostly exons that touch each other, leaving introns of size 0 between them. I can send examples.