marbl / CHM13

The complete sequence of a human genome

cannot get accurate protein sequences from the gff file #37

Open ATPs opened 2 years ago

ATPs commented 2 years ago

I tried to extract the CDS sequences from the gff file:

gffread -g chm13.draft_v1.1.fasta -x cds.fa chm13.draft_v1.1.gene_annotation.v4.gff3

However, when I translate the CDS sequences to proteins, the open reading frame is incorrect for quite a few of them. Is there a way to download the predicted protein sequences?
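To make "the open reading frame is not correct" concrete, here is a minimal Python sketch of the kind of check that would flag such sequences in a file like cds.fa. It assumes the standard genetic code (ATG start; TAA, TAG, TGA stops), which will not hold for mitochondrial genes. The function names `read_fasta` and `orf_problems` are hypothetical helpers written for this illustration, not part of gffread.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard nuclear genetic code


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def orf_problems(cds):
    """Return the reasons a CDS is not a complete ORF (empty list if none)."""
    seq = cds.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    # Split into complete codons, ignoring any trailing 1-2 bases.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_stop_codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems


if __name__ == "__main__":
    example = [">tx1 complete", "ATGAAATAA", ">tx2 fragment", "AAATAA"]
    for name, seq in read_fasta(example):
        print(name, orf_problems(seq) or "ok")
```

Running the same loop over the lines of cds.fa would list which transcripts fail and why.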

mhaukness-ucsc commented 2 years ago

Hi @ATPs ,

I created a file with the predicted protein sequences that you can use: http://courtyard.gi.ucsc.edu/~mhauknes/T2T/chm13.draft_v1.1.gene_annotation.protein.fasta

mhaukness-ucsc commented 2 years ago

These incorrect open reading frames are expected from the GENCODE annotation; they are not errors. For example, many GENCODE transcripts carry tags such as cds_end_NF and cds_start_NF, which mark annotated fragments (probably derived from ESTs) that lack sufficient evidence. These are propagated down into our gene annotations. If you want to include only transcripts with full, proper ORFs, you can ignore any transcript with the tag proper_orf=False in the gff3.
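One way to apply that filter is sketched below in Python. It rests on assumptions not confirmed in this thread: that the tag appears literally as `proper_orf=False` in column 9 of the transcript line, and that child features (exon, CDS) point to their transcript through a `Parent` attribute. The function names are hypothetical.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def filter_proper_orf(lines):
    """Drop features tagged proper_orf=False and their direct children."""
    lines = list(lines)

    def attributes_of(line):
        # Comments, blank lines, and malformed rows have no attributes.
        if line.startswith("#") or not line.strip():
            return None
        cols = line.rstrip("\n").split("\t")
        return parse_attributes(cols[8]) if len(cols) == 9 else None

    # Pass 1: collect the IDs of features tagged proper_orf=False.
    rejected = set()
    for line in lines:
        attrs = attributes_of(line)
        if attrs and attrs.get("proper_orf") == "False" and "ID" in attrs:
            rejected.add(attrs["ID"])

    # Pass 2: keep everything except tagged features and their children.
    kept = []
    for line in lines:
        attrs = attributes_of(line)
        if attrs is None:
            kept.append(line)
            continue
        parents = set(attrs.get("Parent", "").split(","))
        if attrs.get("proper_orf") == "False" or parents & rejected:
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    with open("chm13.draft_v1.1.gene_annotation.v4.gff3") as src:
        filtered = filter_proper_orf(src)
    with open("proper_orf_only.gff3", "w") as dst:  # hypothetical output name
        dst.writelines(filtered)
```

The filtered file can then be passed to gffread as before; its `-y` option writes translated protein sequences instead of the CDS. Note that this sketch leaves a gene line in place even when all of its transcripts were removed, and it only follows one level of `Parent` links.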