About extracting CDS and PEP from GFF files obtained from TOGA

SWei2333 commented 3 months ago

Hi, recently I've been using annotation files for some species obtained via the TOGA method from Zoonomia. When I extracted CDS and PEP from these annotation files and genomes with itools, I found that more than half of the CDS sequences had lengths that were not multiples of 3. Inspecting these sequences, I found that about half of them had incomplete stop codons, with only one or two bases remaining. The other half ended with TGA but were still not multiples of 3. I have not yet found the reason for this.

Do you have any suggestions for this issue? Thank you very much.

Best wishes
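
The check described in this comment can be sketched in a few lines of Python. This is a minimal sketch, not the itools workflow: the category labels and the helper names `classify_cds`, `read_fasta`, and `tally` are made up for illustration, and the script only looks at sequence length and the last three bases.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def classify_cds(seq):
    """Label one CDS by frame status (labels are illustrative only)."""
    seq = seq.upper()
    if len(seq) % 3 == 0:
        return "in_frame"
    # Length is off, but the last three bases still spell a stop codon.
    if seq[-3:] in STOP_CODONS:
        return "stop_but_out_of_frame"
    return "out_of_frame"

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def tally(lines):
    """Count how many CDS records fall into each category."""
    return Counter(classify_cds(seq) for _, seq in read_fasta(lines))
```

To run it on an extracted CDS file (the file name `cds.fa` is a placeholder), pass an open file handle: `with open("cds.fa") as fh: print(tally(fh))`.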

MichaelHiller commented 3 months ago

Can you pls extract only projections (transcript.gene.chainID) that are classified as intact or partially intact? This info is in loss_summ_data.tsv.gz (https://genome.senckenberg.de/download/TOGA/README.txt).

Other projections have frameshifts, stop codons, or missing sequence, and these may not translate.
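
A minimal sketch of this filtering step in Python follows. It assumes that loss_summ_data.tsv.gz is tab-separated with three columns (level, identifier, class), that projection rows are labelled `PROJECTION`, and that intact and partially intact are coded `I` and `PI`. Only the `I` code is confirmed in this thread, so check the rest against the README before relying on it. The function names are hypothetical.

```python
import gzip

def intact_projection_ids(lines, keep=("I", "PI")):
    """Return the set of projection IDs whose class is in `keep`.

    Assumed row layout (not confirmed here): level <TAB> id <TAB> class.
    """
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        level, ident, cls = fields[:3]
        if level == "PROJECTION" and cls in keep:
            ids.add(ident)
    return ids

def load_intact(path):
    """Read a gzipped loss summary file and return the kept projection IDs."""
    with gzip.open(path, "rt") as fh:
        return intact_projection_ids(fh)
```

The resulting set can then be used to keep only the matching transcript IDs when extracting CDS and protein sequences.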

SWei2333 commented 3 months ago

Hi, I compared the genes whose CDS lengths are not multiples of 3 against the loss_summ_data.tsv.gz file and found that a significant portion of those marked "I" still have CDS sequences that are not multiples of 3, primarily due to incomplete stop codons.

MichaelHiller commented 3 months ago

Can you send an example? We also have all the data in the genome browser (http://genome.senckenberg.de/). Pls type a projection into the browser of the query species and click on the TOGA annotation. This shows the protein alignment and a list of inactivating mutations.

Send me one, I can have a look as well

SWei2333 commented 3 months ago

Thank you very much for your suggestions. While organizing my files, I suddenly realized a potential issue. The GTF files I downloaded from Zoonomia contain separate lines for start_codon and stop_codon. However, when converting them to GFF format using gffread, an error occurred, causing some bases of the stop codons to not be included in the CDS regions, resulting in many sequences not being multiples of 3. Thank you very much for your help. I will correct this error and try again. If the problem persists, I will seek your advice again. Thank you very much.
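
One way to picture the fix is to fold the stop_codon intervals back into the CDS intervals of each transcript before extracting sequence. The sketch below is an illustration under that assumption, not a description of what gffread does. It works on 1-based inclusive (start, end) pairs for a single transcript, and the function name `merge_cds_with_stop` is hypothetical.

```python
def merge_cds_with_stop(cds, stop):
    """Merge CDS and stop_codon intervals of one transcript.

    Intervals are 1-based inclusive (start, end) tuples. Overlapping or
    directly adjacent intervals are joined; a stop codon piece that sits
    in its own exon stays a separate interval.
    """
    merged = []
    for start, end in sorted(list(cds) + list(stop)):
        if merged and start <= merged[-1][1] + 1:
            # Touches or overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_length(intervals):
    """Sum of interval lengths, useful for a multiple-of-3 check."""
    return sum(end - start + 1 for start, end in intervals)
```

For example, a CDS of (100, 198) and (300, 350) with a stop codon at (351, 353) becomes (100, 198) and (300, 353), and the total length goes from 150 to 153.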

SWei2333 commented 3 months ago

Hi, I recently downloaded the genomes and corresponding TOGA.gtf files for several species from Zoonomia. I used gffread to extract the protein sequences (pep) from the GTF files and ran BUSCO on them. The scores were generally around 80%. Could the issue be with the way I extracted the protein sequences, or does the inherent quality of the genomes limit the GTF files to this level? Below is one result.

Thank you very much.

MichaelHiller commented 3 months ago

For which assembly are these stats? And how does that compare to the BUSCO stats we have in the supplement tables?

SWei2333 commented 3 months ago

I downloaded the GTF file for Dinomys branickii using the mouse as the reference. In your table, the annotation of this species with the human as the reference has the following BUSCO scores: Complete (S + D) 9,025 (97.82%), Fragmented 86, Missing 115.

MichaelHiller commented 2 months ago

This is the HLdinBra1 DNAzoo assembly, right? Of course, mouse or human as the reference will make a difference, but that can't explain 97.8% vs. 76% completeness.

Can you pls download the protein alignment file, extract only the QUERY sequences, and run BUSCO on them? The file is at https://genome.senckenberg.de/download/TOGA/human_hg38_reference/Rodentia/Dinomys_branickii__pacarana__HLdinBra1/proteinAlignments.fa.gz and this should reproduce the BUSCO values we report.
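
A minimal Python sketch of the extraction step follows. It assumes each FASTA header ends with either `REFERENCE` or `QUERY`, with fields separated by `|` (for example `name | PROT | QUERY`), and that alignment gaps are written as `-`. That header layout is an assumption and should be checked against the actual file. The function names are hypothetical.

```python
import gzip

def query_proteins(lines):
    """Collect ungapped QUERY protein sequences from alignment FASTA lines.

    Assumes headers end with 'QUERY' or 'REFERENCE' and that the first
    '|'-separated field is the projection name.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if header.endswith("QUERY"):
                name = header.split("|")[0].strip()
                records[name] = []
            else:
                name = None  # skip sequence lines of REFERENCE records
        elif name is not None:
            # Drop alignment gap characters to get the plain protein.
            records[name].append(line.replace("-", ""))
    return {key: "".join(parts) for key, parts in records.items()}

def write_query_fasta(in_path, out_path):
    """Read a gzipped alignment FASTA and write QUERY proteins as FASTA."""
    with gzip.open(in_path, "rt") as fh:
        proteins = query_proteins(fh)
    with open(out_path, "w") as out:
        for name, seq in proteins.items():
            out.write(f">{name}\n{seq}\n")
```

Calling `write_query_fasta("proteinAlignments.fa.gz", "query_proteins.fa")` would produce a protein FASTA to pass to BUSCO in protein mode; the output file name is a placeholder.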