molevol-ub / bitacora

BITACORA: A Bioinformatics tool for gene family annotation
GNU General Public License v3.0
42 stars 14 forks source link

Errors in the start and end positions of some sequences in the output gff file cause frameshift mutation #15

Open fangrunzhao opened 1 month ago

fangrunzhao commented 1 month ago

Hi Vizueta,

Thank you for developing a favorable tool for gene family annotation, but now I have some problems.

Some sequences of this *_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3 file will have the wrong open reading frame.

Here is part of this *_genomic_and_annotated_proteins_trimmed_idseqsclustered.gff3 file:

_chr001 AnnotGFF gene 12515613 12515776 . - . ID=_chr001.341;blastphmmer;annot;Pos:1-54 _chr001 AnnotGFF mRNA 12515613 12515776 . - . ID=_chr001.341.1;Parent=_chr001.341;blastphmmer;annot;Pos:1-54 _chr001 AnnotGFF CDS 12515613 12515773 . - 2 ID=_chr001.341.1_cds_1;Parent=_chr001.341.1;blastphmmer;annot;Pos:1-54

The following is part of the original genome gff file (the input gff file needed to run):

_chr001 . gene 12515609 12515775 . - . ID=_chr001.341 _chr001 . mRNA 12515609 12515775 . - . ID=_chr001.341.1;Parent=_chr001.341 _chr001 . exon 12515609 12515775 . - . ID=_chr001.341.1_exon_1;Parent=_chr00> _chr001 . CDS 12515609 12515775 . - 2 ID=_chr001.341.1_cds_1;Parent=*_chr001>

You can see that these two identical sequences have different start and end positions. This will result in errors when extracting proteins or CDS using gffread, such as frameshift mutation. I think this is caused by the program after trimming the sequence, is this a bug please?

Best, Run Zhao Fang

Vizueta commented 1 month ago

Hi again,

I need more insights to see if it is a bug or a problem with your input gff, which might be corrected by BITACORA and thus the difference in the positions.

Can you check that the predicted protein sequence (in the fasta file), is the same or different from the protein in your original annotation?

Also, using the new BITACORA GFF3 file, have you tried to predict the protein sequence using "gffread", if yes, does the resulting protein have the incorrect reading frame resulting in stop codons in the sequence?

Finally, I use the following script to extract protein sequences from the gff perl Scripts/Tools/gff2fasta_v3.pl genome.fasta anotation.gff OutPrefix

Could you see if the correct reading frame is extracted from your input gff and BITACORA gff file?

Best, Joel

fangrunzhao commented 1 month ago

Hi Vizueta,

I checked the predicted protein sequence and some of the protein sequence output by BITACORA is different from the original annotated sequence. Specifically, some of the protein sequences are shorter than the original protein sequences, perhaps trimmed? such as,

predicted fasta file

rna-XM_042113013.1 QSVLGEKGRRIRELTSVVQKRFNIPENSVDLYAEKVATRGLCAIAQAESLRYKLIGGLAVRRACYGVLRFIMESGARGCEVVVSGKLRGQRAKSMKFVDGLMIHSGDPCNDYVNTATRHVLLRQGVLGIKVKIMLPWDQQGKNGPKKPQPDHILVTEPKDEPVPLEPTSDVRSIAPVPIPQAVPAVA original annotation fasta file rna-XM_042113013.1 MAVNNISKKRKFVGDGVFKAELNEFLTRELAEDGYSGVEVRVTPTKSEIIIMATRTQSVLGEKGRRIRELTSVVQKRFNIPENSVDLYAEKVATRGLCAIAQAESLRYKLIGGLAVRRACYGVLRFIMESGARGCEVVVSGKLRGQRAKSMKFVDGLMIHSGDPCNDYVNTATRHVLLRQGVLGIKVKIMLPWDQQGKNGPKKPQPDHILVTEPKDEPVPLEPTSDVRSIAPVPIPQAVPAVA

I used gffread to extract proteins and CDS from BITACORA GFF file and this extracts the sequences successfully. However, some of the protein sequences have incorrect reading frame resulting in stop codons within the sequence. I looked at the extracted CDS sequences and some of the CDS sequences are not multiples of 3 in length. Therefore, I compared the BITACORA gff file with the original GFFfile and as mentioned earlier the errors.

I have also used the script you mentioned to extract protein sequences and CDS sequences, which likewise will have stop codons in the protein sequences. However, the protein sequences extracted using the script and gffread do not have the same error sequences.

Best, Run Zhao Fang

fangrunzhao commented 1 month ago

Here's the data I used for testing. my_test_data.zip