jorvis / biocode

Bioinformatics code libraries and scripts
MIT License

Update convert_tRNAScanSE_to_gff3.pl #68

Closed by pgonzale60 3 years ago

pgonzale60 commented 3 years ago

Bedtools v2.29.2 did not recognize this gff3. Using cat -t showed stray white spaces next to the tab that follows a coordinate. Adding this line resulted in a gff3 that bedtools processed successfully. However, I'm not sure whether removing all white spaces could have side effects (e.g. for contig names that do contain white spaces).

jorvis commented 3 years ago

Thanks for this report. As you said, I think stripping spaces from the entire line will certainly break some inputs. Why not just strip the columns in question after the split command instead?
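A minimal Python sketch of that suggestion, for illustration only (the script under discussion is Perl, and this is not its actual code). The function name `strip_gff3_columns` and the choice of zero-based columns 2, 3 and 4 (type, start, end) are assumptions; the sample line is made up. Whitespace is stripped only from those columns after splitting on tabs, so a contig name containing spaces is left alone.

```python
def strip_gff3_columns(line, columns=(2, 3, 4)):
    """Strip surrounding whitespace from selected tab-separated columns.

    `columns` holds zero-based indices; 2, 3 and 4 are the GFF3 type,
    start and end columns (an assumption about which ones need cleaning).
    """
    stripped = line.rstrip("\n")
    # Leave blank lines, comments and directives unchanged.
    if not stripped or stripped.startswith("#"):
        return stripped
    fields = stripped.split("\t")
    for i in columns:
        if i < len(fields):
            fields[i] = fields[i].strip()
    return "\t".join(fields)


# Made-up example: padded coordinates, and a contig name containing a space.
raw = "contig 1\ttRNAscan-SE\ttRNA\t 100\t172 \t.\t+\t.\tID=tRNA.1"
cleaned = strip_gff3_columns(raw)
print(cleaned)
```

Only the listed columns are touched, so the space inside `contig 1` survives while the padding around `100` and `172` is removed.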

pgonzale60 commented 3 years ago

I agree. In addition to modifying the coordinate columns, I'm also modifying the feature column. In particular, I'm removing the exon feature as specified here. I will open a new pull request after testing it with another genome.

jorvis commented 3 years ago

OK, but the exon should only be removed for non-coding RNA features as that link shows. For regular mRNAs there certainly should still be an exon.

And the link you gave was from NCBI, who only very recently started supporting GFF3. It doesn't fully match the GFF3 spec but because it's NCBI I suspect we'll all have to switch to how they're encoding things vs. how they were actually specified to be. Some scripts here in biocode won't work with some aspects of NCBI's GFF3 annotations quite yet.
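The point above about dropping exons only under non-coding RNA features can be sketched in Python as follows. This is an illustrative sketch, not biocode's implementation: the names `NONCODING_TYPES`, `parse_attributes` and `drop_noncoding_exons` are hypothetical, the set of non-coding types is an assumption, and the example rows are invented. It relies only on the standard GFF3 `ID` and `Parent` attributes in column 9.

```python
# Assumption: which parent feature types count as non-coding RNA.
NONCODING_TYPES = {"tRNA", "rRNA", "ncRNA"}


def parse_attributes(column9):
    """Parse a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def drop_noncoding_exons(lines):
    """Remove exon rows whose Parent is a non-coding RNA; keep all others."""
    rows = [line.rstrip("\n").split("\t") for line in lines]

    # First pass: remember the feature type of every row that has an ID.
    type_by_id = {}
    for row in rows:
        if len(row) == 9:
            feature_id = parse_attributes(row[8]).get("ID")
            if feature_id:
                type_by_id[feature_id] = row[2]

    # Second pass: skip exons attached to a non-coding parent.
    kept = []
    for row in rows:
        if len(row) == 9 and row[2] == "exon":
            parents = parse_attributes(row[8]).get("Parent", "").split(",")
            if any(type_by_id.get(p) in NONCODING_TYPES for p in parents):
                continue
        kept.append("\t".join(row))
    return kept


# Invented example rows: one tRNA with an exon, one mRNA with an exon.
example = [
    "##gff-version 3",
    "c1\tsrc\ttRNA\t100\t172\t.\t+\t.\tID=tRNA.1",
    "c1\tsrc\texon\t100\t172\t.\t+\t.\tID=exon.1;Parent=tRNA.1",
    "c1\tsrc\tmRNA\t500\t900\t.\t+\t.\tID=mRNA.1",
    "c1\tsrc\texon\t500\t900\t.\t+\t.\tID=exon.2;Parent=mRNA.1",
]
filtered = drop_noncoding_exons(example)
print("\n".join(filtered))
```

Looking up the parent's type, rather than deleting every exon row, is what keeps the exon under the mRNA in place.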