reversing of reverse strand features

MichaelFokinNZ commented 6 years ago

Hi Jason, maybe you can advise using something better than exonerate for protein2genome mapping, but... since I am using it - I've found a very strange behavior of your perl script, I can't explain.

pretty standard/recommended exonerate parameters exonerate --model protein2genome --percent 95 --showtargetgff yes --showalignment no --targetchunkid ${SLURM_ARRAY_TASK_ID} --targetchunktotal 16 --ryo ">%qi length=%ql alnlen=%qal\n>%ti length=%tl alnlen=%tal\n" --querytype protein --query proteins.fasta --targettype dna --target genomic.fna > exonerate_chunk${SLURM_ARRAY_TASK_ID}.output

the GFF part of output looks correct (checked that), but after using process_exonerate_gff3.pl perl process_exonerate_gff3.pl -t Protein exonerate_chunk8.output exonerate_chunk9.output > exonerate_test89.gff3

the reverse strand genes became reversed (as in example attached) and in Geneious they come to completely wrong positions. It is default/normal behaviour of the script?

MichaelFokinNZ commented 6 years ago

2jason.txt

MichaelFokinNZ commented 6 years ago

UPD. Of course :) nothing wrong with process_exonerate_gff3.pl - it seems to be all about how Geneious loads the (-) strand features...

hyphaltip commented 6 years ago

Okay. Glad it worked. Since this is just a deposit for my working scripts I don’t have unit tests for these but originally I did have test cases this was validated against.

Jason Stajich, PhD jasonstajich.phd@gmail.com On Aug 7, 2018, 5:03 PM -0700, MichaelFokinNZ notifications@github.com, wrote:

UPD. Of course :) nothing wrong with process_exonerate_gff3.pl - it seems to be all about how Geneious loads the (-) strand features... — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

MichaelFokinNZ commented 6 years ago

sorry, still not quite sure with gff (-) strand coordinates in the process_exonerate_gff3.pl output. could you please check in the file attached above, if it is a normal transformation of exonerate output by the script. the genes there are from the reference well-know cluster, so I can check that they were annotated correctly on gff being taken straight from exonerate, but not after the process_exonerate_gff3.pl concatenation/filtering.... :(

hyphaltip commented 6 years ago

I’ll try. Not much time this week to do this.

Jason Stajich, PhD jasonstajich.phd@gmail.com On Aug 7, 2018, 5:27 PM -0700, MichaelFokinNZ notifications@github.com, wrote:

sorry, still not quite sure with gff (-) strand coordinates. could you please check in the file attached above if it is a normal transformation of exonerate output by the script. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

MichaelFokinNZ commented 6 years ago

thank you, no rush. just really odd to have any issues with the script used millions of times... I will attach the exonerate files used as a test input. it is all public. exonerate_chunk8_output.txt exonerate_chunk9_output.txt

MichaelFokinNZ commented 6 years ago

sorry, any chance you had time to check it?

hyphaltip commented 6 years ago

No time. But remind me again what you need exactly - these are just Archive of things I use not intended to be production resources for anyone.

Jason Stajich, PhD jasonstajich.phd@gmail.com On Aug 19, 2018, 6:54 PM -0700, MichaelFokinNZ notifications@github.com, wrote:

sorry, any chance you had time to check it? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

MichaelFokinNZ commented 6 years ago

got it :) there two parts: (1) problem with the script - it technically reverses coordinates of features from the (-) strand - so misplacing them in a final gff. any guess why that can happen? (2) what are you using nowadays for protein2dna mapping, if not considering exonerate?

hyphaltip commented 6 years ago

Sorry I really dunno at this point - I wrote this 10 years ago or longer.

There are several versions of the parsers in that folder. I don't remember what the distinctions were for.

I do use exonerate to look at p2g or t2g alignments visually and I annotate with funannotate - I rely on the ryo I think for other aspects.

I write a quick and dirty targetted gene predictor here which uses it - it seems like it reads the location results natively from the ryo rather than parsing to gff3 but I don't quite recall. https://github.com/hyphaltip/genome-scripts/blob/5a986246c55467440e638740dddbb6bc8ec632aa/gene_prediction/tfastx2gene.pl

There was a point in the early iterations of exonerate where rev strand feature were unrolled in a weird way so that a DNA contig was basically represented like this in coordinates - I don't remember why. Maybe something I did to parse that was part of that schema.

--FWD-->   <---REV--

I really have little recollection. At this point there is GFF spit out from it so I am not sure what's the main use case.

I am sure ensembl has an exonerate parser as does funannotate so you can look at that code too or take from there.

MichaelFokinNZ commented 6 years ago

thank you! a lot of hints :) sure will be able to get around (already wrote a basic parser...)

MichaelFokinNZ commented 6 years ago

what helped - I've commented out the condition for (-) strand

if( $f->strand < 0 ) { my $s = $length - $f->end + 1; my $e = $length - $f->start +1; $f->start($s); $f->end ($e); }

PS. Other parsers, such as one from EVM (used in funannotate) don't produce "fully functional" gff.

MichaelFokinNZ commented 6 years ago

ha, another curious feature, was not easy to catch: for some reason for last protein in every chunk vulgar taken to gff is not it's own, but from the previous protein!

hyphaltip / genome-scripts

reversing of reverse strand features #1