nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Command to generate snap-predictions.gff3? #969

Closed nam-hoang closed 8 months ago

nam-hoang commented 8 months ago

Hi Funannotate Team, Thanks for an amazing tool.

I am running funannotate v1.8.15 installed from mamba, and everything looks great except for one issue: I can't get SNAP prediction to work properly within the pipeline. I keep getting "0 predictions from SNAP". A snap-predictions.gff3 file is created, but it is empty apart from a single header line, ##gff-version 3.

I already tried the suggestion in #386 of copying forge from the GitHub version into my conda environment, but it did not solve the problem. So I trained SNAP outside the pipeline and copied the resulting snap-predictions.gff3 into the predict_misc/ folder. When I ran funannotate, the pipeline did recognize the gff3 file, i.e., "Existing snap predictions found /predict_misc/snap-predictions.gff3". But again I still ended up with "0 predictions from SNAP". I generated this snap-predictions.gff3 by running snap -gff snap-trained.hmm genome.softmasked.fa; the file is ~37 MB in size and looks like this:

ntLink_0 SNAP Einit 355 383 7.591 - . ntLink_0-snap.1
ntLink_0 SNAP Exon 259 331 5.518 - . ntLink_0-snap.1
ntLink_0 SNAP Exon 18 164 9.903 - . ntLink_0-snap.1
ntLink_0 SNAP Einit 1856 1976 2.088 - . ntLink_0-snap.2
ntLink_0 SNAP Exon 1667 1766 18.649 - . ntLink_0-snap.2
ntLink_0 SNAP Exon 1456 1577 10.216 - . ntLink_0-snap.2
ntLink_0 SNAP Exon 1282 1366 12.706 - . ntLink_0-snap.2
ntLink_0 SNAP Exon 1048 1207 15.412 - . ntLink_0-snap.2
ntLink_0 SNAP Exon 768 907 11.613 - . ntLink_0-snap.2

In the log file, I only found the command snap snap-trained.hmm genome.softmasked.fa, without the -gff option (this is also mentioned here: https://github.com/nextgenusfs/funannotate/issues/386#issuecomment-591629954), so I wonder if what I did was correct.

Could you please advise me if this gff3 file looks as expected? Or how to get the right GFF3 format for the pipeline?

Thank you very much. Best regards, Nam

hyphaltip commented 8 months ago

It needs to be GFF3. This is generated by a script within funannotate from the ZFF output that comes from SNAP.

@nextgenusfs can you remember where that conversion happens? I cannot find a suitable funannotate util that would serve this purpose. Could one instead just copy the .zff file into the predict_misc folder? I think that might trigger regeneration of the gff3 without trying to run the snap step, but I don't quite know.

here's a working snap-predictions.gff3:

##gff-version 3
scaffold_1  snap    gene    2772    4405    .   +   .   ID=scaffold_1-snap.1;
scaffold_1  snap    mRNA    2772    4405    .   +   .   ID=scaffold_1-snap.1-T1;Parent=scaffold_1-snap.1;product=[];
scaffold_1  snap    exon    2772    2780    .   +   .   ID=scaffold_1-snap.1-T1.exon1;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    exon    2865    2913    .   +   .   ID=scaffold_1-snap.1-T1.exon2;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    exon    3000    3068    .   +   .   ID=scaffold_1-snap.1-T1.exon3;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    exon    3971    4036    .   +   .   ID=scaffold_1-snap.1-T1.exon4;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    exon    4359    4405    .   +   .   ID=scaffold_1-snap.1-T1.exon5;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    CDS 2772    2780    .   +   0   ID=scaffold_1-snap.1-T1.cds;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    CDS 2865    2913    .   +   0   ID=scaffold_1-snap.1-T1.cds;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    CDS 3000    3068    .   +   2   ID=scaffold_1-snap.1-T1.cds;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    CDS 3971    4036    .   +   2   ID=scaffold_1-snap.1-T1.cds;Parent=scaffold_1-snap.1-T1;
scaffold_1  snap    CDS 4359    4405    .   +   2   ID=scaffold_1-snap.1-T1.cds;Parent=scaffold_1-snap.1-T1;
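
The difference between the two files in this thread is structural: `snap -gff` emits one flat record per exon (Einit/Exon/Eterm/Esngl) with the model name in the last column, while the working file nests gene, mRNA, exon and CDS features with ID/Parent attributes. The following is a minimal Python sketch of that restructuring, not funannotate's internal converter. The function name `snap_gff_to_gff3` is hypothetical, and the sketch assumes every model is complete (so the first CDS starts at phase 0), that exons are numbered in transcription order, and it omits the `product=[];` attribute seen on the mRNA line above.

```python
from collections import defaultdict


def snap_gff_to_gff3(lines):
    """Group SNAP `-gff` exon records into gene/mRNA/exon/CDS GFF3 lines.

    Illustrative only: assumes complete gene models (first CDS phase 0).
    """
    models = defaultdict(list)  # model name -> [(seqid, start, end, strand)]
    for line in lines:
        fields = line.split()
        if len(fields) < 9 or line.startswith("#"):
            continue
        seqid, _src, _kind, start, end, _score, strand, _phase, group = fields[:9]
        lo, hi = sorted((int(start), int(end)))
        models[group].append((seqid, lo, hi, strand))

    out = ["##gff-version 3"]
    for gene_id, exons in models.items():
        seqid, strand = exons[0][0], exons[0][3]
        # Transcription order: ascending on '+', descending on '-'.
        exons.sort(key=lambda e: e[1], reverse=(strand == "-"))
        gene_start = min(e[1] for e in exons)
        gene_end = max(e[2] for e in exons)
        mrna_id = f"{gene_id}-T1"

        def row(ftype, s, e, phase, attrs):
            return "\t".join(
                [seqid, "snap", ftype, str(s), str(e), ".", strand, phase, attrs]
            )

        out.append(row("gene", gene_start, gene_end, ".", f"ID={gene_id};"))
        out.append(
            row("mRNA", gene_start, gene_end, ".", f"ID={mrna_id};Parent={gene_id};")
        )
        cds_rows = []
        coding_len = 0  # bases of CDS already emitted, in transcription order
        for n, (_, s, e, _) in enumerate(exons, start=1):
            out.append(
                row("exon", s, e, ".", f"ID={mrna_id}.exon{n};Parent={mrna_id};")
            )
            # GFF3 phase: bases to skip to reach the next full codon.
            phase = (3 - coding_len % 3) % 3
            cds_rows.append(
                row("CDS", s, e, str(phase), f"ID={mrna_id}.cds;Parent={mrna_id};")
            )
            coding_len += e - s + 1
        out.extend(cds_rows)
    return out


if __name__ == "__main__":
    sample = [
        "ntLink_0 SNAP Einit 355 383 7.591 - . ntLink_0-snap.1",
        "ntLink_0 SNAP Exon 259 331 5.518 - . ntLink_0-snap.1",
        "ntLink_0 SNAP Exon 18 164 9.903 - . ntLink_0-snap.1",
    ]
    print("\n".join(snap_gff_to_gff3(sample)))
```

Applied to the five-exon scaffold_1 model above, the same phase rule gives 0, 0, 2, 2, 2, which matches the phase column of the working file. Partial models would need their starting phase taken from elsewhere, which this sketch does not attempt.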

nam-hoang commented 8 months ago

Hi Jason (@hyphaltip), thank you so much for your suggestions. I copied the file snap-predictions.zff to the predict_misc folder; however, funannotate seems to only check whether snap-predictions.gff3 is there. Otherwise, it starts SNAP training and overwrites all the files.

So, to get a snap-predictions.gff3 that can be recognized by EVM at this step, I found two Perl scripts from EVM that can convert the SNAP GFF format to the EVM GFF3 format: https://github.com/EVidenceModeler/EVidenceModeler/tree/master/EvmUtils/misc

(1) using SNAP_CDS_to_gff3.pl
./zff2gff3.pl snap-predictions.zff > snap-predictions_CDS.gff3
./SNAP_CDS_to_gff3.pl snap-predictions_CDS.gff3 > snap-predictions_CDSformat_4EVM.gff3

(2) using SNAP_ExonEtermEinitEsngl_gff_to_gff3.pl
./snap -gff snap-trained.hmm genome.softmasked.fa > snap-predictions_SNAPformat.gff3
./SNAP_ExonEtermEinitEsngl_gff_to_gff3.pl snap-predictions_SNAPformat.gff3 > snap-predictions_SNAPformat_4EVM.gff3

After converting, I copied either of these two files to the predict_misc folder, renamed it to snap-predictions.gff3, and it works. Let me know what you think about this. Thank you very much.

Best regards, Nam
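
Before copying a converted file into predict_misc, it can help to confirm that it actually contains gene-model features rather than raw SNAP exon records. The sketch below simply tallies the third GFF column. The helper names are hypothetical, and the assumption that a file needs gene, mRNA and CDS features to be counted is inferred from the working example earlier in this thread, not from funannotate's source.

```python
from collections import Counter


def feature_type_counts(lines):
    """Count the values in the feature-type (third) column of GFF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and ## directives / comments
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def looks_like_gene_model_gff3(lines):
    """True if gene, mRNA and CDS features are all present (assumed minimum)."""
    counts = feature_type_counts(lines)
    return all(counts[t] > 0 for t in ("gene", "mRNA", "CDS"))


if __name__ == "__main__":
    raw_snap = [
        "ntLink_0 SNAP Einit 355 383 7.591 - . ntLink_0-snap.1",
        "ntLink_0 SNAP Exon 259 331 5.518 - . ntLink_0-snap.1",
    ]
    print(dict(feature_type_counts(raw_snap)))
    print(looks_like_gene_model_gff3(raw_snap))
```

A file that reports only Einit/Exon/Eterm/Esngl types, like the original `snap -gff` output in this thread, still needs one of the conversions described above.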

hyphaltip commented 8 months ago

Great. I don't remember the internal steps for converting ZFF to GFF3 within funannotate, though there is Python code to do it, but I am glad you have this fixed. I will see if we can expose this conversion step as a util option in funannotate in the future.