nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

`predict` arguments for mammal genome #403

Closed devonorourke closed 3 years ago

devonorourke commented 4 years ago

In the tutorial documentation for higher eukaryotes there are a few suggestions about additional parameters that ought to be passed:

My script for a mammal genome invoked each of these, plus the --optimize_augustus parameter to optimize the Augustus model, as follows:

funannotate predict -o funR1 -s "Myotis lucifugus" --cpus 20 -i /path/to/genome.fasta \
--optimize_augustus --max_intronlen 50000 --organism other --repeats2evm --busco_db mammalia

I started running this command but was curious about whether or not I needed to also invoke the --augustus_species parameter. I bring this up because I failed to invoke the --busco_seed_species parameter, and it defaulted to the fungus (anidulans), just as the command documentation says it would. This got me wondering about two things:

  1. Why would the busco_db and busco_seed_species parameters be separated? Are there lists of seed species beyond the databases themselves?

  2. What happens when you don't specify the --augustus_species argument? What does it default to? In my case, I'm guessing I'll be selecting human (and from previous evals I know that there is only a small difference between models trained on my own data vs. human). But, given this list of Augustus species that Funannotate knows about, what does the program do when you don't select any option?

Thanks Jon!

nextgenusfs commented 4 years ago

Since you ran the training step with RNA-seq, as long as you specified the same output directory as you did for funannotate train, it will use the PASA results for training and won't run BUSCO. The --busco_seed_species option lets you start the initial BUSCO run using that Augustus species' parameters as a starting point; the pretrained options are viewable with the funannotate species command. If not specified, it will use the anidulans set for the initial predictions in BUSCO.
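For example, roughly (the seed species shown is just an illustration, and only matters when BUSCO-mediated training actually runs):

# list the pretrained Augustus parameter sets funannotate knows about
funannotate species
# for a genome without RNA-seq, pick one of them as the BUSCO seed, e.g.:
funannotate predict ... --busco_db mammalia --busco_seed_species human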

nextgenusfs commented 4 years ago

You may also want to add --ploidy 2 to the command; in practice this won't have a huge effect, but it will keep a few more protein alignments and will allow duplicate BUSCO results if you were using those for training, i.e. if you didn't have RNA-seq.

How many gene models did funannotate train end up with? Does it seem reasonable?

devonorourke commented 4 years ago

Thanks Jon, I did specify the same output directory for both funannotate train and funannotate predict, yet when I started up predict without --busco_seed_species or --augustus_species, a new directory "predict_misc" was created with the default fungus, anidulans, selected. Maybe that wasn't going to be used anyway? For runs like mine that started with RNAseq data, is either of these parameters necessary (i.e. for Augustus or BUSCO)? In other words:

wrong

funannotate predict -o funR1 -s "Myotis lucifugus" --cpus 20 -i /path/to/genome.fasta \
--max_intronlen 50000 --organism other --repeats2evm --ploidy 2 \
--optimize_augustus --augustus_species human \
--busco_db mammalia --busco_seed_species mammalia 

correct

funannotate predict -o funR1 -s "Myotis lucifugus" --cpus 20 -i /path/to/genome.fasta \
--max_intronlen 50000 --organism other --repeats2evm --ploidy 2
devonorourke commented 4 years ago

Which file and/or directory would contain the answer to the number of gene models funannotate train produced? Is it the transcripts.fa file within .../training? That would be 31,575 sequences. Likewise, there are 31,575 non-header lines in the kallisto.tsv file.

nextgenusfs commented 4 years ago

Oh, you can just look at the logfile for training results. But 35k seems pretty decent from the RNA-seq assemblies. How many genes are you expecting in the genome?

You only specify an --augustus_species option if you want to reuse a training set. The default is to take the --species name and use it as a training name. There is no harm in specifying the --busco_seed_species.

devonorourke commented 4 years ago

I'm guessing the answer then is 19,604? log file excerpt:

[01:01 PM]: 80,788 transcripts derived from Trinity
[01:01 PM]: Running StringTie on Hisat2 coordsorted BAM
[01:01 PM]: Removing poly-A sequences from trinity transcripts using seqclean
[01:03 PM]: Converting transcript alignments to GFF3 format
[01:03 PM]: Converting Trinity transcript alignments to GFF3 format
[01:03 PM]: Running PASA alignment step using 80,782 transcripts
[07:49 PM]: PASA assigned 76,098 transcripts to 65,424 loci (genes)
[07:49 PM]: Getting PASA models for training with TransDecoder
[08:00 PM]: PASA finished. PASAweb accessible via: localhost:port/cgi-bin/index.cgi?db=/scratch/dro49/myluwork/annotation/fun1/funR1/training/pasa/Myotis_lucifugus
[08:00 PM]: Using Kallisto TPM data to determine which PASA gene models to select at each locus
[08:00 PM]: Building Kallisto index
[08:03 PM]: Mapping reads using pseudoalignment in Kallisto
[08:08 PM]: Parsing expression value results. Keeping best transcript at each locus.
[08:12 PM]: Wrote 19,604 PASA gene models
[08:12 PM]: PASA database name: Myotis_lucifugus
[08:12 PM]: Trinity/PASA has completed, you are now ready to run funanotate predict, for example:

With nearly 20k gene structures, this will likely take about 2-3 days in Augustus when the --optimize_augustus argument is invoked (based on earlier tests via Maker). The Augustus documentation suggests I should be using far fewer - somewhere between 100 and 500 gene models (click the plus sign next to "Number and quality of gene structures" if this isn't something you've had to deal with before). I noticed that you have a --min_training_models option within funannotate predict, but I wasn't sure how you deal with having too many gene models. Can I further refine this set? Any advice on what parameters to choose? My suggestions would likely fall into things you've already done internally:

  1. gene models with start and stop codons
  2. gene models with multiple exons
  3. discard gene models with intron lengths outside of some boundary (for mammals, say, discard anything with introns shorter than 50 bp or longer than 500,000 bp)

If all of those conditions have been met, then maybe just randomly sample from your most contiguous scaffolds, say, anything bigger than 5-10 Mb. For my genome, there are chromosome-length scaffolds for the first 22, then length drops off dramatically. I could see subsetting gene models so that I get around 20-30 gene models from each scaffold. Subsetting across scaffolds might be overkill. Probably overkill.

nextgenusfs commented 4 years ago

@devonorourke don't worry -- funannotate predict will filter and sample those models appropriately for training, but yes, it will use a subset. In practice I don't use the --optimize_augustus option because I find that it does very little to improve predictions; hopefully it works better in your case. The filtering steps relate to the number of exons/introns, ensuring evidence exists for the intron/exon boundaries, and making sure no model is more than 80% identical to another model in the training data.

Looks like SQLite/PASA took about 6 hours; that's not too bad I guess. And ~20K is a good number given how I run PASA here -- it is fairly stringent, because at this point we are really trying to keep only very high quality, evidence-backed gene models. If we do a good job on training, then the results should fall into place nicely (at least that is the idea).

devonorourke commented 4 years ago

Ok. I suppose the point here is twofold:

  1. If I do run --optimize_augustus, I really should be down-sampling the number of gene models, because over-training can lead to worse results.
  2. It's likely unnecessary to run --optimize_augustus because my results via Maker (and BUSCO evals of those subsequent predicted transcripts) show that there is no difference in BUSCO scores (run in 'transcriptome' mode using mammalia odb 9 and 10) between a bat-optimized Augustus model and a human model. I bet there is some minor difference, but perhaps it's not worth the effort.

I'm with you - my guess is that the effort on the front end of selecting really good gene models has a far greater impact on Augustus (and other) model training and subsequent prediction than trying to optimize Augustus does. At least, that's my hope!

Just to confirm then, for my particular run, I'm a bit confused as to how to interpret your comment:

You only specify an --augustus_species option if you want to reuse a training set. The default is to take the --species name and use it as a training name. There is no harm in specifying the --busco_seed_species

  1. Do I not need to specify human here? Is Funannotate finding a particular training set within the same directory where I ran funannotate train? My concern is that the species name I supplied in the funannotate train command, "Myotis lucifugus", is not among those listed in Funannotate's species file.

  2. If I've already run funannotate train with RNAseq data, my understanding is that I should not pass either of the BUSCO arguments:

    Since you ran the training step with RNA-seq, as long as you specified the same output directory as you did for funannotate train, it will use the PASA results for training and won't run BUSCO

Thanks

nextgenusfs commented 4 years ago

You should NOT specify anything in --augustus_species. My comment was just a general one: if --augustus_species is found in the training database (i.e. funannotate species), it will skip training and use those parameters. You don't want that.

You want to train a new species, so you actually don't want your species in the pretrained "database". Since you specified --species "Myotis lucifugus", the script will generate the training data under the name myotis_lucifugus, and after the script is finished it will output a JSON parameter file with the paths to all the relevant training data it used. If you want to "commit" those results to your local training "database", you do that by adding the parameter file with funannotate species -- the script will tell you what to do (in your case it will be in the log file, since it seems like you are running non-interactively).

Since you have PASA input data, it won't run BUSCO. If you wanted to get into the habit of specifying --busco_seed_species, i.e. in the event that you don't have RNA-seq data for some other genome, it's harmless to add it to the command (in your specific case it won't do anything).

devonorourke commented 4 years ago

Thanks again Jon. Where would I find the JSON parameter file containing all the training data to commit to a local database? Is it somewhere within the training directory? The only .json file I can find is within mypath2/training/getBestModel/kallisto and is named run_info.json, with these contents:

{
        "n_targets": 31657,
        "n_bootstraps": 0,
        "n_processed": 87220303,
        "n_pseudoaligned": 48097286,
        "n_unique": 24278734,
        "p_pseudoaligned": 55.1,
        "p_unique": 27.8,
        "kallisto_version": "0.46.2",
        "index_version": 10,
        "start_time": "Wed Apr  8 06:26:25 2020",
        "call": "kallisto quant -i funR1n/training/getBestModel/bestModel -o funR1n/training/getBestModel/kallisto --plaintext -t 12 funR1n/training/trimmomatic/trimmed_left.fastq.gz funR1n/training/trimmomatic/trimmed_right.fastq.gz"
}

There is a file named Myotis_lucifugus within /path2my/training/pasa but it appears to be a different format (SQLite 3.x database).

Just wondering what I need to type to commit to the local DB, as I can't find the path to the .json file needed for:

funannotate species -s myotis_lucifugus -a whereIsMy?.parameters.json
nextgenusfs commented 4 years ago

@devonorourke check the log file for predict and it should tell you near the end. But it is in the outfolder/predict_results directory along with the rest of the output files from predict. Note you don't need to commit it to the database, i.e. you might make some changes etc. You can re-use those parameters with the funannotate predict --parameters option, where you point it to that JSON file.
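For example, a later run could look roughly like this (the JSON filename here is just a guess based on your species name; check predict_results for the actual name):

funannotate predict -i genome.fasta -o newRun -s "Myotis lucifugus" --cpus 16 \
--parameters funR1n/predict_results/myotis_lucifugus.parameters.json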

nextgenusfs commented 4 years ago

@devonorourke how do the results from predict look?

devonorourke commented 4 years ago

Thanks Jon. I'm realizing now that the .json file I'm looking for isn't in the train output at all - it's in the predict output, correct?

I'm still struggling to get predict to run properly. I tried running:

funannotate predict -o funR1n -s "Myotis lucifugus" --cpus 16 -i $INFASTA --min_intronlen 50 --max_intronlen 250000 --organism other --repeats2evm

...(where funR1n is the same output prefix from the previous train command). However, it appears that the program defaults to wanting to run BUSCO. Earlier you mentioned it shouldn't be doing that, right?

Since you have PASA input data, it won't run BUSCO

[04/08/20 09:36:17]: OS: linux2, 28 cores, ~ 132 GB RAM. Python: 2.7.15
[04/08/20 09:36:17]: Running funannotate v1.7.4
[04/08/20 09:36:17]: ERROR: dikarya busco database is not found, install with funannotate setup -b dikarya

The program isn't struggling to find the database directory itself:

(funenv) [dro49@wind fun1]$  echo $FUNANNOTATE_DB
/projects/foster_lab/reference_genomes/funannotate_db

and within that path you'll find lots of files/directories:

aves                    dbCAN-fam-HMMs.txt  dbCAN.hmm.h3m     funannotate-db-info.txt                funannotate-setup.log  laurasiatheria.tar.gz  merops.formatted.fa  ncbi_cleaned_gene_products.txt  Pfam-A.hmm.h3f  Pfam.version     uniprot.release-date.txt
aves.tar.gz             dbCAN.hmm           dbCAN.hmm.h3p     funannotate.repeat.proteins.fa         go.obo                 mammalia               merops_scan.lib      outgroups                       Pfam-A.hmm.h3i  repeats.dmnd     uniprot_sprot.fasta
busco_outgroups.tar.gz  dbCAN.hmm.h3f       eukaryota         funannotate.repeat.proteins.fa.tar.gz  interpro.xml           mammalia.tar.gz        mibig.dmnd           Pfam-A.clans.tsv                Pfam-A.hmm.h3m  trained_species
dbCAN.changelog.txt     dbCAN.hmm.h3i       eukaryota.tar.gz  funannotate.repeats.reformat.fa        laurasiatheria         merops.dmnd            mibig.fa             Pfam-A.hmm                      Pfam-A.hmm.h3p  uniprot.dmnd

... of course, none of these would be the dikarya database.

Maybe I misinterpreted what you were saying earlier. Should I be specifying something about BUSCO inputs in the predict command line?

nextgenusfs commented 4 years ago

It's probably just a failed logic check; just pass --busco_db mammalia and it will stop complaining. It won't run BUSCO, but I must have the check for BUSCO databases ahead of the logic for running BUSCO/PASA-mediated training.
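i.e. just re-run your same command with the extra flag, something like:

funannotate predict -o funR1n -s "Myotis lucifugus" --cpus 16 -i $INFASTA \
--min_intronlen 50 --max_intronlen 250000 --organism other --repeats2evm --busco_db mammalia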

nextgenusfs commented 4 years ago

And train and predict are somewhat of a misnomer: funannotate train generates all of the necessary training data, while the actual training is done within funannotate predict. This is for the flexibility of being able to run/train on the fly with funannotate predict.

devonorourke commented 4 years ago

That is some tricky wording :)

How about funannotate warmup? funannotate gather? Or just be overt with funannotate getmypredictiondata?

Trying the new predict command with the additional busco flag. Thanks for the quick reply

devonorourke commented 4 years ago

Yep, the added flag seems to be the solution. Log file for predict appears as follows:

[10:06 AM]: OS: linux2, 32 cores, ~ 1588 GB RAM. Python: 2.7.15
[10:06 AM]: Running funannotate v1.7.4
[10:06 AM]: Found training files, will re-use these files:
  --rna_bam funR1n/training/funannotate_train.coordSorted.bam
  --pasa_gff funR1n/training/funannotate_train.pasa.gff3
  --stringtie funR1n/training/funannotate_train.stringtie.gtf
  --transcript_alignments funR1n/training/funannotate_train.transcripts.gff3
[10:06 AM]: Parsed training data, run ab-initio gene predictors as follows:
[10:07 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
  Program        Training-Method
  augustus       pasa
  codingquarry   rna-bam
  genemark       selftraining
  glimmerhmm     pasa
  snap           pasa
[10:11 AM]: Genome loaded: 10,818 scaffolds; 2,036,012,059 bp; 40.11% repeats masked
[10:11 AM]: Parsed 62,316 transcript alignments from: funR1n/training/funannotate_train.transcripts.gff3
[10:11 AM]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[10:11 AM]: Extracting hints from RNA-seq BAM file using bam2hints
[10:12 AM]: Mapping 549,368 proteins to genome using diamond and exonerate

Hopefully this doesn't take more than 7 days. Our cluster caps the runtime for all jobs at a week (though we can request extensions).

That's why I was curious at the very beginning of this thread to better understand the techniques behind what gets fed into Augustus specifically. The documentation I've seen points to diminishing returns beyond about 1,000 gene models, and I clearly have a bunch more than that. Perhaps it won't be as significant here because Augustus is just one of several modeling programs, and it all gets weighted via EVM.

Thanks for the support!

devonorourke commented 4 years ago

I should add one note after looking into the initial predict contents:

It looks like you are downsampling (somehow... haven't figured out how yet) the total number of gene models going into Augustus. The log/output from train suggested there were almost 20k:

[08:12 PM]: Wrote 19,604 PASA gene models

Yet inside of /Path2SomeOutputPrefix/predict_misc, there are only about 1000 training sequences that are split into the test (n=101) and train (n=910) parts. I'm curious: what defines a "good" and "bad" training file from PASA?

nextgenusfs commented 4 years ago

Yes -- I don't want you to worry about it; it will sample the models appropriately! It works like this:

  1. borrows an idea (and a script) from Braker - filters gene models for matches with the intron hints file generated from the transcript alignments, i.e. keeps gene models that match the evidence
  2. then sorts gene models by number of exons (more is better for training, as single-exon genes should be excluded where possible because they result in poor prediction performance in the resulting HMM). If there are enough models, it discards single-exon gene predictions.
  3. then translates putative models to protein space and does a pairwise alignment to ensure that no two models in the training set are more than 80% identical to one another
  4. if the number of models is still > 1000 at this point, it splits off 10% of the remaining models as a test set and uses the other 90% for training. If < 1000 models, it uses 100 for testing and the rest for training. Typically anywhere from 500-1500 models pass these criteria.
devonorourke commented 4 years ago

Perfect. That's super helpful. It also makes me think that passing --optimize_augustus might not take as long as I had previously observed, because you're only using ~1,000 genes here instead of my full 20,000. Nevertheless, I didn't pass that flag, because I'm confident the filters you're describing will produce good quality gene models to work with. Plus, somehow in learning about this workflow your name keeps popping up in old forum threads, asking the questions I'm thinking of asking you :) (ex. here). Maybe optimizing isn't even universally beneficial. So much of this somehow resolves back to manual inspection of predictions in a genome browser...

Related to the predict step: should it turn out that running CodingQuarry (or any other prediction modeling software) isn't doing well, you mentioned in another post here that the user can down-weight that software through an Evidence Modeler parameter:

Some users have provided feedback that CodingQuarry doesn't work well on non-fungal genomes, I haven't disabled it yet for non-fungi genomes, however, if you see that CodingQuarry is predicting 2X the number of genes as the other gene predictors, you can turn it off by adding --weights codingquarry:0 to the funannotate predict command.

Would doing so require rerunning all parts of predict anew? Because there is also an update command, I wasn't sure what the best approach would be to evaluate the contents of the command I'm running now and, if any modifications need to be made, how to implement them in a re-run (say, by dropping out CodingQuarry).

Hope all is well

nextgenusfs commented 4 years ago

If CodingQuarry (or any of the other ab initio predictors) is not producing good results, you can simply re-run the same command and pass --weights codingquarry:0, which will turn it off, i.e. leave it out of EVM consensus gene calling. Alternatively, if you thought that GlimmerHMM was absolutely killing it on its predictions (triple its weight maybe; the default is 1) and CodingQuarry was terrible, you could re-run with --weights codingquarry:0 glimmerhmm:3
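For example, a re-run in the same output directory with adjusted weights might look roughly like this (the weight values are purely illustrative):

funannotate predict -o funR1n -s "Myotis lucifugus" --cpus 16 -i /path/to/genome.fasta \
--max_intronlen 250000 --organism other --repeats2evm --busco_db mammalia \
--weights codingquarry:0 glimmerhmm:3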

devonorourke commented 4 years ago

Ah crap. Looks like a few things are failing with the initial predict run:

-------------------------------------------------------
[10:06 AM]: OS: linux2, 32 cores, ~ 1588 GB RAM. Python: 2.7.15
[10:06 AM]: Running funannotate v1.7.4
[10:06 AM]: Found training files, will re-use these files:
  --rna_bam funR1n/training/funannotate_train.coordSorted.bam
  --pasa_gff funR1n/training/funannotate_train.pasa.gff3
  --stringtie funR1n/training/funannotate_train.stringtie.gtf
  --transcript_alignments funR1n/training/funannotate_train.transcripts.gff3
[10:06 AM]: Parsed training data, run ab-initio gene predictors as follows:
[10:07 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
  Program        Training-Method
  augustus       pasa
  codingquarry   rna-bam
  genemark       selftraining
  glimmerhmm     pasa
  snap           pasa
[10:11 AM]: Genome loaded: 10,818 scaffolds; 2,036,012,059 bp; 40.11% repeats masked
[10:11 AM]: Parsed 62,316 transcript alignments from: funR1n/training/funannotate_train.transcripts.gff3
[10:11 AM]: Creating transcript EVM alignments and Augustus transcripts hintsfile
[10:11 AM]: Extracting hints from RNA-seq BAM file using bam2hints
[10:12 AM]: Mapping 549,368 proteins to genome using diamond and exonerate
[10:45 AM]: Found 0 preliminary alignments --> aligning with exonerate
[10:47 AM]: Exonerate finished: found 0 alignments
[10:48 AM]: Running GeneMark-ES on assembly
[10:48 AM]: GeneMark-ES failed: funR1n/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[10:48 AM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[10:48 AM]: Filtering PASA data for suitable training set
[10:51 AM]: 1,011 of 19,685 models pass training parameters
[10:51 AM]: Training Augustus using PASA gene models
[10:52 AM]: Augustus initial training results:
[10:52 AM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.
[10:52 AM]: Running Augustus gene prediction using myotis_lucifugus parameters
  1. Where would I find the log file relevant to "GeneMark-ES failed: funR1n/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles."?

  2. How often do you see Augustus throwing the warning about accuracy seeming low? Does it merit killing this job and just passing in the --optimize_augustus flag since it's just getting started, or should I let the job run and start up a new one later?

When restarting jobs - do I need to delete anything prior to restarting?

devonorourke commented 4 years ago

Here's what I think is the relevant output for judging Augustus. From a tutorial that discusses Augustus training here, it looks like a gene sensitivity of about 20% is the target, and my initial run was just under 18%.

Is the "Accuracy seems low, you can try to improve by passing..." statement in the log file thrown when you have that value less than 20%, or is there another metric you're using to judge the accuracy?

*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.829 |       0.915 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                104 |                159 |             |             |
exon level |    531 |    586 |  427 | ------------------ | ------------------ |       0.729 |       0.804 |
           |    531 |    586 |      |   70 |    1 |   33 |   70 |    0 |   89 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |    85 |   101 |   18 |   67 |   83 |       0.178 |       0.212 |
----------------------------------------------------------------------------/
nextgenusfs commented 4 years ago

I would let it go, I guess. It will be done in a few hours I think. I typically see much higher gene-level recall. But it looks like there are some other install issues to deal with.

Did you try funannotate test on your system/cluster?

GeneMark can die for several reasons; there is probably more information in the genemark log file, but one reason is that the stupid license key can expire.

It also seemed like the diamond/exonerate protein mapping failed -- see if there is any info in the log file for that routine as well. The fact that it found zero preliminary mappings suggests a problem with diamond and/or the formatting of the uniprot diamond database that was installed with funannotate setup.

devonorourke commented 4 years ago

Just trying to run funannotate test now.

Looks like clean works fine:

#########################################################
Running `funannotate clean` unit testing: minimap2 mediated assembly duplications
Downloading: https://osf.io/8pjbe/download?version=1 Bytes: 252076
-----------------------------------------------
6 input contigs, 6 larger than 500 bp, N50 is 427,039 bp
Checking duplication of 6 contigs
-----------------------------------------------
scaffold_91 appears duplicated: 100% identity over 100% of the contig. contig length: 8858
scaffold_73 appears duplicated: 100% identity over 100% of the contig. contig length: 15153
scaffold_27 appears duplicated: 100% identity over 100% of the contig. contig length: 427039
-----------------------------------------------
6 input contigs; 6 larger than 500 bp; 3 duplicated; 3 written to file
CMD: funannotate clean -i test.clean.fa -o test.exhaustive.fa --exhaustive
#########################################################
#########################################################
SUCCESS: `funannotate clean` test complete.
#########################################################

Looks like mask works fine:

#########################################################
Running `funannotate mask` unit testing: RepeatModeler --> RepeatMasker
Downloading: https://osf.io/hbryz/download?version=1 Bytes: 375687
[01:11 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[01:11 PM]: Running funanotate v1.7.4
[01:11 PM]: Soft-masking simple repeats with tantan
[01:11 PM]: Repeat soft-masking finished:
Masked genome: /scratch/dro49/myluwork/annotation/fun1/test-mask_121719/test.masked.fa
num scaffolds: 2
assembly size: 1,216,048 bp
masked repeats: 50,965 bp (4.19%)
-------------------------------------------------------
-------------------------------------------------------
CMD: funannotate mask -i test.fa -o test.masked.fa --cpus 16
#########################################################
#########################################################
SUCCESS: `funannotate mask` test complete.
#########################################################

The busco test failed initially because I didn't have the dikarya database installed, but I ran:

funannotate setup -b dikarya

and then reran the test function and everything except Genemark appears to be working. Curiously, there aren't any .log files within the genemark directory to review...

-----------------------------------------------
6 input contigs, 6 larger than 500 bp, N50 is 427,039 bp
Checking duplication of 6 contigs
-----------------------------------------------
scaffold_73 appears duplicated: 100% identity over 100% of the contig. contig length: 15153
scaffold_91 appears duplicated: 100% identity over 100% of the contig. contig length: 8858
scaffold_27 appears duplicated: 100% identity over 100% of the contig. contig length: 427039
-----------------------------------------------
6 input contigs; 6 larger than 500 bp; 3 duplicated; 3 written to file
[01:26 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[01:26 PM]: Running funanotate v1.7.4
[01:26 PM]: Soft-masking simple repeats with tantan
[01:26 PM]: Repeat soft-masking finished:
Masked genome: /scratch/dro49/myluwork/annotation/fun1/test-mask_123091/test.masked.fa
num scaffolds: 2
assembly size: 1,216,048 bp
masked repeats: 50,965 bp (4.19%)
-------------------------------------------------------
-------------------------------------------------------
-------------------------------------------------------
[01:26 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[01:26 PM]: Running funannotate v1.7.4
[01:26 PM]: Parsed training data, run ab-initio gene predictors as follows:
[01:26 PM]: CodingQuarry will be skipped --> --rna_bam required for training
[01:26 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
  Program      Training-Method
  augustus     pretrained
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[01:26 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[01:27 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[01:27 PM]: Found 1,774 preliminary alignments --> aligning with exonerate
[01:27 PM]: Exonerate finished: found 1,321 alignments
[01:27 PM]: Running GeneMark-ES on assembly
[01:27 PM]: GeneMark-ES failed: annotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[01:27 PM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[01:27 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[01:31 PM]: 373 valid BUSCO predictions found, now formatting for EVM
[01:31 PM]: Running EVM commands with 15 CPUs
[01:31 PM]: Converting to GFF3 and collecting all EVM results
[01:31 PM]: 365 total gene models from EVM, now validating with BUSCO HMM search
[01:31 PM]: 365 BUSCO predictions validated
[01:31 PM]: Running Augustus gene prediction using saccharomyces parameters
[01:32 PM]: 1,607 predictions from Augustus
[01:32 PM]: Pulling out high quality Augustus predictions
[01:32 PM]: Found 370 high quality predictions from Augustus (>90% exon evidence)
[01:32 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[01:33 PM]: 0 predictions from SNAP
[01:33 PM]: SNAP prediction failed, moving on without result
[01:33 PM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[01:34 PM]: 1,772 predictions from GlimmerHMM
[01:34 PM]: Summary of gene models passed to EVM (weights):
[01:34 PM]: Running EVM commands with 15 CPUs
[01:35 PM]: Converting to GFF3 and collecting all EVM results
[01:35 PM]: 1,786 total gene models from EVM
[01:35 PM]: Generating protein fasta files from 1,786 EVM models
[01:35 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[01:35 PM]: Found 167 gene models to remove: 0 too short; 0 span gaps; 240 transposable elements
[01:35 PM]: 1,619 gene models remaining
[01:35 PM]: Predicting tRNAs
[01:35 PM]: 112 tRNAscan models are valid (non-overlapping)
[01:35 PM]: Generating GenBank tbl annotation file
[01:35 PM]: Converting to final Genbank format
[01:35 PM]: Collecting final annotation files for 1,731 total gene models
[01:35 PM]: Funannotate predict is finished, output files are in the annotate/predict_results folder
[01:35 PM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required):
funannotate iprscan -i annotate -m docker -c 16

Run antiSMASH:
funannotate remote -i annotate -m antismash -e youremail@server.edu

Annotate Genome:
funannotate annotate -i annotate --cpus 16 --sbt yourSBTfile.txt
-------------------------------------------------------

[01:35 PM]: Training parameters file saved: annotate/predict_results/saccharomyces.parameters.json
[01:35 PM]: Add species parameters to database:

  funannotate species -s saccharomyces -a annotate/predict_results/saccharomyces.parameters.json

  Source         Weight   Count
  Augustus       1        1452
  Augustus HiQ   2        371
  GlimmerHMM     1        1772
  Total          -        3595
-------------------------------------------------------
[01:35 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[01:35 PM]: Running funannotate v1.7.4
[01:35 PM]: Parsed training data, run ab-initio gene predictors as follows:
[01:35 PM]: CodingQuarry will be skipped --> --rna_bam required for training
[01:35 PM]: Loading genome assembly and parsing soft-masked repetitive sequences
  Program      Training-Method
  augustus     busco
  genemark     selftraining
  glimmerhmm   busco
  snap         busco
[01:35 PM]: Genome loaded: 6 scaffolds; 3,776,588 bp; 19.75% repeats masked
[01:35 PM]: Mapping 1,065 proteins to genome using diamond and exonerate
[01:35 PM]: Found 1,774 preliminary alignments --> aligning with exonerate
[01:36 PM]: Exonerate finished: found 1,327 alignments
[01:36 PM]: Running GeneMark-ES on assembly
[01:36 PM]: GeneMark-ES failed: annotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[01:36 PM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[01:36 PM]: Running BUSCO to find conserved gene models for training ab-initio predictors
nextgenusfs commented 4 years ago

You should be able to see the command it's running in the log file, so you could just run that manually; maybe a perl dependency is missing in this environment?

Looks like snap also failed?

devonorourke commented 4 years ago

Certainly have the ability to run Genemark manually on my end. Then with predict it's just a matter of passing in that file with --genemark_gtf, correct?

The rest of the test generated these results, suggesting that everything but Genemark is working, I suppose. One part failed near the end: while predict mostly worked, it stalled when looking for a .gff3 file to run BUSCO, which I'm guessing might have something to do with the missing Genemark file?:

[01:39 PM]: 373 valid BUSCO predictions found, now formatting for EVM
[01:40 PM]: Running EVM commands with 15 CPUs
[01:40 PM]: Converting to GFF3 and collecting all EVM results
[01:40 PM]: 365 total gene models from EVM, now validating with BUSCO HMM search
[01:40 PM]: 365 BUSCO predictions validated
[01:40 PM]: Training Augustus using BUSCO gene models
[01:40 PM]: Augustus initial training results:
[01:40 PM]: Running Augustus gene prediction using awesome_busco parameters
[01:41 PM]: 1,393 predictions from Augustus
[01:41 PM]: Pulling out high quality Augustus predictions
[01:41 PM]: Found 316 high quality predictions from Augustus (>90% exon evidence)
[01:41 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[01:41 PM]: 0 predictions from SNAP
[01:41 PM]: SNAP prediction failed, moving on without result
[01:41 PM]: Running GlimmerHMM gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[01:42 PM]: 1,772 predictions from GlimmerHMM
[01:42 PM]: Summary of gene models passed to EVM (weights):
[01:42 PM]: Running EVM commands with 15 CPUs
[01:43 PM]: Converting to GFF3 and collecting all EVM results
[01:43 PM]: 1,710 total gene models from EVM
[01:43 PM]: Generating protein fasta files from 1,710 EVM models
[01:43 PM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[01:43 PM]: Found 154 gene models to remove: 0 too short; 0 span gaps; 224 transposable elements
[01:43 PM]: 1,556 gene models remaining
[01:43 PM]: Predicting tRNAs
[01:43 PM]: 106 tRNAscan models are valid (non-overlapping)
[01:43 PM]: Generating GenBank tbl annotation file
[01:43 PM]: Converting to final Genbank format
[01:43 PM]: Collecting final annotation files for 1,662 total gene models
[01:43 PM]: Funannotate predict is finished, output files are in the annotate/predict_results folder
[01:43 PM]: Your next step might be functional annotation, suggested commands:
-------------------------------------------------------
Run InterProScan (Docker required):
funannotate iprscan -i annotate -m docker -c 16

Run antiSMASH:
funannotate remote -i annotate -m antismash -e youremail@server.edu

Annotate Genome:
funannotate annotate -i annotate --cpus 16 --sbt yourSBTfile.txt
-------------------------------------------------------

[01:43 PM]: Training parameters file saved: annotate/predict_results/awesome_busco.parameters.json
[01:43 PM]: Add species parameters to database:

  funannotate species -s awesome_busco -a annotate/predict_results/awesome_busco.parameters.json

  Feature       Specificity   Sensitivity
  nucleotides   99.3%         84.0%
  exons         69.6%         56.5%
  genes         82.0%         55.8%
  Source         Weight   Count
  Augustus       1        1077
  Augustus HiQ   2        316
  GlimmerHMM     1        1772
  Total          -        3165
-------------------------------------------------------
[01:43 PM]: OS: linux2, 24 cores, ~ 132 GB RAM. Python: 2.7.15
[01:43 PM]: Running funannotate v1.7.4
[01:43 PM]: Ab initio training parameters file passed: annotate/predict_results/awesome_busco.parameters.json
Traceback (most recent call last):
  File "/scratch/dro49/conda/envs/funenv/bin/funannotate", line 660, in <module>
    main()
  File "/scratch/dro49/conda/envs/funenv/bin/funannotate", line 650, in main
    mod.main(arguments)
  File "/scratch/dro49/conda/envs/funenv/lib/python2.7/site-packages/funannotate/predict.py", line 467, in main
    LOCALPARAMETERS, aug_species+'.genemark.mod'))
  File "/scratch/dro49/conda/envs/funenv/lib/python2.7/shutil.py", line 96, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: u'/scratch/dro49/myluwork/annotation/fun1/test-busco_123091/annotate/predict_misc/ab_initio_parameters/awesome_busco.genemark.mod'

#########################################################
Running `funannotate clean` unit testing: minimap2 mediated assembly duplications
CMD: funannotate clean -i test.clean.fa -o test.exhaustive.fa --exhaustive
#########################################################
#########################################################
SUCCESS: `funannotate clean` test complete.
#########################################################

#########################################################
Running `funannotate mask` unit testing: RepeatModeler --> RepeatMasker
CMD: funannotate mask -i test.fa -o test.masked.fa --cpus 16
#########################################################
#########################################################
SUCCESS: `funannotate mask` test complete.
#########################################################

#########################################################
Running `funannotate predict` unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --augustus_species saccharomyces --cpus 16 --species Awesome testicus
#########################################################
#########################################################
SUCCESS: `funannotate predict` test complete.
#########################################################

#########################################################
Running `funannotate predict` BUSCO-mediated training unit testing
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate --cpus 16 --species Awesome busco
#########################################################
#########################################################
SUCCESS: `funannotate predict` BUSCO-mediated training test complete.
#########################################################
Now running predict using all pre-trained ab-initio predictors
CMD: funannotate predict -i test.softmasked.fa --protein_evidence protein.evidence.fasta -o annotate2 --cpus 16 --species Awesome busco -p annotate/predict_results/awesome_busco.parameters.json
#########################################################
#########################################################
Traceback (most recent call last):
  File "/scratch/dro49/conda/envs/funenv/bin/funannotate", line 660, in <module>
    main()
  File "/scratch/dro49/conda/envs/funenv/bin/funannotate", line 650, in main
    mod.main(arguments)
  File "/scratch/dro49/conda/envs/funenv/lib/python2.7/site-packages/funannotate/test.py", line 391, in main
    runBuscoTest(args)
  File "/scratch/dro49/conda/envs/funenv/lib/python2.7/site-packages/funannotate/test.py", line 209, in runBuscoTest
    tmpdir, 'annotate2', 'predict_results', 'Awesome_busco.gff3')) <= 1800
  File "/scratch/dro49/conda/envs/funenv/lib/python2.7/site-packages/funannotate/test.py", line 40, in countGFFgenes
    with open(input, 'rU') as f:
IOError: [Errno 2] No such file or directory: 'test-busco_123091/annotate2/predict_results/Awesome_busco.gff3'
devonorourke commented 4 years ago

I think Genemark is failing because of a missing file; however, there isn't anything else in the log file mentioning genemark that would suggest why that file doesn't exist...

[01:27 PM]: GeneMark-ES failed: annotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.

No idea about SNAP; nothing in the log file except:

snap         busco
snap         busco
devonorourke commented 4 years ago

I think I found the Genemark error; this came from the funannotate test script

[04/08/20 13:36:16]: (None, "Can't locate YAML.pm in @INC (@INC contains: /scratch/dro49/conda/envs/funenv/lib/site_perl /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /projects/foster_lab/pkgs/gmes_linux_64/gmes_petap.pl line 76.\nBEGIN failed--compilation aborted at /projects/foster_lab/pkgs/gmes_linux_64/gmes_petap.pl line 76.\n")

Not sure why it's not recognizing that Perl module though... it's right where it should be:

(funenv) [dro49@wind 5.26.2]$  pwd
/scratch/dro49/conda/envs/funenv/lib/site_perl/5.26.2
(funenv) [dro49@wind 5.26.2]$  ls
Algorithm     Array    Business  CGI.pm       Convert  Digest    Eval         fields.pm  Graph.pod    HTTP      JSON     Logger       MailTools.pm   Method    Moo           Object    parent.pm   Reply        Statistics  Test     Try        WWW
Apache        Authen   Cache     CGI.pod      CPAN     Dist      Exception    File       GraphViz     Image     JSON.pm  LWP          MailTools.pod  MIME      Moo.pm        OLE       Parse       Role         Sub         Test.pm  ttfmod.pl  x86_64-linux-thread-multi
App           auto     Capture   Class        Crypt    Email     Exporter     Font       GraphViz.pm  IO        lib.pm   lwpcook.pod  Math           MLDBM     Mozilla       oo.pm     PDF         Set          SVG         Text     Type       XML
AppConfig     base.pm  Carp      Clone        Data     Encode    Exporter.pm  Getopt     Hash         IPC       libwww   LWP.pm       MCE            MLDBM.pm  MRO           Package   Perl        SOAP         SVG.pm      Tie      Types      YAML
AppConfig.pm  Bio      Carp.pm   Compress     Date     Error     ExtUtils     Graph      Heap071      Jcode     List     lwptut.pod   MCE.pm         Module    Net           Parallel  Pod         Sort         TAP         Time     URI        YAML.pm
Archive       Bundle   CGI       constant.pm  Devel    Error.pm  Fh.pm        Graph.pm   HTML         Jcode.pm  Locale   Mail         MCE.pod        MojoX     newgetopt.pl  Params    PostScript  Spreadsheet  Task        Tree     URI.pm     YAML.pod

Additionally, if I have the funannotate Conda environment active and just run /projects/foster_lab/pkgs/gmes_linux_64/gmes_petap.pl, the script loads without error.

Would the manual Genemark command mirroring what Funannotate is doing just be:

gmes_petap.pl my.fasta --ES

or perhaps there are extra parameters to invoke?

nextgenusfs commented 4 years ago

The genemark scripts have a nasty shebang line that defaults to /usr/bin/perl, so if you installed perl modules in a site or local perl, they won't be picked up. You have to change the shebang line to /usr/bin/env perl, which will then work as expected. Note that all perl scripts in the genemark installation need to be patched.

devonorourke commented 4 years ago

Is that just for gmes_petap.pl or are there other Perl scripts to modify shebang lines for?

nextgenusfs commented 4 years ago

You need to change all the perl scripts; at least that is what has worked for me. It's really a silly way to distribute the code....
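For example, something like this should patch them all in place (back up the directory first; untested on your install):

cd /projects/foster_lab/pkgs/gmes_linux_64
# rewrite the first line of every perl script to use env-resolved perl
sed -i '1s|^#!.*perl.*|#!/usr/bin/env perl|' *.pl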

But to answer your previous question: you can pass the genemark.gtf that you already ran; funannotate isn't doing anything special when it runs -ES.
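If you do run it by hand, something roughly like this should be close (flags are from memory -- check gmes_petap.pl --help for your version), and then you hand the resulting GTF back to predict:

# self-training GeneMark-ES run on the soft-masked assembly
gmes_petap.pl --ES --cores 16 --sequence genome.softmasked.fa
# point predict at the GTF it writes (usually genemark.gtf in the run directory)
funannotate predict ... --genemark_gtf genemark.gtf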

devonorourke commented 4 years ago

Thanks. I haven't run Genemark yet. Just wanted to double check that the command I was going to run manually would reflect what Funannotate was doing internally.

Still unclear on why SNAP is failing at the moment, as it failed in the test too. However, the protein alignment with Diamond worked just fine in the test but didn't appear to work in the real run. Looks like I'm going to be debugging for a bit.

Thanks for the help

nextgenusfs commented 4 years ago

At least one user has found that one of the snap executables was not compiled properly on their system from Conda install -- you should be able to find it in the issues tab.

devonorourke commented 4 years ago

probably this? https://github.com/nextgenusfs/funannotate/issues/386#issuecomment-591694156

nextgenusfs commented 4 years ago

Yes that was it -- looks like recompiling forge might help you. Hardest part of genome annotation is installing the software.....

devonorourke commented 4 years ago

I went back and changed all the Perl files in Genemark, and funannotate test worked properly this time:

[06:41 PM]: Running GeneMark-ES on assembly
[06:42 PM]: 1,576 predictions from GeneMark

I suppose folks installing via Conda are typically going to run into this problem then?

I followed the instructions posted in issue #386: I replaced the forge binary installed in the Funannotate Conda environment with a previously compiled forge binary installed separately (via git) from SNAP's GitHub page. I only moved the forge binary, for what it's worth, and it seems to have satisfied Funannotate's SNAP needs:

[06:48 PM]: Running SNAP gene prediction, using training data: annotate/predict_misc/busco.final.gff3
[06:48 PM]: 1,440 predictions from SNAP

Finally, though I'm still testing it, I think the reason Diamond failed in my main run but succeeded in this test run is memory. While the small test didn't fail, I tried submitting the same two Diamond commands (makedb and blastx) on their own, using the same input files from the failed Funannotate run, and the job erred out with an out-of-memory message at the very end of the process, when it starts the Computing alignments... step. I don't remember seeing such a note in any log file when running it from Funannotate. Are out-of-memory errors typically picked up in Funannotate logs?

I upped the memory to 40G for the rerun (of just the two Diamond commands) and it also failed, so I wondered if the parameters needed some sort of modification. The most likely target in my mind was the -k parameter, setting it to some value other than 0 (as it is currently set). I tried that, and it still failed at the Computing alignments... step with 40G of RAM.

Then I thought that perhaps recording every possible alignment is helpful, but maybe additional parameters like query coverage (--query-cover) would be useful? I tried that, and it also failed at the same Computing alignments... step.

I tried one final thing: upping to 120 GB of RAM with 12 CPUs, using the same parameters as the Funannotate run, and the job again failed. ☠️ 💢 😠 ...

One of the things I noticed in the Diamond GitHub issues was comments about reducing the number of threads to get jobs to finish. I had been running with 12 threads for the earlier attempts, so I reran the same (previously failed) commands with 8 CPUs and 120G. Lo and behold, everything worked 👼 ! It makes me think that the total RAM for Diamond needs to be kept in some ratio with the number of CPUs a user allocates to the entire Funannotate job. I bet there is some tinkering with Diamond's block size parameters that might keep these jobs from crashing within Funannotate while still using more CPUs as needed. It seems like the ratio is about 10 GB of RAM for every CPU, I think.

One final bit about these parameters for Diamond: invoking a query coverage of 80% gives ridiculously low numbers of matches compared to the default of not passing that flag at all. In my dataset, reducing the number of matches by passing -k 100 (keeping 100 matches instead of all of them) generated 136,669 matches. However, when I keep -k 0 but reduce the matches by requiring at least 80% query coverage, there are only 32 (!) matches.

Maybe limiting to something like 100 or 500 hits would suffice and reduce the amount of memory required for these larger eukaryote genomes, but it seems like any parameters we try to modify ultimately aren't the real problem: I think it's the mix of the number of CPUs and RAM provided.
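If block size turns out to be the right knob, my guess is that something like the following would be worth testing -- lowering -b (and raising -c) should reduce peak RAM. The -b/-c values here are guesses, not Funannotate defaults, and the remaining flags mirror the command Funannotate runs:

# smaller block size (-b) and more index chunks (-c) to trade speed for lower memory
diamond blastx --threads 8 -b 1 -c 4 -q genome.softmasked.fa --db diamond -o diamond.matches.tab \
-e 1e-10 -k 0 --more-sensitive -f 6 sseqid slen sstart send qseqid qlen qstart qend pident length evalue score qcovhsp qframe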

Finally, I didn't see any way to modify the diamond section of predict in the docs, so I'm guessing any change to the protein alignment step would need to be manually adjusted starting here?

Thanks

nextgenusfs commented 4 years ago

Thanks for looking into the Diamond issues. We could maybe just cap the number of CPUs for that step to something like 8, or even lower -- maybe 4 would make sense? I'd rather err on the side of it finishing and not crashing, at the expense of speed. The other issue is that I don't know how to capture the RAM that was specified in your cluster submission; the RAM it calculates is what is available on the node. So I'm leaning towards just setting a limit on the CPUs. I can make a quick change to the code and you can install it with pip into your funannotate conda environment if that is helpful.

Do you have to specify RAM usage on your cluster? We don't have to, as I will just run it on a node (most of ours are 24 cores with 250 GB RAM) and I've never had an issue -- but duly noted that I've also never tried to annotate a genome this large....

The Diamond parameters were set/tinkered with to try to find the settings that gave results most similar to tblastn, which I stopped using for speed reasons -- basically multi-threading in tblastn was/is broken in nearly all blast+ versions from 2.3 -- 2.9, so it would die silently without error in the middle of the search. But it was too slow using a single thread....

nextgenusfs commented 4 years ago

I've limited diamond to 8 CPUs. You can install the updated code from within your conda environment with this command:

python2 -m pip install git+https://github.com/nextgenusfs/funannotate.git
devonorourke commented 4 years ago

Thanks Jon, I started a new run last night that appears to be working well with Diamond; I gave it 120 GB of RAM while running just 8 threads, and it's now working through 597,903 (!) preliminary alignments with Exonerate (from 62,316 transcripts):

[08:06 PM]: Mapping 549,368 proteins to genome using diamond and exonerate
[09:42 PM]: Found 597,903 preliminary alignments --> aligning with exonerate

The old log file from the failed Diamond step indicated it was also able to attempt mapping the protein set, but no preliminary alignments were found:

[10:12 AM]: Mapping 549,368 proteins to genome using diamond and exonerate
[10:45 AM]: Found 0 preliminary alignments --> aligning with exonerate

Fingers crossed this job will keep working. By the way, is it easy to restart a Funannotate job, or does the entire pipeline need to be rerun? E.g. Diamond and Genemark finish, but the plug gets pulled on PASA...

When I was at UNH we didn't have to specify RAM, but at NAU we do. Most of the nodes here are 24 CPUs with 128 GB RAM, though there is a mix of 28/250, 32/500, and then a pile of GPU units. I wish I understood how to make things work with GPUs...

The only way I can think of for you to capture CPUs and RAM is for the user to specify them up front as parameters to funannotate train. Even then, I'm not sure what you'd do about it with Diamond. I like your solution of capping the Diamond job at no more than 8 CPUs while being able to allocate more to other parts of the pipeline; maybe Diamond experts can chime in on how to better tweak parameters for jobs with more memory or CPUs.

Thanks

devonorourke commented 4 years ago

Took about a day and a half to finish the Diamond/Exonerate work, but it ran successfully using the original Conda environment software (I haven't run pip install on the new code yet):

[04/08/20 20:06:10]: Mapping 549,368 proteins to genome using diamond and exonerate
[04/08/20 20:06:10]: Diamond v0.9.21; Exonerate v2.4.0
[04/08/20 20:06:10]: diamond makedb --threads 8 --in /scratch/dro49/myluwork/annotation/fun1/funR1n/predict_misc/proteins.combined.fa --db diamond
[04/08/20 20:06:19]: diamond blastx --threads 8 -q /scratch/dro49/myluwork/annotation/fun1/funR1n/predict_misc/genome.softmasked.fa --db diamond -o diamond.matches.tab -e 1e-10 -k 0 --more-sensitive -f 6 sseqid slen sstart send qseqid qlen qstart qend pident length evalue score qcovhsp qframe
[04/08/20 21:42:22]: Found 597,903 preliminary alignments --> aligning with exonerate
[04/10/20 05:59:37]: Exonerate finished: found 14,469 alignments

Guessing that your updated code capping just the diamond blastx step will go a long way toward speeding things up, since I can then throw more CPUs at the Exonerate step.

nextgenusfs commented 4 years ago

Exonerate is very slow. But the good news is that you only need to run it once, i.e. you can reuse that data, either through a rerun in the same directory (where it will find it) or by passing it as an argument if doing a clean run in a new directory.