nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License
321 stars 85 forks source link

Evidence Modeler is partitioning with very small overlaps, invalidating gene models #558

Closed wittetom closed 3 years ago

wittetom commented 3 years ago

First, thanks for funannotate. I'm a new user.

I've installed funannotate v1.8.3 using Conda (downgraded BUSCO to v3.3.3).

It appears gene models are spanning what is a mere 500bp overlap between the EVM partitions when I run funannotate predict. I thought EVM was partitioning after every 35 genes in a way which avoids spanned genes (I see the overlap size should be 10,000?) Does this mean the funannotate-runEVM.py isn't working properly? Is there a way to adjust the overlap size?

here's an example of my command

funannotate predict -i Fp133.masked.contigs.fasta -o Fp133_training --busco_seed_species fusarium --busco_db hypocreales_odb10 --cpus 24 --weights codingquarry:0 pasa:0 augustus:1 genemark:1 hiq:2 snap:0 GlimmerHMM:0 -s Fp133

Thanks for your help! Tom

nextgenusfs commented 3 years ago

It should only be choosing locations that are in between gene models (it scans through an interlap object that contains all ab initio gene models). This is a partition strategy unique to funannotate as the default way of running it is to split contigs by size and overlap was leading to splitting gene models.

So do you have some examples where it partitioned in the middle of a gene model?

wittetom commented 3 years ago

Sure - I noticed the problem because one of the predicted terpene synthases disappeared when I ran predict (I'm comparing to output from an older version of funannotate which showed the presence of this terpene synthase):

HiQ gene model is gene.g10768.t1, location is 6,880,448 <- 6,881,996. (-) on Chr4. Partitions are: Chr4_6880672-6974571, Chr4_6736616-6881172. When I open both the evm.log files in each partition directory, I can't find mention of this gene model (or any of the other gene models associated with this gene, there is also Genemark, pasa, snap, Glimmer)

The next partition over also disrupts the gene models: The hiq gene model gene.g10833.t1 runs from 7,088,197 <- 7,089,885 (-) on Chr4 Partitions at: Chr4_6974071-7088791 Chr4_7088291-7209543

Note, neither of these genes are repeat elements.

I'll attach the EVM log here: funannotate-EVM.log

And the output of funannotate check --show-versions: Maybe I should note, I used mamba to install funannotate from the bioconda channel.

Checking dependencies for 1.8.3

You are running Python v 3.7.9. Now checking python packages... biopython: 1.76 goatools: 1.0.15 matplotlib: 3.3.4 natsort: 7.1.1 numpy: 1.20.0 pandas: 1.2.1 psutil: 5.7.0 requests: 2.25.1 scikit-learn: 0.24.1 scipy: 1.5.3 seaborn: 0.11.1 All 11 python packages installed

You are running Perl v b'5.026002'. Now checking perl modules... Bio::Perl: 1.007002 Carp: 1.38 Clone: 0.42 DBD::SQLite: 1.64 DBD::mysql: 4.046 DBI: 1.642 DB_File: 1.855 Data::Dumper: 2.173 File::Basename: 2.85 File::Which: 1.23 Getopt::Long: 2.5 Hash::Merge: 0.300 JSON: 4.02 LWP::UserAgent: 6.39 Logger::Simple: 2.0 POSIX: 1.76 Parallel::ForkManager: 2.02 Pod::Usage: 1.69 Scalar::Util::Numeric: 0.40 Storable: 3.15 Text::Soundex: 3.05 Thread::Queue: 3.12 Tie::File: 1.02 URI::Escape: 3.31 YAML: 1.29 threads: 2.15 threads::shared: 1.56 All 27 Perl modules installed

Checking Environmental Variables... $FUNANNOTATE_DB=/home/tommy/funannotate/DB/ $PASAHOME=/home/tommy/miniconda3/envs/funannotate/opt/pasa-2.4.1 $TRINITY_HOME=/home/tommy/miniconda3/envs/funannotate/opt/trinity-2.8.5 $EVM_HOME=/home/tommy/miniconda3/envs/funannotate/opt/evidencemodeler-1.1.1 $AUGUSTUS_CONFIG_PATH=/home/tommy/miniconda3/envs/funannotate/config/ ERROR: GENEMARK_PATH not set. export GENEMARK_PATH=/path/to/dir

Checking external dependencies... PASA: 2.4.1 CodingQuarry: 2.0 Trinity: 2.8.5 augustus: 3.3.3 bamtools: bamtools 2.5.1 bedtools: bedtools v2.30.0 blat: BLAT v36 diamond: 2.0.6 ete3: 3.1.2 exonerate: exonerate 2.4.0 fasta: no way to determine glimmerhmm: 3.0.4 gmap: 2017-11-15 hisat2: 2.2.1 hmmscan: HMMER 3.3.2 (Nov 2020) hmmsearch: HMMER 3.3.2 (Nov 2020) java: 11.0.8-internal kallisto: 0.46.1 mafft: v7.475 (2020/Nov/23) makeblastdb: makeblastdb 2.10.0+ minimap2: 2.17-r941 proteinortho: 6.0.28 pslCDnaFilter: no way to determine salmon: salmon 0.14.1 samtools: samtools 1.10 signalp: 5.0b snap: 2006-07-28 stringtie: 2.1.4 tRNAscan-SE: 2.0.7 (Oct 2020) tantan: tantan 13 tbl2asn: no way to determine, likely 25.X tblastn: tblastn 2.10.0+ trimal: trimAl v1.4.rev15 build[2013-12-17] trimmomatic: 0.39 ERROR: emapper.py not installed ERROR: gmes_petap.pl not installed

Last, I don't know if this is useful or not, but here's a screenshot from geneious comparing old annotations (top track) and new (bottom track, with gene models in green underneath). This is how I'm finding the dropped genes. Every time I run this genome I'm about 1,000 genes short of what was previously obtained and I'm just trying to figure out how EVM decides to drop genes.

image

Thanks for your time and help!

nextgenusfs commented 3 years ago

Hmm, so the the logic of the script is supposed to be finding intergenic regions that are at least 2kb, it then splits in the middle of that that this sort of thing doesn't happen. In order to do this it actually pulls out all of the genes coordinates and then merge any overlapping models to create sort of "super genes" and then it uses that to find regions to split/partition. So that file is located in predict_misc/EVM/genes.PID.bed, would you be able to send me that file? Or better yet would be to import that into Geneious and show this same region?

If you have enough RAM you can pass --no-evm-partitions which will run each contig/chr as a separate EVM job. Should prevent this from happening, but I'd like to fix the underlying issue. I see that that isn't in the help menu.... I'll fix that.

wittetom commented 3 years ago

Thanks for the quick reply.

Here's the same region again from Geneious with genes.PID.bed in grey. I'll try running things with no EVM partitions next. Cheers

image

nextgenusfs commented 3 years ago

Thanks -- okay, something is wrong here with the merging of the gene models, there should be a single "super gene" at each "locus". So I think I know where to try to fix in the method -- thanks. I'll let you know when its fixed.

nextgenusfs commented 3 years ago

Hi @wittetom -- I've not tracked down exactly why this failed -- I added a few more debugging steps in the latest commit that might help figure out why. I've exposed a --evm-partition-interval option that will control the minimum length gap between genes that it will use, default is 2000 -- so you can increase this value to pick partitions where there is a larger space between the "super genes". I've also increased the overlap in these partitions slightly to see if that has an effect on EVM. The caveat here is that larger --evm-partition-interval will result in EVM running slower due to larger chunk sizes as it won't make partitions as frequently depending on the gene density (for fungi this is relatively high). There will also be a new "super genes" bed file that is written to the predict_misc/EVM directory -- these are the coordinates it uses to find partitions. I did not change the underlying code for this (it was working correctly) but wrote it to file so we could troubleshoot.

So it would be helpful for me if you had time to run this a few different ways on your data -- ie first try to just re-run as default and see if the increase in overlaps is enough to fix those gene calls near the ends of the partitions, then would be to try a few larger values for --evm-partition-interval to see if there is a sweet spot for accuracy/speed. The slowest but most accurate would be to pass the --no-evm-partitions option.

wittetom commented 3 years ago

Thanks @nextgenusfs . Truly sorry for the delay in responding, I had to turn away from the genomes last week unfortunately.

First, I reran with --no-evm-partitions (before your update). The EVM step takes 10 hours this way. (44Mb genome, running with 24 cpus, 74G ECC RAM and older processors - dual Xeon X5680). All genes of interest were accounted for this way, as expected (this added +379 genes to the final output and solved the missing terpene synthase problem).

Next, I updated to the latest commit and then reran with your new default EVM partitioning parameters. The run-time for EVM is now only 25 minutes (virtually the same as before the update), however some genes are still being dropped. That said, there are now +270 more genes than before the update. I think this is probably because of the enlarged overlap. I checked the supergenes bed file and there are still supergenes being called where the dropped genes are. I think the supergenes are being correctly assigned, but the partitioning is problematic.

Next, I tried enlarging the --evm-partition-interval to 3000 bp. EVM now takes 10 hours as before, but there are still some dropped genes (it called +340, I'm not sure which genes got dropped though). I see it didn't detect partition sites in Chr1, which is ~12Mbp.

I can really see the value in this partitioning script functioning properly - let me know how I can help further.

Cheers Tom

nextgenusfs commented 3 years ago

Thanks this is helpful. I think I have idea of what I want to try next, hopefully have time this week. I will let you know when I have something new to try.

nextgenusfs commented 3 years ago

Hi @wittetom -- I pushed some updates to the EVM partitioning scheme. So previously I was trying to find these "super gene free" regions > 2kb and use those as partition sites, the code was then taking the midway point between the adjacent gene model locations and then adding some overlap (length / 3) in each direction. So I guess EVM needs more DNA context for some of these regions. So I've now got the code to essentially overlap the entire gene free region in the adjacent partitions, so this should hopefully fix the issue. I ran some local tests and determined that I was getting the same results from with --no-evm-partitions and the new code even when I reduced the interval size to 1500 -- which would effectively be ensuring that EVM has access to at least 1500 bp outside of the gene model predictions for the gene on the end of the partition. Reducing the interval to 1500 also resulted in a performance gain.

If you would be able to try out the new code and let me know how it performs in relation to your data with --no-evm-partitions that would be helpful. Thanks!

wittetom commented 3 years ago

Hi @nextgenusfs , so far so good - I ran the latest commit and the genes I'm interested in were all called, EVM was very fast. I'll check more thoroughly soon and let you know if I detect any issues. Thanks for your help!