Closed wittetom closed 3 years ago
It should only be choosing locations that are in between gene models (it scans through an interlap object that contains all ab initio gene models). This is a partition strategy unique to funannotate as the default way of running it is to split contigs by size and overlap was leading to splitting gene models.
So do you have some examples where it partitioned in the middle of a gene model?
Sure - I noticed the problem because one of the predicted terpene synthases disappeared when I ran predict (I'm comparing to output from an older version of funannotate which showed the presence of this terpene synthase):
HiQ gene model is gene.g10768.t1, location is 6,880,448 <- 6,881,996. (-) on Chr4. Partitions are: Chr4_6880672-6974571, Chr4_6736616-6881172. When I open both the evm.log files in each partition directory, I can't find mention of this gene model (or any of the other gene models associated with this gene, there is also Genemark, pasa, snap, Glimmer)
The next partition over also disrupts the gene models: The hiq gene model gene.g10833.t1 runs from 7,088,197 <- 7,089,885 (-) on Chr4 Partitions at: Chr4_6974071-7088791 Chr4_7088291-7209543
Note, neither of these genes are repeat elements.
I'll attach the EVM log here: funannotate-EVM.log
And the output of funannotate check --show-versions: Maybe I should note, I used mamba to install funannotate from the bioconda channel.
You are running Python v 3.7.9. Now checking python packages... biopython: 1.76 goatools: 1.0.15 matplotlib: 3.3.4 natsort: 7.1.1 numpy: 1.20.0 pandas: 1.2.1 psutil: 5.7.0 requests: 2.25.1 scikit-learn: 0.24.1 scipy: 1.5.3 seaborn: 0.11.1 All 11 python packages installed
You are running Perl v b'5.026002'. Now checking perl modules... Bio::Perl: 1.007002 Carp: 1.38 Clone: 0.42 DBD::SQLite: 1.64 DBD::mysql: 4.046 DBI: 1.642 DB_File: 1.855 Data::Dumper: 2.173 File::Basename: 2.85 File::Which: 1.23 Getopt::Long: 2.5 Hash::Merge: 0.300 JSON: 4.02 LWP::UserAgent: 6.39 Logger::Simple: 2.0 POSIX: 1.76 Parallel::ForkManager: 2.02 Pod::Usage: 1.69 Scalar::Util::Numeric: 0.40 Storable: 3.15 Text::Soundex: 3.05 Thread::Queue: 3.12 Tie::File: 1.02 URI::Escape: 3.31 YAML: 1.29 threads: 2.15 threads::shared: 1.56 All 27 Perl modules installed
Checking external dependencies... PASA: 2.4.1 CodingQuarry: 2.0 Trinity: 2.8.5 augustus: 3.3.3 bamtools: bamtools 2.5.1 bedtools: bedtools v2.30.0 blat: BLAT v36 diamond: 2.0.6 ete3: 3.1.2 exonerate: exonerate 2.4.0 fasta: no way to determine glimmerhmm: 3.0.4 gmap: 2017-11-15 hisat2: 2.2.1 hmmscan: HMMER 3.3.2 (Nov 2020) hmmsearch: HMMER 3.3.2 (Nov 2020) java: 11.0.8-internal kallisto: 0.46.1 mafft: v7.475 (2020/Nov/23) makeblastdb: makeblastdb 2.10.0+ minimap2: 2.17-r941 proteinortho: 6.0.28 pslCDnaFilter: no way to determine salmon: salmon 0.14.1 samtools: samtools 1.10 signalp: 5.0b snap: 2006-07-28 stringtie: 2.1.4 tRNAscan-SE: 2.0.7 (Oct 2020) tantan: tantan 13 tbl2asn: no way to determine, likely 25.X tblastn: tblastn 2.10.0+ trimal: trimAl v1.4.rev15 build[2013-12-17] trimmomatic: 0.39 ERROR: emapper.py not installed ERROR: gmes_petap.pl not installed
Last, I don't know if this is useful or not, but here's a screenshot from geneious comparing old annotations (top track) and new (bottom track, with gene models in green underneath). This is how I'm finding the dropped genes. Every time I run this genome I'm about 1,000 genes short of what was previously obtained and I'm just trying to figure out how EVM decides to drop genes.
Thanks for your time and help!
Hmm, so the the logic of the script is supposed to be finding intergenic regions that are at least 2kb, it then splits in the middle of that that this sort of thing doesn't happen. In order to do this it actually pulls out all of the genes coordinates and then merge any overlapping models to create sort of "super genes" and then it uses that to find regions to split/partition. So that file is located in predict_misc/EVM/genes.PID.bed
, would you be able to send me that file? Or better yet would be to import that into Geneious and show this same region?
If you have enough RAM you can pass --no-evm-partitions
which will run each contig/chr as a separate EVM job. Should prevent this from happening, but I'd like to fix the underlying issue. I see that that isn't in the help menu.... I'll fix that.
Thanks for the quick reply.
Here's the same region again from Geneious with genes.PID.bed in grey. I'll try running things with no EVM partitions next. Cheers
Thanks -- okay, something is wrong here with the merging of the gene models, there should be a single "super gene" at each "locus". So I think I know where to try to fix in the method -- thanks. I'll let you know when its fixed.
Hi @wittetom -- I've not tracked down exactly why this failed -- I added a few more debugging steps in the latest commit that might help figure out why. I've exposed a --evm-partition-interval
option that will control the minimum length gap between genes that it will use, default is 2000 -- so you can increase this value to pick partitions where there is a larger space between the "super genes". I've also increased the overlap in these partitions slightly to see if that has an effect on EVM. The caveat here is that larger --evm-partition-interval
will result in EVM running slower due to larger chunk sizes as it won't make partitions as frequently depending on the gene density (for fungi this is relatively high). There will also be a new "super genes" bed file that is written to the predict_misc/EVM
directory -- these are the coordinates it uses to find partitions. I did not change the underlying code for this (it was working correctly) but wrote it to file so we could troubleshoot.
So it would be helpful for me if you had time to run this a few different ways on your data -- ie first try to just re-run as default and see if the increase in overlaps is enough to fix those gene calls near the ends of the partitions, then would be to try a few larger values for --evm-partition-interval
to see if there is a sweet spot for accuracy/speed. The slowest but most accurate would be to pass the --no-evm-partitions
option.
Thanks @nextgenusfs . Truly sorry for the delay in responding, I had to turn away from the genomes last week unfortunately.
First, I reran with --no-evm-partitions (before your update). The EVM step takes 10 hours this way. (44Mb genome, running with 24 cpus, 74G ECC RAM and older processors - dual Xeon X5680). All genes of interest were accounted for this way, as expected (this added +379 genes to the final output and solved the missing terpene synthase problem).
Next, I updated to the latest commit and then reran with your new default EVM partitioning parameters. The run-time for EVM is now only 25 minutes (virtually the same as before the update), however some genes are still being dropped. That said, there are now +270 more genes than before the update. I think this is probably because of the enlarged overlap. I checked the supergenes bed file and there are still supergenes being called where the dropped genes are. I think the supergenes are being correctly assigned, but the partitioning is problematic.
Next, I tried enlarging the --evm-partition-interval to 3000 bp. EVM now takes 10 hours as before, but there are still some dropped genes (it called +340, I'm not sure which genes got dropped though). I see it didn't detect partition sites in Chr1, which is ~12Mbp.
I can really see the value in this partitioning script functioning properly - let me know how I can help further.
Cheers Tom
Thanks this is helpful. I think I have idea of what I want to try next, hopefully have time this week. I will let you know when I have something new to try.
Hi @wittetom -- I pushed some updates to the EVM partitioning scheme. So previously I was trying to find these "super gene free" regions > 2kb and use those as partition sites, the code was then taking the midway point between the adjacent gene model locations and then adding some overlap (length / 3) in each direction. So I guess EVM needs more DNA context for some of these regions. So I've now got the code to essentially overlap the entire gene free region in the adjacent partitions, so this should hopefully fix the issue. I ran some local tests and determined that I was getting the same results from with --no-evm-partitions
and the new code even when I reduced the interval size to 1500
-- which would effectively be ensuring that EVM has access to at least 1500 bp outside of the gene model predictions for the gene on the end of the partition. Reducing the interval to 1500 also resulted in a performance gain.
If you would be able to try out the new code and let me know how it performs in relation to your data with --no-evm-partitions
that would be helpful. Thanks!
Hi @nextgenusfs , so far so good - I ran the latest commit and the genes I'm interested in were all called, EVM was very fast. I'll check more thoroughly soon and let you know if I detect any issues. Thanks for your help!
First, thanks for funannotate. I'm a new user.
I've installed funannotate v1.8.3 using Conda (downgraded BUSCO to v3.3.3).
It appears gene models are spanning what is a mere 500bp overlap between the EVM partitions when I run funannotate predict. I thought EVM was partitioning after every 35 genes in a way which avoids spanned genes (I see the overlap size should be 10,000?) Does this mean the funannotate-runEVM.py isn't working properly? Is there a way to adjust the overlap size?
here's an example of my command
funannotate predict -i Fp133.masked.contigs.fasta -o Fp133_training --busco_seed_species fusarium --busco_db hypocreales_odb10 --cpus 24 --weights codingquarry:0 pasa:0 augustus:1 genemark:1 hiq:2 snap:0 GlimmerHMM:0 -s Fp133
Thanks for your help! Tom