nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Compatibility with RepeatModeler 2.x #600

Open kfuku52 opened 3 years ago

kfuku52 commented 3 years ago

Are you using the latest release? yes, 1.8.8.

Describe the bug: At the moment, funannotate mask does not seem to support RepeatModeler 2.x. This is because the RepeatModeler -e option in 1.x has been replaced by BuildDatabase -engine. This can be easily fixed, as in this branch: https://github.com/kfuku52/funannotate/commit/9bad1296fc337049b87a94127f92bdd0fdeea186 (This branch also fixes a bug where --repeatmasker_species is not passed correctly to RepeatMasker.)

I'll be happy to create a PR, but the change in my branch loses compatibility with RepeatModeler 1.x. It seems possible to get version information from RepeatModeler's help messages to support both 1.x and 2.x. So if 1.x support is still needed, I can update my branch accordingly and create a PR. Please let me know your thoughts.
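
One way to support both series is to branch on the detected major version. The sketch below is illustrative only and is not funannotate's actual code: the function names are hypothetical, the version text passed in is assumed to contain a dotted version number, and the flag placement simply follows the description in this issue (-e on RepeatModeler for 1.x, -engine on BuildDatabase for 2.x).

```python
import re


def parse_major_version(version_text):
    """Return the first major version number found in the text, or None.

    Assumption: the text (e.g. captured from RepeatModeler's help or
    version output) contains a dotted version such as "2.0.1".
    """
    match = re.search(r"(\d+)\.\d+(?:\.\d+)?", version_text)
    return int(match.group(1)) if match else None


def build_commands(major, db_name, fasta, engine="ncbi"):
    """Build argument lists for BuildDatabase and RepeatModeler.

    Hypothetical helper: flag placement follows this issue's description.
    If the version is unknown (None), it falls back to the 1.x layout,
    which is an arbitrary choice made for this sketch.
    """
    build_db = ["BuildDatabase", "-name", db_name]
    modeler = ["RepeatModeler", "-database", db_name]
    if major is not None and major >= 2:
        # 2.x: the engine is given to BuildDatabase
        build_db += ["-engine", engine]
    else:
        # 1.x: the engine is given to RepeatModeler
        modeler += ["-e", engine]
    build_db.append(fasta)
    return build_db, modeler
```

The returned lists could then be handed to subprocess.run one after the other, so the rest of the masking step would not need to know which series is installed.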

hyphaltip commented 3 years ago

I definitely have this bug too. I have not worked on a workaround, but if you want to provide the option to run v1 or v2, that would be better. I am just running RM outside of funannotate for the time being, but I think having this better rolled in would be helpful.

nextgenusfs commented 3 years ago

I'm not sure I want to support RepeatModeler/Masker. I've not upgraded the code related to this for several years because installation is not straightforward and the RepBase database is no longer available, which makes masking (at least for fungi and other non-model organisms) difficult. I sort of view it as out of scope for funannotate.

kfuku52 commented 3 years ago

I don't have the RepBase subscription but RepeatMasker is still useful because the latest version comes with the out-of-the-box Dfam repeat database (see here), so Repbase is no longer necessary.

bioconda provides the recipes for both, so currently it's pretty easy to install if you don't have to manually compile it from the source code: conda install -c bioconda repeatmasker repeatmodeler. bioconda's latest RepeatModeler is 2.x.

nextgenusfs commented 3 years ago

Right, Dfam contains repeats of only 5 species, so for my usage (fungi) it isn't helpful.

The current release (Dfam 3.2) contains 6,900 TE families spanning five organisms: human, mouse, zebrafish, fruit fly, nematode, and a growing number of additional species. To supplement this databases we recommend obtaining the RepeatMasker edition of RepBase.

kfuku52 commented 3 years ago

RepeatMasker seems to automatically download the latest Dfam release, which is currently 3.3 and contains 347 species, although I don't know how many fungal species there are.

(funannotate) [kfuku@at137 gfe_data]$ find ~/.pyenv/versions/miniconda3-4.3.30/envs/funannotate -name *Dfam*
/home/kfuku/.pyenv/versions/miniconda3-4.3.30/envs/funannotate/include/H5FDfamily.h
/home/kfuku/.pyenv/versions/miniconda3-4.3.30/envs/funannotate/share/RepeatMasker/Libraries/Dfam.h5
/home/kfuku/.pyenv/versions/miniconda3-4.3.30/envs/funannotate/share/RepeatMasker/Libraries/CONS-Dfam_3.3
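
The same lookup can be done from Python, which avoids the shell expanding the unquoted *Dfam* pattern before find sees it. This is only a sketch: the function name is hypothetical, and the environment prefix is whatever path your conda environment lives under.

```python
from pathlib import Path


def find_dfam_files(env_prefix):
    """Return sorted paths under env_prefix whose names contain 'Dfam'.

    Hypothetical helper mirroring `find <prefix> -name '*Dfam*'`.
    Like find, it matches any file or directory name, so unrelated
    hits such as HDF5's H5FDfamily.h will also be listed.
    """
    return sorted(str(p) for p in Path(env_prefix).rglob("*Dfam*"))
```

For example, calling find_dfam_files on the funannotate environment directory would list the Dfam.h5 library if the RepeatMasker package installed one there.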

nextgenusfs commented 3 years ago

Yeah, I'm not saying it isn't useful if you are annotating human, mouse, zebrafish, etc., but most of us are trying to annotate non-model organisms, so the Dfam repeat library isn't going to be very useful. RepeatModeler also used to require the RepBase library in order to do the de novo predictions; I don't know if that is still the case or not.

Edit: I didn't read your message closely (347 species). I thought I looked at this a while ago and there were very few if any fungi, but perhaps it is worthwhile looking again.

nextgenusfs commented 3 years ago

Looks like you can browse species here: https://dfam.org/browse?clade=4890&clade_descendants=true&include_raw=true

hyphaltip commented 2 years ago

I'm going to go ahead and try to get RepeatModeler 2.x + RepeatMasker 4.1 series working, so I will refer to the branch fixes for this bug posted here.