nextgenusfs / funannotate

Eukaryotic Genome Annotation Pipeline
http://funannotate.readthedocs.io
BSD 2-Clause "Simplified" License

Two step repeat masking (partial RepeatModeler output + tantan), reasonable approach? #929

Open igwill opened 1 year ago

igwill commented 1 year ago

Hi, I've been reading some discussions on RepeatMasker and RepeatModeler and gather that there isn't a perfect, speedy solution for masking fungal genomes. I wanted to get some input on whether conservatively applying RepeatModeler (using a new de novo Hifiasm assembly of an ascomycete to build the database), followed by a second pass with tantan, should work well downstream with the Funannotate pipeline. Basically, I'm trying to avoid both masking true protein-coding sequences and making spurious repeat calls. The end goal of this genome project is not the repeats themselves, but to help map RNA-seq reads, compare protein-coding genes between species, etc.

Since RepeatModeler classifies some repeats and labels the ones it can't recognize as Unknown, I was thinking of using only the classified families, which should be the safer bet, I think? This is to avoid manually curating the repeat library and going down another rabbit hole of homology searches against various TE databases, for this and other new genomes in this project.
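For reference, a minimal sketch of how the Unknown families could be dropped from the library. It assumes the usual RepeatModeler header layout, where the first token of each header is `name#Class/Family` (for example `rnd-1_family-0#LTR/Gypsy`) and unclassified families carry `#Unknown`. The function names are illustrative only and not part of any of the tools.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) for each FASTA record."""
    header = None
    seq_lines: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line, []
        elif header is not None and line:
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def family_class(header: str) -> str:
    """Return the text after '#' in the record ID; 'Unknown' if there is none.

    Assumption: headers look like '>rnd-1_family-0#LTR/Gypsy ( ... )'.
    """
    tokens = header[1:].split()
    if not tokens or "#" not in tokens[0]:
        return "Unknown"
    return tokens[0].split("#", 1)[1]


def drop_unknown(lines: Iterable[str]) -> List[str]:
    """Keep only records whose classification is not 'Unknown'."""
    kept: List[str] = []
    for header, seq_lines in read_fasta(lines):
        if family_class(header).lower() != "unknown":
            kept.append(header)
            kept.extend(seq_lines)
    return kept


def filter_library(in_path: str, out_path: str) -> None:
    """Write a copy of the library at in_path without Unknown families."""
    with open(in_path) as src:
        kept = drop_unknown(src)
    with open(out_path, "w") as dst:
        dst.write("\n".join(kept) + "\n")
```

Calling `filter_library("mygenome-families.fa", "UnknownsRemoved_mygenome-families.fa")` would produce the reduced library; records without any `#` classification are treated as Unknown here, which is a conservative choice rather than anything RepeatModeler prescribes.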

Versions: funannotate v1.8.15, RepeatMasker v4.1.5, RepeatModeler v2.0.4 (all installed via mamba in one environment).

I don't have access to RepBase. I did download the Dfam 3.7 Dfam.h5 (all curated and uncurated families) and replaced the default Dfam.h5 in RepeatMasker's /Libraries/ with it. Fungi still don't appear to be represented, though, as double-checked with:

famdb.py -i ./Libraries/Dfam.h5 families --ancestors --descendants 4751   # 4751 = Fungi; gives just 9 ancestors, all cloning-vector artifact sequences

So I'm not planning to pull any sequences into a custom library FASTA, but my understanding is that the classification step of RepeatModeler will at least look at RepeatMasker's Dfam.h5.

(I went ahead and tried funannotate mask with -m repeatmasker -s fungi, having seen that elsewhere, but I just get an error: no masked assembly is made, and then the script can't find its own output file and breaks.)

1. RepeatModeler reports 149 families after the third round, using my assembly FASTA as the input database. In the mygenome-families.fa output, 101 of those are of Unknown type. I then tested masking with this full library and with a version containing only the sequences that could be assigned to a known repeat family (the other 48: Copia, Gypsy, etc.).

funannotate mask with -m repeatmasker -l mygenome-families.fa masks 33% (all families).
funannotate mask with -m repeatmasker -l UnknownsRemoved_mygenome-families.fa masks 28% (classified repeats only).

So not a major change, really; the Unknowns must not be very common or large repeats.

2. If I use only tantan on the initial assembly, 9% is masked. If I use tantan on top of the UnknownsRemoved-masked assembly (28% masked to start), the total comes to just under 33%. Does proceeding with this double-masked assembly seem reasonable? Or could stacking methods here be overdoing it?
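For anyone wanting to reproduce the percentages, a minimal sketch of how the masked fraction of an assembly could be computed. It assumes the assembly is soft-masked, i.e. masked bases are lowercase, and the function name is illustrative only.

```python
from typing import Iterable


def masked_fraction(lines: Iterable[str]) -> float:
    """Fraction of bases that are lowercase (soft-masked) in a FASTA file.

    Assumption: masking is soft (lowercase); hard-masked N runs are
    counted as unmasked bases here.
    """
    masked = 0
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


def masked_fraction_of_file(path: str) -> float:
    """Convenience wrapper that reads the FASTA from disk."""
    with open(path) as handle:
        return masked_fraction(handle)
```

Taking the figures above at face value, the two masks overlap by roughly 28 + 9 - 33 = 4 percentage points, so tantan contributes about 5 points of new masking on top of the classified-repeat library.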

Thanks!