filtering repeat modeler library for real genes

EarlyEvol commented 6 years ago

Hi Jon,

I have a general repeat masking strategy question. Many best-practice guides include some method of screening RepeatModeler libraries for "real" genes that just happen to have > 6 copies. I have tried several strategies using a transcriptome and BLAST searches: basically, remove RepBase seqs from the transcriptome, then remove the remaining transcripts from the RepeatModeler lib. In the end, it looks like I would be removing lots of repeats that are probably real, because they are classified or are clearly tandem repeats. My question is: if I don't filter the repeat lib, how will over-masking affect annotation of these repetitive genes? Are masked regions just not used for training, but annotated after the models are built? In general, do you suggest just using the raw RepeatModeler output? I think a lot of my troubles stem from my RNA-seq having some genomic contamination, which shows up as "transcripts" that align to repetitive stuff.

I suppose another strategy would be to use diamond/uniprot to remove putative gene fragments from the repeat lib....
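A minimal sketch of that last idea, under several assumptions: the repeat library has already been searched against a protein database (with transposon proteins removed) and the hits saved in 12-column BLAST-style tabular format, and library headers follow a `name#Class` pattern with `#Unknown` for unclassified families. The function names, the e-value cutoff, and the `protect_classified` switch are hypothetical choices for illustration, not part of any tool. Keeping classified families even when they have a protein hit is one way to avoid throwing out repeats that are probably real.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def hit_ids(tabular_handle, max_evalue=1e-10):
    """Collect query IDs with at least one hit at or below max_evalue.

    Assumes 12 tab-separated columns with the e-value in column 11.
    """
    ids = set()
    for line in tabular_handle:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if float(cols[10]) <= max_evalue:
            ids.add(cols[0])
    return ids


def filter_library(fasta_handle, flagged, protect_classified=True):
    """Split a repeat library into (kept, removed) lists of (header, seq).

    A family is removed only if its ID is flagged and, when
    protect_classified is True, its class is 'Unknown' (assumed header
    pattern: name#Class).
    """
    kept, removed = [], []
    for header, seq in parse_fasta(fasta_handle):
        seq_id = header.split()[0]
        repeat_class = seq_id.split("#", 1)[1] if "#" in seq_id else "Unknown"
        classified = repeat_class != "Unknown"
        if seq_id in flagged and not (protect_classified and classified):
            removed.append((header, seq))
        else:
            kept.append((header, seq))
    return kept, removed


if __name__ == "__main__":
    # Toy in-memory inputs; real use would open the library and hit files.
    lib = io.StringIO(">famA#Unknown\nACGT\n>famB#LTR/Gypsy\nGGCC\n")
    hits = io.StringIO("famA#Unknown\tP1\t90\t50\t0\t0\t1\t150\t1\t50\t1e-30\t200\n")
    kept, removed = filter_library(lib, hit_ids(hits))
    print(len(kept), "kept;", len(removed), "removed")
```

Writing the removed families to a separate file for manual review, rather than discarding them, would make it easier to see what the filter actually catches.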

Any suggestions are greatly appreciated.

Thanks, Earl

nextgenusfs commented 6 years ago

Hi Earl,

Good question -- masking of repeats is both important and frustrating, as I don't know of a method that works perfectly for a variety of organisms. Funannotate uses the repeats slightly differently than other pipelines such as Maker: in funannotate, repeats are soft-masked, so the sequences are not completely hidden from the gene predictors. When possible, funannotate uses the softmasking features of GeneMark and Augustus; however, this doesn't mean that genes won't be predicted in those regions. There are currently a few different ways to run funannotate predict in relation to repeats, but the basic strategy is to softmask repeats, run the gene predictors with the softmask information, and after prediction remove putative repeat predictions by a targeted homology search and/or if the gene model is >90% contained in a repetitive region. The risk in not masking repeats is that the gene predictors can apparently get "stuck" in these regions if they are used for training. Currently, funannotate predict will issue a warning and die if your assembly is not repeat masked; you can bypass this with --force and run the pipeline on a non-repeat-masked assembly as well.
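To make the ">90% contained" idea concrete, here is a rough sketch of an overlap test. It is not funannotate's actual code: it assumes half-open (start, end) coordinates on a single contig and measures coverage over the whole gene span, and the function names and the 0.9 default are illustrative.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def masked_fraction(gene, repeats):
    """Fraction of the gene span (start, end) covered by repeat intervals."""
    g_start, g_end = gene
    length = g_end - g_start
    if length <= 0:
        raise ValueError("gene must have positive length")
    covered = 0
    # Merging first prevents double counting where repeats overlap each other.
    for r_start, r_end in merge_intervals(repeats):
        overlap = min(g_end, r_end) - max(g_start, r_start)
        if overlap > 0:
            covered += overlap
    return covered / length


def is_repeat_model(gene, repeats, threshold=0.9):
    """True if more than `threshold` of the gene lies in masked sequence."""
    return masked_fraction(gene, repeats) > threshold
```

For example, a gene at (100, 200) with repeats at (50, 150) and (140, 195) has 95 of its 100 bases covered, so it would be flagged; with a single repeat at (100, 150) it would not.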

The different methods in funannotate predict are:

1) Default: do not use repeats in EvidenceModeler, filter repeats with a blast search of curated protein repeats, and filter gene models based on overlap with masked regions.
2) --repeats2evm: pass the repeats to EvidenceModeler. This option results in fewer gene model predictions, i.e. more similar to how Maker treats the repeats.
3) --repeat_filter blast: skip overlap filtering of gene models.
4) --repeat_filter overlap: skip blast filtering of gene models.
5) --repeat_filter none: skip both blast and overlap filtering.

So in summary, I don't have a "concrete" answer to give you other than you may need to run it several ways to find the best options for your genome. Hopefully you have some prior knowledge about what models you are interested in, copy numbers, etc -- if you have a set to specifically look for you can determine what settings may work the best for your particular genome.

RepeatModeler is one method; is it the best? I really don't know. It's widely used; however, the underlying RepBase database used by both RepeatModeler and RepeatMasker is moving towards a subscription licensing model, which might make using this tool less than ideal. You can use any method to softmask your assembly: funannotate predict is basically just looking for some regions that are softmasked, so you could run something like tantan to mask only simple repeats, for example. RepeatModeler/RepeatMasker is incredibly slow, so I wanted to give users flexibility in this step, hence the funannotate mask step.
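If you want to compare masking methods, one quick sanity check is how much of the assembly ends up softmasked. A small sketch, assuming softmasking is encoded as lowercase bases in the FASTA; the function name is hypothetical and this is not how funannotate itself performs the check:

```python
def softmask_stats(fasta_lines):
    """Return (masked_bases, total_bases), counting lowercase as masked."""
    total = masked = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked, total


def percent_masked(fasta_lines):
    """Percentage of bases that are softmasked (0.0 for empty input)."""
    masked, total = softmask_stats(fasta_lines)
    return 100.0 * masked / total if total else 0.0
```

Passing an open file handle works, e.g. `percent_masked(open("assembly.masked.fa"))`, where the file name is just a placeholder. A result of 0 would suggest the assembly was never softmasked.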

Good luck! Jon

EarlyEvol commented 6 years ago

Jon,

Wow, thanks for the super speedy and thorough reply!
Since your answer wasn't that I'm crazy, I think in general I should just stop worrying about the infinite options and move forward (that's a real struggle for me!). For the first drafts of my genomes and annotations, I will take the unfiltered RM lib, then later do some comparative stuff to figure out how to annotate them more accurately. I have already run through funannotate once and the results look good, but that was done with a messy, genome-contaminated transcriptome. I'll run through it again with a cleaned-up transcriptome and check out the results. Thanks again for your awesome program and all the support/advice!

Earl

nextgenusfs / funannotate

filtering repeat modeler library for real genes #238