rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

Advice to remove annotated transposases #194

Closed V-JJ closed 1 year ago

V-JJ commented 1 year ago

Hello!

We are working with a couple of highly repetitive genomes (>60%) from related non-model species. These were assembled either to chromosome level or from long reads only. After masking and structural annotation, we found that repeat-related proteins such as transposases had been annotated.

So here are our questions: 1) Is the annotation of this kind of protein expected to some extent? 2) Our input repeat library was built using RepeatModeler predictions from the chromosome-level genomes only. Does it make sense to include the information from all the other genomes (assembled with long reads only)?

Many thanks! Any advice or correction will be appreciated,

rmhubley commented 1 year ago

Thanks for the question. When you annotate/mask with a transposable element tool, you run the risk of matching related (or even derived) host genes, because the TE families contain coding sequences. As you point out, there are some known cases of exaptation of transposases from DNA transposons (e.g. https://pubmed.ncbi.nlm.nih.gov/33602827/), although the vast majority of TE families do not overlap with host coding genes. In terms of the intronic annotations, it is typical to run TE annotation before running a gene finder, for exactly the issue you point out.

For the question about building libraries, it's hard to say what the best strategy is. It really depends on your goals, but it is highly likely that there are lineage-specific TE families in the other species that you will miss by not running a de novo tool on each species (regardless of assembly level).
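For gene models that have already been produced, one possible post-hoc check (not something proposed in this thread, only a minimal sketch) is to flag genes whose span is largely covered by repeat annotations and then review them by hand. The function names, the half-open coordinate convention, and the 0.5 coverage threshold below are all illustrative assumptions; parsing the gene and repeat coordinates out of the actual annotation files is left out. Given the exaptation cases mentioned above, a flagged gene is a candidate for review, not automatically a false positive.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), an assumed convention


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_fraction(gene: Interval, repeats: List[Interval]) -> float:
    """Fraction of the gene span covered by the given repeat intervals."""
    g_start, g_end = gene
    length = g_end - g_start
    if length <= 0:
        return 0.0
    covered = 0
    for r_start, r_end in merge_intervals(repeats):
        overlap = min(g_end, r_end) - max(g_start, r_start)
        if overlap > 0:
            covered += overlap
    return covered / length


def flag_te_like_genes(
    genes: Dict[str, Tuple[str, int, int]],
    repeats_by_seq: Dict[str, List[Interval]],
    min_fraction: float = 0.5,  # hypothetical threshold, tune for your data
) -> List[str]:
    """Return IDs of genes whose repeat coverage is at least min_fraction."""
    flagged = []
    for gene_id, (seq, start, end) in genes.items():
        frac = repeat_fraction((start, end), repeats_by_seq.get(seq, []))
        if frac >= min_fraction:
            flagged.append(gene_id)
    return sorted(flagged)


if __name__ == "__main__":
    # Toy coordinates, invented purely for illustration.
    genes = {
        "geneA": ("chr1", 100, 200),
        "geneB": ("chr1", 500, 900),
        "geneC": ("chr2", 0, 50),
    }
    repeats = {"chr1": [(90, 160), (150, 190), (880, 1000)]}
    print(flag_te_like_genes(genes, repeats))
```

Here only `geneA` would be flagged, since 90 of its 100 bases fall inside the merged repeat intervals. Using the whole gene span rather than exons alone is a simplification: intronic TE insertions will inflate the fraction, so restricting the calculation to coding exons may be more appropriate.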

V-JJ commented 1 year ago

Understood. Many thanks for the information and the reference! Have a nice day!