Library of custom repeats without superfamily name

etwatson commented 8 years ago

I have assembled a repeat library for my organism from de novo assembly of reads as well as with RepeatScout. Many can be identified at the level of order by structural features, but have no/ambiguous similarities at the superfamily level.

It looks like Transposome requires all repeat libraries to belong to a preexisting superfamily (Typemap.pm), is that correct?

How might I be able to analyze the genome fraction of these TEs?

sestaton commented 8 years ago

The direct approach would be to BLAST a repeat library to your sequences and map the best matches to the superfamily for each sequence in your library. The repeat library format for Transposome follows the RepBase format, as below:

>GYPSY68-LTR_AG Gypsy   Anopheles gambiae
>BEL1-I_AG  BEL Anopheles gambiae
>BEL2-LTR_AG    BEL Anopheles gambiae
>GYPSY1-LTR_AG  Gypsy   Anopheles gambiae
>Copia-7_AG-LTR Copia   Anopheles gambiae
>MTANGA_I   Copia   Anopheles gambiae
>GYPSY32-LTR_AG Gypsy   Anopheles gambiae
>PegasusA   hAT Anopheles gambiae
>DNA-2_AG   DNA transposon  Anopheles gambiae
>Clu-15B_AG DNA transposon  Anopheles gambiae

where the header is ">repeat_name superfamily genus species". As you can see, some entries only provide "DNA transposon" because the repeats were not annotated to a finer level. That isn't going to cause problems, but finer-grained annotations like "Mutator" would give more insightful results, because "DNA transposon" could mean a lot of things: MuDR, Helitron, hAT, or something else.
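As a rough illustration of that header layout, here is a minimal Python sketch that parses and validates one header line. It assumes the three fields are tab-separated (the whitespace shown in this thread may have been altered in rendering), and the names `RepeatHeader`, `parse_header`, and `format_header` are hypothetical helpers, not part of Transposome.

```python
from typing import NamedTuple, Optional


class RepeatHeader(NamedTuple):
    name: str          # e.g. "GYPSY68-LTR_AG"
    superfamily: str   # e.g. "Gypsy" or "DNA transposon"
    species: str       # e.g. "Anopheles gambiae"


def parse_header(line: str) -> Optional[RepeatHeader]:
    """Parse '>name<TAB>superfamily<TAB>genus species'.

    Assumes tab-separated fields; returns None for anything else.
    """
    if not line.startswith(">"):
        return None
    fields = [f.strip() for f in line[1:].rstrip("\n").split("\t")]
    # Require exactly three non-empty fields.
    if len(fields) != 3 or not all(fields):
        return None
    return RepeatHeader(*fields)


def format_header(name: str, superfamily: str, species: str) -> str:
    """Build a header line in the same three-field layout."""
    return f">{name}\t{superfamily}\t{species}"
```

Splitting on tabs rather than on any whitespace keeps multi-word fields such as "DNA transposon" and "Anopheles gambiae" intact.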

I would use a combination of BLAST to find highly similar sequences (this should work well for the superfamily level), and protein and pHMM matches with HMMER. Likely, all you need is blastn, and if that does not provide ample evidence at the superfamily level then you may be trying to annotate artifacts, so be aware of that.
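The best-match mapping described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is blastn tabular output (`-outfmt 6`, where column 1 is the query, column 2 the subject, and column 12 the bit score), the custom repeats are the queries, and an annotated reference library supplies the subject-to-superfamily lookup. The function names `best_hits` and `annotate` are hypothetical and not part of Transposome.

```python
import csv
from typing import Dict, Iterable, List, Tuple


def best_hits(blast_lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Keep the highest bit-score subject for each query.

    Expects the 12 standard columns of BLAST tabular output (-outfmt 6).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def annotate(best: Dict[str, Tuple[str, float]],
             subject_superfamily: Dict[str, str],
             species: str) -> List[str]:
    """Turn best hits into '>name<TAB>superfamily<TAB>species' headers.

    subject_superfamily maps reference repeat names to superfamilies
    (assumed to come from the annotated reference library headers).
    Queries whose best subject has no known superfamily are skipped.
    """
    headers = []
    for query, (subject, _score) in sorted(best.items()):
        superfamily = subject_superfamily.get(subject)
        if superfamily is None:
            continue
        headers.append(f">{query}\t{superfamily}\t{species}")
    return headers
```

Taking only the single best hit is the simplest possible rule; it ignores cases where several superfamilies score almost equally well, which would need separate handling.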

etwatson commented 8 years ago

I may be able to annotate some of my TEs with BLAST, but there will likely be multiple good hits or other conflicting evidence. I already used PASTEC to classify my TEs, which uses blastn, tblastx, and blastp for similarity-based searches. It also uses HMMER for profile-based searches, in addition to other profile-based methods. So, some TEs have been identified at the class/subclass level, but not with the BLAST/similarity-based methods that provide the superfamily classification.

These are TEs that RepeatMasker estimates cover 15.42% of the genome.

So, for these, I used headers like this (which do not work):

>LTR_21782  LTR Tigriopus californicus
>RNA_21848  RNA Tigriopus californicus
>DNA_22070  DNA Tigriopus californicus
>LINE_23703 LINE    Tigriopus californicus

sestaton commented 8 years ago

The repeat format used by Transposome follows the widely accepted "unified" schema used in this Nature paper. The format you posted is similar to RepBase but does not reflect any annotations (as I mentioned above, "DNA" is not descriptive enough to be useful). What you want is to start with a database of repeats from a closely related species, or you can create your own but this will only be as useful as the information you put into the analysis.

The superfamily level is conserved across orders of eukaryotic taxa, and I assure you that you can get much more specific than "LTR" or "DNA" with blastn/blastx, though you will find some conflicts and high levels of divergence. If you are seeing lots of conflicts at the superfamily level or higher, then I would discard those predictions as artifacts. I personally don't like RepeatScout because it builds "repeats" from k-mer similarities from all over the genome and tries to extend them. So you are not annotating a locus; you are annotating an assembly of things that are similar in the genome. And, of course, you will find lots of conflicts at the annotation stage because the repeat contig is actually composed of many different things. I would be cautious about using this type of data for actual annotation. At the level you provide above (e.g. LTR or DNA), you can probably be confident in the predictions, but if you try to assign finer annotations you will be incorporating errors, and this may be why the tools you used only provide that level of classification. For high-level estimates of composition, though, this works well.
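One way to flag the superfamily-level conflicts mentioned above is sketched below in Python. It is only an illustration: a sequence is treated as conflicting when hits scoring within some fraction of its top bit score point to more than one superfamily. The 0.9 cutoff, the function name `split_by_conflict`, and the input shapes are all assumptions made for the example, not something Transposome or BLAST defines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_conflict(
    hits: Iterable[Tuple[str, str, float]],
    superfamily_of: Dict[str, str],
    min_fraction: float = 0.9,  # assumed cutoff, tune for your data
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Separate queries with one supported superfamily from conflicting ones.

    hits: (query, subject, bitscore) tuples, e.g. parsed from BLAST output.
    superfamily_of: maps subject names to their annotated superfamily.
    """
    by_query = defaultdict(list)
    for query, subject, bitscore in hits:
        family = superfamily_of.get(subject)
        if family is not None:
            by_query[query].append((bitscore, family))

    consistent: Dict[str, str] = {}
    conflicting: Dict[str, List[str]] = {}
    for query, scored in by_query.items():
        top = max(score for score, _ in scored)
        # Superfamilies supported by hits close to the best score.
        families = {fam for score, fam in scored if score >= min_fraction * top}
        if len(families) == 1:
            consistent[query] = next(iter(families))
        else:
            conflicting[query] = sorted(families)
    return consistent, conflicting
```

Sequences that land in the conflicting group are the candidates to discard or to keep only at the coarser level (e.g. LTR or DNA).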

sestaton commented 8 years ago

I meant to provide the wiki link for the repeat database format. Also, if you want suggestions or a script for mapping blast hits to create this format I would be happy to help.

sestaton commented 8 years ago

After re-reading your question I have a better understanding of what you are trying to do. It sounds like you are trying to use the reads to identify repeats from an assembly and RepeatScout, then use this with Transposome.

I would strongly advise against this approach. Assembly methods are highly biased against repeats, so this will generate artifacts, as will using RepeatScout, but for different reasons (discussed above). The main idea behind Transposome is to avoid these issues, so combining all approaches in this manner is not beneficial, and it is also circular. Also, Transposome is designed to work with raw reads, so any other input will likely lead to incorrect estimates of repeat abundance.

I recommend just using Transposome alone with RepBase, combined with repeat libraries from closely related species. I don't want to discourage using other programs, but I can say for certain that the specific approach we are discussing should not be used with Transposome. I am going to close this issue, but feel free to comment/discuss further, especially if you still want to create a custom library, for which I could add a script to the 'transposome-scripts' repo that would be of general help to everyone.

sestaton / Transposome

Library of custom repeats without superfamily name #33