Closed etwatson closed 8 years ago
The direct approach would be to BLAST a repeat library to your sequences and map the best matches to the superfamily for each sequence in your library. The repeat library format for Transposome follows the RepBase format, as below:
>GYPSY68-LTR_AG Gypsy Anopheles gambiae
>BEL1-I_AG BEL Anopheles gambiae
>BEL2-LTR_AG BEL Anopheles gambiae
>GYPSY1-LTR_AG Gypsy Anopheles gambiae
>Copia-7_AG-LTR Copia Anopheles gambiae
>MTANGA_I Copia Anopheles gambiae
>GYPSY32-LTR_AG Gypsy Anopheles gambiae
>PegasusA hAT Anopheles gambiae
>DNA-2_AG DNA transposon Anopheles gambiae
>Clu-15B_AG DNA transposon Anopheles gambiae
where you have ">repeat_name superfamily genus species" in the header. As you can see, some entries provide only "DNA transposon" because the repeats were not annotated to a finer level. That isn't going to cause problems, but finer-grained annotations like "Mutator" would give more insightful results, because "DNA transposon" could mean a lot of things: MuDR, Helitron, hAT, or something else.
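To make the header layout concrete, here is a minimal Python sketch of splitting such a header into its parts. It is not part of Transposome; the function name `parse_repeat_header` is made up for illustration. It assumes the fields are separated by whitespace as shown above (the real files may use tabs) and that the last two fields are always genus and species, so a two-word label like "DNA transposon" stays together.

```python
def parse_repeat_header(header):
    """Split '>repeat_name superfamily genus species' into its parts.

    Assumption: the last two whitespace-separated fields are genus and
    species, so a multi-word superfamily label stays intact.
    """
    fields = header.strip().lstrip(">").split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got: {header!r}")
    return {
        "name": fields[0],
        "superfamily": " ".join(fields[1:-2]),
        "genus": fields[-2],
        "species": fields[-1],
    }
```

Note that a header such as ">LTR_21782 LTR Tigriopus californicus" parses without error under this sketch; the parser only checks the layout, not whether "LTR" is a recognized superfamily name.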
I would use a combination of BLAST to find highly similar sequences (this should work well for the superfamily level), and protein and pHMM matches with HMMER. Likely, all you need is blastn, and if that does not provide ample evidence at the superfamily level then you may be trying to annotate artifacts, so be aware of that.
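As a rough illustration of the blastn step, the sketch below reads BLAST tabular output (`-outfmt 6`, where column 1 is the query ID, column 2 the subject ID, and column 12 the bit score) and assigns each query the superfamily of its best hits. The function name, the `subject_to_superfamily` lookup, and the `min_ratio` cutoff are all assumptions for illustration, not anything Transposome provides. A query whose near-best hits disagree on superfamily gets no call, which is one simple way to flag conflicting evidence.

```python
import csv
from collections import defaultdict


def assign_superfamilies(blast_lines, subject_to_superfamily, min_ratio=0.9):
    """Map each query to a superfamily from BLAST -outfmt 6 lines.

    subject_to_superfamily: dict of library repeat name -> superfamily.
    min_ratio: hypothetical cutoff; hits scoring at least this fraction
    of the best bit score are treated as equally good evidence.
    Returns a dict of query -> superfamily, or None when hits conflict.
    """
    hits = defaultdict(list)
    for row in csv.reader(blast_lines, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip comments and malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        superfamily = subject_to_superfamily.get(subject)
        if superfamily is not None:
            hits[query].append((bitscore, superfamily))

    calls = {}
    for query, scored in hits.items():
        best = max(score for score, _ in scored)
        near_best = {sf for score, sf in scored if score >= min_ratio * best}
        # Only make a call when all near-best hits agree.
        calls[query] = near_best.pop() if len(near_best) == 1 else None
    return calls
```

Queries with no hit to any annotated library sequence are simply absent from the result, so they can be handled separately from the conflicting ones.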
I may be able to annotate some of my TEs with BLAST, but there will likely be multiple good hits or other conflicting evidence. I already used PASTEC to classify my TEs, which uses blastn, tblastx, and blastp for similarity-based searches. It also uses HMMER for profile-based searches, in addition to other profile-based methods. So, some TEs have been identified at the class/subclass level, but not by the BLAST/similarity-based methods that would provide the superfamily classification.
These are TEs that RepeatMasker estimates cover 15.42% of the genome.
So, for these, I used headers like this (which do not work):
>LTR_21782 LTR Tigriopus californicus
>RNA_21848 RNA Tigriopus californicus
>DNA_22070 DNA Tigriopus californicus
>LINE_23703 LINE Tigriopus californicus
The repeat format used by Transposome follows the widely accepted "unified" schema used in this Nature paper. The format you posted is similar to RepBase but does not reflect any annotations (as I mentioned above, "DNA" is not descriptive enough to be useful). What you want is to start with a database of repeats from a closely related species, or you can create your own but this will only be as useful as the information you put into the analysis.
The superfamily level is conserved across orders of eukaryotic taxa, and I assure you that you can get much more specific than "LTR" or "DNA" with blastn/blastx, though you will find some conflicts and high levels of divergence. If you are seeing lots of conflicts at the superfamily level or higher, then I would discard those predictions as artifacts. I personally don't like RepeatScout because it builds "repeats" from k-mer similarities from all over the genome and tries to extend them. Therefore, you are not annotating a locus; rather, you are annotating an assembly of things that are similar in the genome. And, of course, you will find lots of conflicts at the annotation stage because the repeat contig is actually composed of many different things. I would be cautious about using this type of data for actual annotation. At the level you provide above (e.g. LTR or DNA), you can probably be confident in the predictions, but if you try to assign finer annotations you will be incorporating errors, and this may be why the tools used only provide that level of classification. For getting high-level estimates of composition, this works well though.
I meant to provide the wiki link for the repeat database format. Also, if you want suggestions or a script for mapping blast hits to create this format I would be happy to help.
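For a rough idea of what such a script could look like, here is a minimal Python sketch (not the script from the 'transposome-scripts' repo). It takes FASTA lines plus a dict of sequence ID to superfamily, however those calls were obtained, and rewrites each header as ">repeat_name superfamily genus species". The function name `relabel_fasta` and the `sep` parameter are assumptions for illustration; the separator defaults to a space as shown in this thread, so check the wiki for the exact delimiter. Sequences without a superfamily call are dropped rather than written with a vague label.

```python
def relabel_fasta(fasta_lines, calls, genus, species, sep=" "):
    """Yield FASTA lines with headers rewritten for a repeat library.

    calls: dict of sequence ID -> superfamily (None or missing = no call).
    Records without a call are skipped, header and sequence alike.
    """
    keep = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            superfamily = calls.get(seq_id)
            keep = superfamily is not None
            if keep:
                yield sep.join([">" + seq_id, superfamily, genus, species])
        elif keep:
            yield line
```

This could be used as `relabel_fasta(open("library.fasta"), calls, "Tigriopus", "californicus")`, where the file name is a placeholder.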
After re-reading your question, I have a better understanding of what you are trying to do. It sounds like you are trying to use the reads to identify repeats from an assembly and RepeatScout, then use this with Transposome.
I would strongly advise against this approach. Assembly methods are highly biased against repeats, so this will generate artifacts, as will using RepeatScout, but for different reasons (discussed above). The main idea behind Transposome is to avoid these issues, so combining all approaches in this manner is not beneficial, and it is also circular. Also, Transposome is designed to work with raw reads, so any other input will likely lead to incorrect estimates of repeat abundance.
I recommend just using Transposome alone with RepBase, combined with repeat libraries from closely related species. I don't want to discourage using other programs, but I can say for certain that the specific approach we are discussing should not be used with Transposome. I am going to close this issue, but feel free to comment/discuss further, especially if you still want to create a custom library, for which I could add a script to the 'transposome-scripts' repo that would be of general help to everyone.
I have assembled a repeat library for my organism from de novo assembly of reads as well as with RepeatScout. Many can be identified at the level of order by structural features, but have no/ambiguous similarities at the superfamily level.
It looks like Transposome requires all repeat libraries to belong to a preexisting superfamily (Typemap.pm), is that correct?
How might I be able to analyze the genome fraction of these TEs?