sestaton / Transposome

A toolkit for annotation of transposable element families from unassembled sequence reads
http://sestaton.github.io/Transposome
MIT License

How to format the repeat database for Transposome analysis #37

Closed pyrevo closed 7 years ago

pyrevo commented 7 years ago

Hi @sestaton, first of all thank you very much for this tool. I would like to do some analysis using Transposome, so after preparing my input data I ran it, and it finished without errors. The problem is that the summary file is empty. I'm working with the RepBase library for RepeatMasker (v20.05 – Release 20150807). Headers look something like this:

BRENSPM1#DNA/CMC-EnSpm @Brassica_rapa [S:] RepbaseID: BRENSPM1

I know you provide a script to format a FASTA file of repeat consensus sequences for Transposome compatibility (format_database.pl), so I tried running it on the RepBase library. During the run I get errors like these:

[ERROR]: Gypsy-9_CPB-LTR#LTR/Gypsy does not seem to match known TE superfamilies
[ERROR]: Harbinger-1_CPB#DNA/Harbinger does not seem to match known TE superfamilies
[ERROR]: hAT-10_CPB#DNA/hAT-Charlie does not seem to match known TE superfamilies

If I then run a new Transposome analysis with this reformatted library, only Gypsy LTR retrotransposons appear in the output summary file. Can you please help me understand how to format the headers of my repeat database correctly so that I can run a complete analysis with Transposome?

sestaton commented 7 years ago

Hi, thanks for the message. Unfortunately, it appears the RepBase format has changed. I will have to download this file and adjust the formatting script. I will do this today. In the meantime, you might want to try an older RepBase library (v18.01, for example) to see if you get the same results.

sestaton commented 7 years ago

Wait, I just noticed something. You said you downloaded the RepBase library for RepeatMasker, but that is formatted specifically for that tool. Please try to download the regular RepBase library and let me know the sequence ID format. If it looks like this:

>GYPSY68-LTR_AG Gypsy   Anopheles gambiae

Then you should be ready to use that directly with no formatting.
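To make the distinction between the header styles in this thread concrete, here is a minimal Python sketch that guesses which flavor a FASTA header belongs to. It is not part of Transposome; `header_style` is a hypothetical helper. It assumes the regular RepBase header is tab-delimited (ID, superfamily, species), which this page's rendering does not preserve reliably, and it assumes RepeatModeler IDs start with `rnd-`, based only on the examples quoted in this issue.

```python
def header_style(header: str) -> str:
    """Guess which repeat-library flavor a FASTA header line comes from."""
    line = header.lstrip(">").rstrip("\n")
    # The sequence ID is everything up to the first whitespace.
    seq_id = line.split(None, 1)[0] if line.strip() else ""
    if "#" in seq_id:
        # RepeatMasker and RepeatModeler libraries both embed
        # "#Class/Superfamily" in the ID. The "rnd-" prefix test is an
        # assumption taken from the examples in this thread.
        return "repeatmodeler" if seq_id.startswith("rnd-") else "repeatmasker"
    # Assumption: regular RepBase headers are ID<TAB>superfamily<TAB>species.
    if len(line.split("\t")) == 3:
        return "repbase"
    return "unknown"


if __name__ == "__main__":
    print(header_style("BRENSPM1#DNA/CMC-EnSpm @Brassica_rapa [S:] RepbaseID: BRENSPM1"))
    print(header_style(">GYPSY68-LTR_AG\tGypsy\tAnopheles gambiae"))
```

Only headers reported as `repbase` would be expected to work without running format_database.pl first.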

pyrevo commented 7 years ago

Hi, thank you for the support! I have checked the regular RepBase library and the headers look exactly like your example. I have also run several analyses, and the outputs are now as expected.

Any advice about using de novo libraries from RepeatModeler? The headers look like this:

rnd-2_family-72#DNA/CMC-EnSpm ( Recon Family Size = 136, Final Multiple Alignment Size = 95 )

I suppose that formatting the headers as you have shown me could do the trick. Thank you again.

sestaton commented 7 years ago

I'm glad the format hasn't changed! Let me know if you have any questions about the output.

I'm not sure what would be gained by using RepeatModeler over RepBase, but the approach would be to format it the same way. My only concern is that it might introduce RepeatModeler artifacts. If you do need this, though, you can file another issue and I'll mark it as a feature request for supporting the format.

pyrevo commented 7 years ago

Yeah, I didn't think about it! Ok, I'll do some tests and let you know. Thank you very much for your kind help, really appreciate it.

sestaton commented 7 years ago

Hi,

Please note that I've updated the format_database.pl script and it should work with this data format now. If there are issues, please let me know.

Thanks, Evan

pyrevo commented 7 years ago

Hi @sestaton,

I would like to thank you very much for your support. I'm trying to run some tests formatting my RepeatModeler database with the new script, and I get this error:

Can't locate object method "map_superfamily_name" via package "Transposome::Annotation" at format_database.pl line 63, line 1.

Transposome and its dependencies are working correctly on the machine on which I ran it.

Thank you, Massimiliano

sestaton commented 7 years ago

Hi Massimiliano,

You will need to install the latest version from the master branch. Sorry, I didn't mention that. This is a method I just added yesterday.

Let me know if there are any issues.

Thanks, Evan

pyrevo commented 7 years ago

I'm sorry about that, and thank you for pointing it out.

I have updated my installation and now the script works. However, when I try to convert the RepeatModeler library, I get warnings like these:

[WARNING]: Could not get 3-letter code from: =/1. Skipping.
[WARNING]: =/1 does not seem to match known TE superfamilies
...

and only unknown elements appear in the parsed library, with headers like these:

>=/1    Unknown_repeat  genus species
>=/5    Unknown_repeat  genus species
...

I don't know if it helps, but I ran RepeatModeler with these commands:

./BuildDatabase -name my_species -engine ncbi my_species.fa
./RepeatModeler -engine ncbi -pa 5 -database my_species

Finally, the headers from my RepeatModeler library look like this:

>rnd-6_family-1211#Unknown ( Recon Family Size = 18, Final Multiple Alignment Size = 15 )
>rnd-6_family-158#RC/Helitron ( Recon Family Size = 16, Final Multiple Alignment Size = 15 )

Thank you, Massimiliano

sestaton commented 7 years ago

I believe there are two issues here. 1) The sequence ID is not being parsed correctly because the code assumed it would look like Illumina data. This is a bug. 2) The format of the repeat library appears to be different from what was expected based on another issue.

I'm working on fixing both of these issues and it should be done real soon.

sestaton commented 7 years ago

Okay, to test the changes, update Transposome, download the latest version of the formatting script, and please try again. I did a small test on what you provided and I think this should solve the problem. Let me know whether the parsing issue is resolved, and also the classification problem (they should not all be 'unknown').
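One quick way to check the classification problem is to tally the superfamily field across the formatted library. The Python sketch below is an illustration only, not part of Transposome; `superfamily_counts` is a hypothetical helper, and it assumes the formatted headers are tab-delimited with the superfamily in the second field.

```python
from collections import Counter


def superfamily_counts(lines):
    """Count the superfamily field over all FASTA header lines."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line[1:].rstrip("\n").split("\t")
        # Assumption: the second tab-separated field is the superfamily.
        superfamily = fields[1] if len(fields) > 1 else "MISSING"
        counts[superfamily] += 1
    return counts


if __name__ == "__main__":
    # Headers shaped like the failing output reported earlier in the thread.
    broken = [
        ">=/1\tUnknown_repeat\tgenus species",
        "ACGTACGT",
        ">=/5\tUnknown_repeat\tgenus species",
    ]
    for name, n in superfamily_counts(broken).most_common():
        print(name, n)
```

If a single label such as `Unknown_repeat` accounts for every header, the library was most likely not parsed as intended. To run it on a real file, pass an open file handle as `lines`.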

pyrevo commented 7 years ago

The format_database.pl script now works like a charm with my RepeatModeler library!

These are examples of headers obtained after the parsing:

>rnd_2_family_34#Simple_repeat  Simple_repeat   genus species
>rnd_2_family_72#DNA/CMC_EnSpm  CMC_EnSpm   genus species
>rnd_2_family_146#SINE/tRNA tRNA    genus species
>rnd_2_family_41#Unknown    Unknown genus species
>rnd_2_family_256#LINE/RTE_BovB RTE_BovB    genus species

I'll do some tests running Transposome and I'll let you know soon.

Thanks for your support, Massimiliano
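For readers who want to see the transformation spelled out, the following Python sketch reproduces the shape of the headers shown above from raw RepeatModeler headers. It is inferred only from the examples in this thread and is not the logic of format_database.pl; `reformat_header` is a hypothetical name, the tab delimiter is an assumption, and `genus species` is the placeholder seen in the output above.

```python
def reformat_header(header, species="genus species"):
    """Turn a RepeatModeler header into an ID<TAB>superfamily<TAB>species header."""
    # Keep only the ID; this drops ">" and the "( Recon Family Size = ... )" comment.
    seq_id = header.lstrip(">").split(None, 1)[0]
    # The classification follows "#", e.g. "DNA/CMC-EnSpm" or "Unknown".
    _, _, classification = seq_id.partition("#")
    # Use the part after "/" as the superfamily, or the whole class if there is no "/".
    superfamily = classification.rpartition("/")[2] or "Unknown"
    # Hyphens become underscores, as in the parsed headers quoted in this thread.
    new_id = seq_id.replace("-", "_")
    return ">" + "\t".join([new_id, superfamily.replace("-", "_"), species])


if __name__ == "__main__":
    raw = ("rnd-2_family-72#DNA/CMC-EnSpm ( Recon Family Size = 136, "
           "Final Multiple Alignment Size = 95 )")
    print(reformat_header(raw))
```

The function expects a non-empty header line; for real conversions the updated format_database.pl script remains the supported route.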

pyrevo commented 7 years ago

Hi @sestaton,

I'm sorry for the delay, but I have found the time to run some analyses to test the new features.

I have run four analyses with two different datasets, using the standard RepBase database and the formatted RepeatModeler database obtained through the new format_database.pl script. So I have:

interleaved_100K.fastq.gz + RepBase22.04.fa
interleaved_100K.fastq.gz + formatted_Consensi.fa.classified
interleaved_1M.fastq.gz + RepBase22.04.fa
interleaved_1M.fastq.gz + formatted_Consensi.fa.classified

The program works very well with both libraries, so I think the issue is solved.

Thank you very much for your kind support and for this very useful tool. Sorry again for the delay,

Massimiliano.

sestaton commented 7 years ago

No worries on the delay, and thank you for reporting back! I'm glad it works and it helps to know it is resolved.

Feel free to comment again or open a new issue if there are other questions.