Closed pyrevo closed 7 years ago
Hi, thanks for the message. Unfortunately, it appears the RepBase format has changed. I will have to download this file and adjust the formatting script. I will do this today. In the meantime, you might want to try an older RepBase library (v18.01, for example) to see if you get the same results.
Wait, I just noticed something. You said you downloaded the RepBase library for RepeatMasker, but that is formatted specifically for that tool. Please try downloading the regular RepBase library and let me know the sequence ID format. If it looks like this:
>GYPSY68-LTR_AG Gypsy Anopheles gambiae
Then you should be ready to use that directly with no formatting.
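For anyone who wants to check a library before running an analysis, here is a minimal sketch in Python. It assumes the expected header is four whitespace-separated fields (ID, superfamily, genus, species), which is inferred only from the example above and is not an official specification of what Transposome accepts. The names `parse_header` and `check_headers` are hypothetical helpers, not part of Transposome.

```python
import re

# Assumed pattern, inferred only from the example header in this thread:
# ">ID superfamily genus species" (four whitespace-separated fields).
HEADER_RE = re.compile(r"^>(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_header(line):
    """Return (seq_id, superfamily, genus, species), or None if the line does not fit."""
    match = HEADER_RE.match(line.rstrip("\n"))
    return match.groups() if match else None


def check_headers(lines):
    """Return the FASTA header lines that do not fit the assumed four-field pattern."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.startswith(">") and parse_header(line) is None
    ]
```

Passing an open FASTA file handle to `check_headers` would list every header that deviates from the example format; an empty list suggests the library can probably be used as-is.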
Hi, thank you for the support! I have checked the regular RepBase library and the headers look exactly like your example. I have also run several analyses and I am getting perfect output now.
Any advice about using de novo libraries from RepeatModeler? The headers look like this:
rnd-2_family-72#DNA/CMC-EnSpm ( Recon Family Size = 136, Final Multiple Alignment Size = 95 )
I suppose that formatting the headers as you have shown me could do the trick. Thank you again.
I'm glad the format hasn't changed! Let me know if you have any questions about the output.
I'm not sure what would be gained by using RepeatModeler over RepBase, but the approach would be to format it the same way. My only concern is that this might introduce artifacts from RepeatModeler. Still, if you have a need for this, you can file another issue and I'll mark it as a feature request for supporting the format.
Yeah, I didn't think about it! Ok, I'll do some tests and let you know. Thank you very much for your kind help, really appreciate it.
Hi,
Please note that I've updated the format_database.pl script and it should work with this data format now. If there are issues, please let me know.
Thanks, Evan
Hi @sestaton,
I would like to thank you very much for your support. I'm trying to test formatting my RepeatModeler database with the new script and I get this error:
Can't locate object method "map_superfamily_name" via package "Transposome::Annotation" at format_database.pl line 63,
line 1.
Transposome and its dependencies are working correctly on the machine on which I ran it.
Thank you, Massimiliano
Hi Massimiliano,
You will need to install the latest version from the master branch. Sorry, I didn't mention that. This is a method I just added yesterday.
Let me know if there are any issues.
Thanks, Evan
I'm sorry about that, and thank you for pointing it out.
I have updated my installation and now the script works. However, when I try to convert the RepeatModeler library, I get warnings like these:
[WARNING]: Could not get 3-letter code from: =/1. Skipping.
[WARNING]: =/1 does not seem to match known TE superfamilies
...
and only unknown elements appear in the parsed library, with headers like these:
>=/1 Unknown_repeat genus species
>=/5 Unknown_repeat genus species
...
I don't know if it helps, but I ran RepeatModeler with these commands:
./BuildDatabase -name my_species -engine ncbi my_species.fa
./RepeatModeler -engine ncbi -pa 5 -database my_species
Finally, the headers in my RepeatModeler library look like this:
>rnd-6_family-1211#Unknown ( Recon Family Size = 18, Final Multiple Alignment Size = 15 )
>rnd-6_family-158#RC/Helitron ( Recon Family Size = 16, Final Multiple Alignment Size = 15 )
Thank you, Massimiliano
I believe there are two issues here. 1) The sequence ID format is not being parsed correctly because the assumption was that it would look like Illumina data. This is a bug. 2) The format of the repeat library appears to be different from what was expected based on another issue.
I'm working on fixing both of these issues and it should be done very soon.
Okay, to test the changes, please update Transposome, download the latest version of that formatting script, and try again. I did a small test on what you provided and I think this should solve the problem. Let me know if the parsing issue is resolved, and also the classification problem (they should not all be 'unknown').
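One quick way to check the classification problem is to tally the labels in the formatted library. The following is a minimal Python sketch, assuming the second whitespace-separated field of each header carries the superfamily label, as in the headers posted in this thread; `superfamily_counts` is a hypothetical helper, not part of Transposome.

```python
from collections import Counter


def superfamily_counts(lines):
    """Count how often each superfamily label occurs among FASTA header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Assumption: the second field is the superfamily label.
            label = fields[1] if len(fields) > 1 else "(missing)"
            counts[label] += 1
    return counts
```

If every header ends up under a single label such as `Unknown_repeat`, the classification step has most likely failed; a mix of labels suggests it worked.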
The format_database.pl script now works like a charm with my RepeatModeler library!
These are examples of the headers obtained after parsing:
>rnd_2_family_34#Simple_repeat Simple_repeat genus species
>rnd_2_family_72#DNA/CMC_EnSpm CMC_EnSpm genus species
>rnd_2_family_146#SINE/tRNA tRNA genus species
>rnd_2_family_41#Unknown Unknown genus species
>rnd_2_family_256#LINE/RTE_BovB RTE_BovB genus species
I'll do some tests running Transposome and I'll let you know soon.
Thanks for your support, Massimiliano
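For readers curious about the header transformation, here is a minimal Python sketch that reproduces the mapping seen in the examples above. It is inferred only from the headers in this thread and is not the actual logic of format_database.pl. The function name `reformat_header`, the `genus species` placeholders, and the fallback to `Unknown` when no `#` is present are all assumptions.

```python
def reformat_header(line, genus="genus", species="species"):
    """Convert one RepeatModeler-style header to the four-field style shown above.

    Expects a non-empty header line such as
    ">rnd-2_family-72#DNA/CMC-EnSpm ( Recon Family Size = 136, ... )".
    """
    # Keep only the first token, which drops the "( Recon Family Size ... )" part.
    seq_id = line.lstrip(">").split()[0]
    # Hyphens become underscores, as in the converted headers in this thread.
    seq_id = seq_id.replace("-", "_")
    # The text after '#' is the classification, e.g. "DNA/CMC_EnSpm" or "Unknown".
    # Assumption: headers without '#' are treated as "Unknown".
    classification = seq_id.split("#", 1)[1] if "#" in seq_id else "Unknown"
    # Use the part after '/' as the superfamily; otherwise the whole label.
    superfamily = classification.split("/", 1)[-1]
    return f">{seq_id} {superfamily} {genus} {species}"
```

Real genus and species names can be passed in place of the placeholders if they are known.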
Hi @sestaton,
I'm sorry for the delay, but I have finally found the time to run some analyses to test the new features.
I ran four analyses on two different datasets, using the standard RepBase database and the formatted RepeatModeler database obtained with the new format_database.pl script. So I have:
interleaved_100K.fastq.gz + RepBase22.04.fa
interleaved_100K.fastq.gz + formatted_Consensi.fa.classified
interleaved_1M.fastq.gz + RepBase22.04.fa
interleaved_1M.fastq.gz + formatted_Consensi.fa.classified
The program works very well with both libraries, so I think the issue is solved.
Thank you very much for your kind support and for this very useful tool. Sorry again for the delay,
Massimiliano.
No worries on the delay, and thank you for reporting back! I'm glad it works and it helps to know it is resolved.
Feel free to comment again or open a new issue if there are other questions.
Hi @sestaton, first of all, thank you very much for this tool. I would like to do some analyses using Transposome, so after preparing my input data I ran it, and it finished without errors. The problem is that the summary file is empty. I'm working with the RepBase library for RepeatMasker (v20.05 – Release 20150807). The headers look something like this:
I know you provide a script to format a FASTA file of repeat consensus sequences for Transposome compatibility (format_database.pl), so I tried to run it on the RepBase library. While the script runs, I get errors like these:
If I run a new Transposome analysis using this new library, only Gypsy LTR retrotransposons are present in the output summary file. Can you please help me understand how to correctly format the headers of my repeat database in order to run a complete analysis with Transposome?