output files explanation

DiegoZavallo commented 5 years ago

Hi Evan, I ran tephra with: nohup tephra all -c tephra_config.yml & since I want all types of TEs. I used a repbase.fasta from Solanum tuberosum for the repeatdb and left the other databases at their default options. I'm having trouble understanding the different output files, especially regarding the LTRs. Could you explain the difference between "elements" and "singleton"? Why are they in separate files? What are the 2336 families, and why don't they match the elements or singletons? This is from the log (nano nohup.out):

INFO - Results - Total number of Gypsy elements: 5003
...
INFO - Results - Number of Gypsy families: 2336
INFO - Results - Number of Gypsy elements in families: 2307
INFO - Results - Number of Gypsy singleton families/elements: 2696
INFO - Results - Number of Gypsy elements (for debugging): 5003

If I want to use this information to search for copies in the genome, which set should I use for the blastn search? Is one of these the consensus members of the different families?

Another issue is that I got 0 non-LTR matches: no LINEs and no SINEs.

INFO - Command - 'tephra findnonltrs' started at: 05-11-2018 18:20:12.
[WARNING]: No non-LTR elements were found on the forward strand. Will search reverse strand.
[WARNING]: No non-LTR elements were found on the reverse strand.
[WARNING]: No non-LTR elements were found so none will be reported.
INFO - Command - 'tephra findnonltrs' completed at: 05-11-2018 22:27:29.

I know for a fact that there are SINEs and LINEs; in fact, the Repbase library from GIRI that I used contains LINEs and SINEs. Why do you think that happened?

Best

Diego

sestaton commented 5 years ago

Hi Diego,

I understand the confusion because I have seen the same issue. You are correct, there should be non-LTR elements reported for common plant genomes.

The non-LTR finding method is based on the program MGEScan-nonLTR, which I re-wrote almost entirely because I had so much trouble interpreting the results or even getting it to work. Unfortunately, it appears that the issues are not completely resolved.

This is a difficult one, but it is high on my list to resolve ASAP. I'll file this as a bug and get back to this issue when I can.

Thanks for the report, Evan

DiegoZavallo commented 5 years ago

Thanks Evan for your quick response. And how about the first question regarding the LTR output files? (Maybe I should have written two separate issues.)

sestaton commented 5 years ago

Sorry, I missed this part of the question.

"Families" here are multi-copy groups (2336 in this report) and "singletons" (2696 in this report) are elements not grouped with any family. For most plant genomes you see a distribution where the most common family size is 1, meaning there are many small families and a few larger ones (I'll update with a reference).

So, if you add the "elements in families" to the "singletons" you'll get the total number:

2307 + 2696 = 5003

The last line is just a check (mostly for me) to make sure the numbers are correct after all the classification and annotation steps are complete.
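
The same check can be reproduced from the log itself. Below is a minimal Python sketch, assuming the log lines follow the `INFO - Results - <label>: <number>` pattern shown in this thread. The function names `parse_results` and `totals_agree` are illustrative and are not part of tephra.

```python
import re

# Matches lines such as "INFO - Results - Number of Gypsy families: 2336".
RESULT_LINE = re.compile(r"INFO - Results - (?P<label>.+?):\s*(?P<count>\d+)\s*$")

def parse_results(log_text):
    """Return a dict mapping each result label to its integer count."""
    counts = {}
    for line in log_text.splitlines():
        match = RESULT_LINE.search(line)
        if match:
            counts[match.group("label")] = int(match.group("count"))
    return counts

def totals_agree(counts, superfamily):
    """Check that elements in families plus singletons equals the total."""
    in_families = counts[f"Number of {superfamily} elements in families"]
    singletons = counts[f"Number of {superfamily} singleton families/elements"]
    total = counts[f"Total number of {superfamily} elements"]
    return in_families + singletons == total

# Log lines copied from the report in this thread.
log = """\
INFO - Results - Total number of Gypsy elements: 5003
INFO - Results - Number of Gypsy families: 2336
INFO - Results - Number of Gypsy elements in families: 2307
INFO - Results - Number of Gypsy singleton families/elements: 2696
"""
counts = parse_results(log)
print(totals_agree(counts, "Gypsy"))  # True, since 2307 + 2696 == 5003
```

Passing a different superfamily name to `totals_agree` works the same way, provided the log uses the same label wording for that superfamily.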

BTW, this will be documented in detail in the manuscript to make it clear what all of these numbers are and how they are derived, and this is a good reminder so thanks for asking.

DiegoZavallo commented 5 years ago

I see... but what I don't understand is why there are more multi-copy families than "elements in families": 2336 > 2307. If each of the 2336 families needs at least 2 elements so as not to be considered a singleton, the "elements in families" should be at least 4672 (2336*2), right? Or am I thinking about it wrong? In fact, for the other TE types, the "elements in families" counts are all more than twice the family counts:

INFO - Results - Number of Helitron families: 253
INFO - Results - Number of Helitron elements in families: 586
...
INFO - Results - Number of Mutator families: 131
INFO - Results - Number of Mutator elements in families: 412
...
INFO - Results - Number of hAT families: 4
INFO - Results - Number of hAT elements in families: 14
...
INFO - Results - Number of MITE families: 104
INFO - Results - Number of MITE elements in families: 707
...
INFO - Results - Number of Tc1-Mariner families: 102
INFO - Results - Number of Tc1-Mariner elements in families: 228

Could there be something wrong with the LTRs specifically? The same happens with the Copia LTRs...
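
The reasoning above (every multi-copy family needs at least two members, so "elements in families" must be at least twice the number of families) can be written as a quick sanity check. This is only a sketch over the numbers quoted in this thread. The function name `undersized` is illustrative and is not part of tephra.

```python
def undersized(report):
    """Return superfamilies whose in-family element count is impossible.

    A multi-copy family has at least two members, so the number of
    elements in families must be at least 2 * the number of families.
    """
    return [name for name, (families, in_families) in report.items()
            if in_families < 2 * families]

# (families, elements in families), copied from the log lines in this thread.
report = {
    "Gypsy": (2336, 2307),
    "Helitron": (253, 586),
    "Mutator": (131, 412),
    "hAT": (4, 14),
    "MITE": (104, 707),
    "Tc1-Mariner": (102, 228),
}
print(undersized(report))  # ['Gypsy']
```

Only the Gypsy numbers fail the check, which matches the observation that the problem is specific to the LTR counts.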

And also...

BTW, this will be documented in detail in the manuscript to make it clear what all of these numbers are and how they are derived, and this is a good reminder so thanks for asking.

I'm looking forward to seeing your paper! You did excellent work compiling several tools to classify and annotate TEs! We recently published (a month ago) a MITE discovery tool called MITE Tracker (https://www.ncbi.nlm.nih.gov/pubmed/30285604), which gave better results than the other MITE discovery tools available and can work with large genomes. I noticed that you didn't incorporate a specific tool for that type of TE. From the tephra findtirs command:

Mark short elements with no coding potential as MITEs

If you are interested, you are more than welcome to look at it and incorporate it into tephra, and to ask us any questions regarding the algorithm we used.

Best

Diego

sestaton commented 5 years ago

Thank you for the clarification, I misunderstood the real issue. Indeed, this is a bug. I did some testing yesterday and I can recreate this issue in some cases. I believe this is a logging problem and not a problem with the annotations, but I should be able to resolve the issue today and make a new release.

Thank you for the reference for MITE Tracker! I will take a look for sure, this work should be very helpful for this project.

DiegoZavallo commented 5 years ago

Hi Evan, I've been comparing the output files with the log and I found other inconsistencies in the TE counts. For instance, in the Tc1-Mariner TIRs folder the DTT_singletons.fasta file has 1284 sequences while the log counts 1315. The same happens with DTT_families.fasta, which has 259 sequences while the log counts 228. However, the sums agree: 1543 in both cases.

Should I wait for the new release and run everything again? Or, if I want the sequences (classified into the different families) to find more copies in the genome by blasting, can I just cat both files and use that?
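
For reference, the file-versus-log comparison and the "cat both files" step can be scripted. The sketch below counts FASTA records by their `>` header lines and concatenates files. The file names follow the ones mentioned above, but the sequences are toy data and the function names are illustrative. Whether the combined file is the right query set for the copy search is a question for the maintainer, not something this sketch settles.

```python
import tempfile
from pathlib import Path

def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def concatenate(paths, out_path):
    """Stream several FASTA files into one, keeping every line newline-terminated."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    out.write(line if line.endswith("\n") else line + "\n")

# Toy demonstration with made-up sequences in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    families = Path(tmp) / "DTT_families.fasta"
    singletons = Path(tmp) / "DTT_singletons.fasta"
    combined = Path(tmp) / "DTT_combined.fasta"  # hypothetical output name
    families.write_text(">fam0_elem1\nACGT\n>fam0_elem2\nACGA")  # no final newline
    singletons.write_text(">single1\nTTGA\n")
    concatenate([families, singletons], combined)
    n_families = count_fasta_records(families)
    n_singletons = count_fasta_records(singletons)
    n_combined = count_fasta_records(combined)

print(n_families, n_singletons, n_combined)  # 2 1 3
```

Comparing `n_families` and `n_singletons` against the corresponding log lines shows whether a mismatch is in the files or only in the logging.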

Thank you for the reference for MITE Tracker! I will take a look for sure, this work should be very helpful for this project

Cool! Let us know if you have any questions.

DiegoZavallo commented 5 years ago

Sorry to bother you again, but where can I find, or how can I filter out, the sequences from the unclassified LTRs that have protein domain matches?

INFO - Results - Number of unclassified LTR-RT elements with protein matches: 2999

Because you also have these:

INFO - Results - Number of unclassified elements (for debugging): 19638

But I think I want to use the classified ones plus the unclassified ones with protein matches for the copy search with blastn, and I don't know how to filter them from the potato_dm_v404_all_pm_un_tephra_ltrs_trims_unclassified_complete.fasta file.
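
The thread does not say how tephra marks the elements with protein matches, so the following is only a generic sketch. It assumes a set of wanted element IDs has already been obtained by some other means, and it keeps the FASTA records whose ID (the first word of the header) is in that set. The IDs and the function name `filter_fasta` are hypothetical.

```python
def filter_fasta(lines, wanted_ids):
    """Yield the lines of FASTA records whose ID is in wanted_ids.

    The ID is taken to be the first whitespace-delimited word after '>'.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted_ids
        if keep:
            yield line

# Toy records; real IDs would come from the unclassified FASTA file.
fasta = [
    ">elem1 some description\n", "ACGT\n", "GGCC\n",
    ">elem2\n", "TTAA\n",
    ">elem3\n", "CCGG\n",
]
wanted = {"elem1", "elem3"}  # hypothetical IDs with protein matches
kept = list(filter_fasta(fasta, wanted))
print("".join(kept), end="")
```

With a real file, an open file handle can be passed as `lines`, so the file is streamed rather than loaded into memory.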

sestaton commented 5 years ago

Thank you for your patience. I have made a new release today (v0.12.3) that should address all the issues with the family/element numbers (new features have also been added, but the usage is the same).

Concerning the issue with non-LTRs, I do not think this is a bug because it works fine for some species, such as Arabidopsis. I believe the problem is that elements in some species have too high a divergence from the models being used. This will take more research, and I'll likely create a new issue for that.

Please let me know if the other questions/issues have been resolved with the latest changes.

sestaton commented 5 years ago

I believe all of the issues described above have been resolved in v0.12.4 (specifically, the issue with the non-LTRs).

Please let me know if there are any more questions or if I missed something. I'll leave this one open for a while or until I get a response. Thanks.

DiegoZavallo commented 5 years ago

Hi Evan! Thanks for your reply. I'll definitely try out the new version as soon as I can and contact you with the results.

Thanks again, cheers

Diego

sestaton / tephra

output files explanation #33