soloLTR discovery and reporting

oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/

GNU General Public License v3.0

188 stars 40 forks source link

soloLTR discovery and reporting #41

Closed anandksrao closed 5 years ago

anandksrao commented 5 years ago

Hi Shujun,

I've completed by LTRharvest run, and close to completing the LTR_FINDER run as well.

In terms of solo LTR discovery and reporting, of all the Perl scripts listed under LTR_retriever/bin/, I can recognize only two Perl scripts that may be relevant: solo_finder.pl, and solo_intact_ratio.pl

From header information inside this script, it looks like solo_finder.pl requires RepeatMasker results as input. Am I correct? Or can either or both of my LTRharvest / LTR_FINDER results be used by solo_finder.pl?

And a slightly different but related question: Does LTR-retriever report partial length, i.e. truncated LTR-RTs that are neither solo LTRs, nor full-length LTRs, but in-betweens?

Thanks!

oushujun commented 5 years ago

Hi Anand,

The solo_finder.pl would be the script to find solo LTRs from RepeatMasker .out files. perl this_script.pl RepeatMasker.out > solo_list

The script solo_intact_ratio.pl is a script to calculate the solo:intact ratio.

perl this_script.pl solo_count intact_count > solo_intact_ratio

The two inputs are two lists of counts in the format: element1_id count element2_id count ...

You may need to reannotate them with the LTR library so that they match the same name in the same family.

Best, Shujun

On Thu, Mar 14, 2019, 8:47 AM anandksrao notifications@github.com wrote:

Hi Shujun,

I've completed by LTRharvest run, and close to completing the LTR_FINDER run as well.

In terms of solo LTR discovery and reporting, of all the Perl scripts listed under LTR_retriever/bin/, I can recognize only two Perl scripts that may be relevant: solo_finder.pl, and solo_intact_ratio.pl

From header information inside this script, it looks like solo_finder.pl requires RepeatMasker results as input. Am I correct? Or can either or both of my LTRharvest / LTR_FINDER results be used by solo_finder.pl?

And a slightly different but related question: Does LTR-retriever report partial length, i.e. truncated LTR-RTs that are neither solo LTRs, nor full-length LTRs, but in-betweens?

Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/41, or mute the thread https://github.com/notifications/unsubscribe-auth/AFt-NOHz-lXFrgTc5MmqYih0IngLhvldks5vWlMDgaJpZM4b0L9j .

anandksrao commented 5 years ago

A few follow-up questions:

1. Will LTR_retriever directly generate the "intact_count" file, or will it be post-processing of some output file?

2. Will solo_count be identical to solo_list generated by solo_finder.pl, or require some post-processing?

3. You said "

You may need to reannotate them with the LTR library so that they match the same name in the same family.

Them = ? LTR family = info in which file, generated by which tool?

Thanks!

oushujun commented 5 years ago

For intact list, you can use the intact_finder_coarse.pl to find LTR elements with a roughtly complete structure (LTR-Internal-LTR): perl this_script.pl RepeatMasker.out > intact_list

In this way you don't need to reannotate the pass.list to have synchrinized names.

The output of these two lists are all elements found in the genome, you need to count the occurrance of each unique entries by yourself then feed them to the ratio script.

Shujun

On Thu, Mar 14, 2019, 9:39 AM anandksrao notifications@github.com wrote:

A few follow-up questions:

1. Will LTR_retriever directly generate the "intact_count" file, or will it be post-processing of some output file?

2. Will solo_count be identical to solo_list generated by solo_finder.pl, or require some post-processing?

3. You said "

You may need to reannotate them with the LTR library so that they match the same name in the same family.

Them = ? LTR family = info in which file, generated by which tool?

Thanks!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/41#issuecomment-472891734, or mute the thread https://github.com/notifications/unsubscribe-auth/AFt-NLLVzmsQ-MBN4tlZkSUZqaVmv0I1ks5vWl8igaJpZM4b0L9j .

oushujun commented 5 years ago

Hi Anand,

I am closing this thread for now. Please re-open it or just comment below if you have further questions.

Best, Shujun

oushujun commented 5 years ago

Dear Shujun,

For identifying solo LTRs you'd referred me to a couple of scripts at LTR_retriever (1, 2), which in turn requires RepeatMasker results. And which in turn uses RepeatModeler results. Correct?

I am having just a little bit of trouble with RepeatModeler, and tried first at Robert Hubley's Github page and also on biostars, but no help from either place. So I am resorting to your help, sorry for any bother....

In the past, for my research, I've used RepeatMasker. If I remember correctly, I would specify -species and this would, I think, automatically call existing repeat sequences pre-packaged with RepeatMasker and from RepBaseUpdate download, does that sound about right?

This is the 1st time I am running RepeatModeler. And using instructions at http://www.repeatmasker.org/RepeatModeler/, I see some differences between my observed and expected results, based on explanation under Example Run and 3. Interpret the Results, which says: " At the completion of the run two files are generated:

-families.fa : Consensus sequences -families.stk : Seed alignments" I tried a test run with TAIR10 Arabidopsis thaliana Chromosome #1 (not the whole genome) My syntax was: Step 1 - BuildDatabase -engine ncbi -name Ath_Chr1 Ath_Chr1.fasta Resulting in the database files: ls Ath_Chr1.fasta Ath_Chr1.nin Ath_Chr1.nni Ath_Chr1.nsq Ath_Chr1.nhr Ath_Chr1.nnd Ath_Chr1.nog Ath_Chr1.translation Step 2 - nohup RepeatModeler -pa 11 -engine ncbi -database Ath_Chr1 >& run.out & Resulting in ls Ath_Chr1.fasta Ath_Chr1.nin Ath_Chr1.nni Ath_Chr1.nsq RM_35571.FriMar220438262019 unaligned.fa Ath_Chr1.nhr Ath_Chr1.nnd Ath_Chr1.nog Ath_Chr1.translation run.out Inside the output directory: ls consensi.fa consensi.fa.classified consensi.fa.masked round-1 round-2 round-3 round-4 Problem: I DO NOT see any output files with the expected filenames: Ath_Chr1-families.fa : Consensus sequences Ath_Chr1-families.stk : Seed alignments Am I misunderstanding something? Also in the run.out file, I see one line of error that reads as follows: grep -i error run.out NCBIBlastSearchEngine::search: Error...query (/home/aksrao/PhytozomeV12/RepeatModeler/TEST2/RM_35571.FriMar220438262019/round-1/family-104.fa) does not exist! If you can please help me out I would be really happy, because I am stuck at this step. To help you check whether my run completed OK, and the output files are OK to feed to RepeatMasker, I am attaching 5 output files in a zip folder here. Those 5 files are: run.out consensi.fa.classified consensi.fa.masked consensi.fa unaligned.fa From these bits of info, are you able to tell whether I should repeat this run, and what I should change to obtain the expected results? Once again, so sorry to bother - I wish Robert Hubley responded as quickly and thoroughly as you, and had a nice user manual like you have for LTR_retriever... Thanks a TON. Sincerely, Anand

oushujun commented 5 years ago

Hi @anandksrao ,

I paste your email in this thread so that other users may benefit from this coversation.

Sorry for any confusions regarding the usage of the aformentioned scripts to identify solo and complete LTR elements and calculate their ratio.

For the RepeatMasker.out file, I was referring to the whole-genome LTR annotation generated by LTR_retriever. It will be named as your_genome.fa.out. It does not need any of RepeatModeler output. For questions regarding the usage of RepeatModeler, plese directed them to Mr. Robert Hubley (https://systemsbiology.org/bio/robert-hubley/) and his team.

If you used the -noanno parameter when running LTR_retriever and want to obtain the RepeatMasker.out file, you can just run RepeatMasker again:

RepeatMasker -e ncbi -pa $threads -q -no_is -norna -nolow -div 40 -lib $genome.LTRlib.fa -cutoff 225 $genome

Please let me know if you have further questions.

Best, Shujun

anandksrao commented 5 years ago

Thank you for your earlier post clarifying I did not need to run RepeatModeler and RepeatMasker.

I just completed a test run of LTR_retriever, and want to return to my original goal in this thread - solo LTR reporting as gff3 file.

Using the following 4 input files: Test.fa # test genome Test.LTRharvest.scn # generated with LTRharvest Test.LTRharvest.nonTGCA.scn # also generated with LTRharvest, for non-canonical LTR-RTs Test.LTRfinder.scn # generated using LTR_FINDER

and with the following syntax: LTR_retriever -genome Test.fa -infinder Test.LTRfinder.scn -inharvest Test.LTRharvest.scn -nonTGCA Test.LTRharvest.nonTGCA.scn

After ~ 3 hr run, on a ~ 300MB test genome, I appear to have generated 68 output files. 1. Does this list of output files look complete? Tpases020812LINE.842717 Tpases020812DNA.842717 alluniRefprexp082813.842717 Test.fa.defalse Test.fa.nmtf.retriever.scn Test.fa.nmtf.retriever.scn.list Test.fa.nmtf.ltrTE.LTR Test.fa.nmtf.ltrTE.LTR.masked Test.fa.nmtf.ltrTE.LTR.out Test.fa.nmtf.ltrTE.LTR.tbl Test.fa.nmtf.ltrTE.LTR.ori.out Test.fa.nmtf.ltrTE.LTR.cat Test.fa.nmtf.ltrTE.LTR.masked.cleanup Test.fa.nmtf.ltrTE.fa Test.fa.nmtf.ltrTE.fa.cleanup Test.fa.nmtf.ltrTE.stg1 Test.fa.nmtf.retriever.scn.extend Test.fa.nmtf.retriever.scn.extend.fa Test.fa.nmtf.retriever.scn.extend.fa.aa Test.fa.nmtf.retriever.scn.extend.fa.aa.tbl Test.fa.nmtf.retriever.scn.extend.fa.aa.scn Test.fa.nmtf.retriever.scn.extend.fa.aa.anno Test.fa.nmtf.retriever.scn.adj Test.fa.nmtf.defalse Test.fa.retriever.all.scn Test.fa.nmtf.ltrTE.pass Test.fa.nmtf.ltrTE.pass.clust.clstr Test.fa.nmtf.ltrTE.stg2 Test.fa.nmtf.ltrTE.trunc.list Test.fa.nmtf.retriever.scn.adj.list Test.fa.nmtf.ltrTE.trunc Test.fa.nmtf.ltrTE.veryfalse.list Test.fa.nmtf.ltrTE.veryfalse Test.fa.nmtf.ltrTE.veryfalse.fa Test.fa.nmtf.ltrTE.mask.lib Test.fa.nmtf.ltrTE.trunc.tbl Test.fa.nmtf.ltrTE.trunc.cat Test.fa.nmtf.ltrTE.trunc.out Test.fa.nmtf.ltrTE.trunc.masked Test.fa.nmtf.ltrTE.trunc.ori.out Test.fa.nmtf.ltrTE.trunc.masked.cleanup Test.fa.nmtf.ltrTE.trunc.cln Test.fa.nmtf.ltrTE.stg3.cln Test.fa.nmtf.ltrTE.stg3.line.out Test.fa.nmtf.ltrTE.stg3.dna.out Test.fa.nmtf.ltrTE.stg3.otherTE.out Test.fa.nmtf.ltrTE.stg3.cln.clean Test.fa.nmtf.ltrTE.stg3.plantP.out Test.fa.nmtf.ltrTE.stg3.cln.clean.clean Test.fa.nmtf.ltrTE Test.fa.nmtf.ltrTE.pass.list Test.fa.nmtf.ltrTE.pass.nmtf.list Test.fa.nmtf.LTRlib.fa Test.fa.LTRlib.redundant.fa Test.fa.LTRlib.fa Test.fa.pass.list Test.fa.nmtf.pass.list Test.fa.pass.list.gff3 Test.fa.masked Test.fa.out Test.fa.tbl Test.fa.out.fam.size.list Test.fa.out.superfam.size.list Test.fa.out.LTR.distribution.txt Test.fa.out.gff Test.fa.out.LAI.LTR.ava.out Test.fa.out.q.LAI.LTR.ava.age Test.fa.out.LAI

On pages 6 and 7 of the Manual, I found mention of .pass.list LTRlib.fa .pass.list.gff3 .gff

2. But I could not find a place in this manual that described the contents of several other files. Is there a description anywhere?

3. If not, then are several of these intermediate files that we do not need to worry about?

4. While run was in progress, I noticed a RepeatMasker folder, but upon completion it was not found in the run folder. Is that normal?

5. Coming back to solo LTR detection, you had indicated perl solo_finder.pl input > solo.list Which result file do I use as input for this step? What is its file extension?

6. For reporting full-length LTRs, you had indicated perl intact_finder_coarse.pl input > intact.list Which result file do I use as input for this step? What is its file extension?

7. Manual says the output file .pass.list.gff3 contains all intact LTR-RTs in GFF3 file format. What additional information will intact_finder_coarse.pl return, on top of what is already found in .pass.list.gff3?

8 In this thread, in an earlier post, you mentioned - " intact_finder_coarse.pl to find LTR elements with a roughly complete structure (LTR-Internal-LTR)" Would the structure: LTR-internal = solo LTR or be reported as truncated LTR or something else?

9. For calculating the ratio of solo LTRs to intact LTR-RTs, your suggested syntax was: perl solo_intact_ratio.pl solo.list intact.list I have no questions about input files here, since they are already generated using other Perl scripts. But does the output contain ratio for a sliding window, or some other type of output?

Thanks, Shujun.

Cheers, Anand

oushujun commented 5 years ago

Hi Anand,

You likely used the -v option so that you got a lot of intermediate files. These files seem complete if their contents are not void. Most likely the files specified in the manual are the ones useful to you, others may be used for debugging purposes. RepeatMasker will generate a folder during its run which will be deleted when finished.

The input file for solo_finder.pl and intact_finder_coarse.pl is the RepeatMasker .out file, i.e., Test.fa.out in your case. The second script is to find elements that roughly have the LTR-IN-LTR structure without checking the fine-grain structure as LTR_retriever does. Solo-LTR contains only the LTR part, others are put in the truncated element category. Solo-complete ratio is per family based.

Best, Shujun

anandksrao commented 5 years ago

With your patient and detailed help, I was able to use solo_finder.pl and intact_finder_coarse.pl to get their respective outputs as shown below. Thank you, again. perl solo_finder.pl Test.fa.out > Test_solo.list perl intact_finder_coarse.pl Test.fa.out > Test_intact.list

1. In the solo_finder.pl output, I want to make sure I understand the output. For example, 1st line in Test_solo.list reads: Chr_01:44448..44844 Chr_04:8819278..8819678_LTR Column #1 is reporting pseduomolecule and coords for solo LTR #1 Column #2 is reporting pseduomolecule and coords for it's best matching intact LTR-RT sequence Correct?

2. I wanted to compare the Test.fa.pass.list.gff3 to Test_intact.list, because both contain info about intact LTR-RTs.

grep -v "#" Test.fa.pass.list.gff3 | wc -l
15796

Test.fa.pass.list.gff3 contains ALL Intact LTR-RTs (as explained on page 7 in manual)

wc -l Test_intact.list
6313 Test_intact.list

I am trying to reason why this count 6313 is significantly less than 15796 I reasoned it is because each single line in Test_intact.list corresponds to multiple lines in the .gff3 file. Those multiple lines are one each for repeat_region, left TSD, LTR_retrotransposon (internal region), left LTR, right LTR, right TSD. But since not all repeat regions contain all of these listed features, therefore 15796 <<< 63616 6. Is my interpretation correct?

3. About the intermediate files, I checked, but did not use the -v option. So, then, could it have been an option switched on during installation? Can I force turn off -verbose for other runs? Or re-install differently? This would be useful to learn, because the output folder reached ~600M for a ~290MB input file. I'd prefer to avoid generating the intermediate files, please.

4. Earlier in this thread, you mentioned the need for processing the output of solo_finder.pl and intact_finder_coarse.pl , before using it with your solo_intact_ratio.pl script. Could you please expand on that? It looks like the ratio script is looking for counts, rather that a list, correct?. So, how exactly should I convert solo.list to solo.counts, and intact.list to intact.counts? Is there some other script under /bin for this, or do I use some Perl, sed, awk one-liner for this? Please advise.

grep -v "#" Test.fa.out.gff | wc -l
164329

grep -v "#" Test.fa.pass.list.gff3 | wc -l
15796

Test.fa.out.gff = generated by RepeatMasker, contains info for all LTR-RTs in genome, correct? Test.fa.pass.list.gff3= generated by LTR_retriever, contains info for all intact LTR-RTs, correct?

I find overlaps of coordinates between consecutive lines in Test.fa.out.gff, as shown in example lines below:

##sequence-region contig_2125 1 3287
contig_2125 RepeatMasker    LTR/Gypsy   2584    2638    3.6 +   449 Chr_01:17280118..17289223_INT
contig_2125 RepeatMasker    LTR/Gypsy   2611    3287    12.3    +   2576    Chr_01:17280118..17289223_INT

Despite lines with overlapping coords, why is the line count for Test.fa.out.gff way way more than for Test.fa.pass.list.gff3 ? In fact, 164329 / 15796 > 10X fold difference in line counts...

6. Can the LTR_retriever output file Test.fa.out be processed to yield GFF format info for truncated LTR elements? Not solo, not intact, but truncated specifically.

7. Finally, it would be very helpful to generate an output file that includes detail of whether a prediction is intact, truncated or solo - ideally in a GFF3 format file. So with simple grep for intact, truncated, solo, user can count and tally no. of intacts + no. of truncateds + no. of solos = total number of all LTRs in genome. With current scripts and output files, is it possible to generate such an output file?

Thanks a ton!

oushujun commented 5 years ago

Hi Anand, an alternative to these codes is the method described in this paper: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2004-5-10-r79#MOESM5

Best, Shujun

anandksrao commented 5 years ago

I am confused.... I thought processing LTR_retriever's RepeatMasker.out with

a. solo_finder.pl => All cases of solo LTRs in the genome

b. intact_finder_coarse.pl => All cases of LTR-INTERNAL-LTR which is intact cases in the genome

So it is unclear why you are suggested a different paper / method / scripts.... could you please clarify?

Also, is the LTR_retriever output compatible for reporting truncated LTRs? If yes, what steps and syntax, please?

That's my final question for this thread. Thanks!

oushujun commented 5 years ago

Hi Anand,

You are right about the purpose of these two scripts. However, they may require some maintenance or documentation before you can easily use it. Unfortunately, I don't have time at the moment to further investigate the issue you experienced. I will work on them but not right now (likely in two weeks), and I don't want to hold your progress. You may also try to debug these scripts. I will keep this issue open until I get into them.

The method LTR_MINER I suggested in the last post seems to have more specialized functionality for your purpose, as an alternative of the two simple scripts I have.

Best,

Shujun

anandksrao commented 5 years ago

Hi Shujun - Thank you!

I thank you for thinking of my research goals and suggesting an alternative.

My 1st preference is your software because it is the most recent LTR tool, has extensive documentation, your support on GitHub, and a validation via a peer-reviewed manuscript. And so, I'd like to fully explore use of LTR_retriever first, before starting to explore LTR_MINER etc.

From your response, I am thinking you looked at output snippets in my post here and then inferred that neither solo_finder.pl nr intact_finder_coarse.pl are producing the expected outputs. Right?

If that is the case, then sure, I can try to debug those 2 scripts. But I need to know what their expected output / formats look like.
Please upload 2 example output files here or email them to me please.

Thanks, Shujun!

Cheers, Anand

oushujun commented 5 years ago

Hi Anand,

Sorry again for the delay of response. I debugged the code and they should work now. Please run perl solo_intact_ratio.pl for instructions. You will see quite a lot of inf rows due to the limitation of detecting LTR elements with complete structure (LTR-IN-LTR) using the RepeatMasker output. You may improve the algorithm if you have a better idea. For comparison between LTR families and species, you may just remove inf families and compare the rest.

I hope these codes help, or you may want to consider a more comprehensive method developed by Pereira (2004) as mentioned above.

Best, Shujun

anandksrao commented 5 years ago

At https://github.com/oushujun/LTR_retriever/tree/master/bin, it shows that two Perl scripts were updated a few hours ago:

solo_finder.pl
solo_intact_ratio.pl

See image below: GitHub_Screenshot

However, a check between old installed versus new versions of these Perl scripts reveals no difference.

diff /share/apps/LTR_retriever-2.1/bin/solo_finder.pl solo_finder.pl | wc -l
0

diff /share/apps/LTR_retriever-2.1/bin/intact_finder_coarse.pl intact_finder_coarse.pl | wc -l
0

Also, I could not obtain the help menu for new version of the script - solo_intact_ratio.pl.

perl solo_intact_ratio.pl 
Use of uninitialized value $ARGV[0] in concatenation (.) or string at solo_intact_ratio.pl line 10.

perl solo_intact_ratio.pl -h
perl solo_intact_ratio.pl -help
perl solo_intact_ratio.pl --h
perl solo_intact_ratio.pl --help

all give the same error message No such file or directory at solo_intact_ratio.pl line 10.

Could it be you have uploaded the old scripts back again? FYI, I used the new scripts as stand-alone, not after complete LTR_retriever re-installation.

Please clarify. Thanks!

oushujun commented 5 years ago

Hi Anand,

I am sorry that you are confused. Please run perl solo_intact_ratio.pl for instructions. No -help needed.

Best, Shujun

anandksrao commented 5 years ago

Thanks for claryifying.