Closed by lcoombe 3 years ago
Hi Lauren,
It's likely that the program did not identify any non-TGCA LTR elements in the genome. Please check whether there are any entries in the *nmtf.pass.list file.
Best, Shujun
Hi Shujun,
Thanks for the prompt response!
I took a look at that file, and it does contain entries:
[lcoombe@hpce706 ltr_retriever-RC-genome-V4.500plus.seqtk_5.fa]$ ls *LTRlib*fa
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa
[lcoombe@hpce706 ltr_retriever-RC-genome-V4.500plus.seqtk_5.fa]$ cat *nmtf.pass.list
#LTR_loc Category Motif TSD 5_TSD 3_TSD Internal Identity Strand SuperFamily TE_type Insertion_Time
s00002979:17199..23609 pass motif:TGTA TSD:CTCGT 17194..17198 23610..23614 IN:17699..23109 0.9620 ? unknown NA 1499864
s00003068:6229694..6231445 pass motif:TGCT TSD:ATAAT 6229689..6229693 6231446..6231450 IN:6230017..6231122 0.969unknown NA 1212196
s00003156:318552..323859 pass motif:TGAC TSD:ACAAC 318547..318551 323862..323866 IN:319136..323281 0.9465 + GypsyLTR 2136454
s00003321:283843..285140 pass motif:TATA TSD:ATAGC 283838..283842 285141..285145 IN:284023..284970 0.9649 ? unknown NA 1382119
s00003397:107320..114677 pass motif:TGTA TSD:TATAT 107315..107319 114678..114682 IN:107742..114256 0.9378 - GypsyLTR 2497401
s00003422:254140..258317 pass motif:TGTG TSD:CTCTG 254135..254139 258318..258322 IN:254302..258155 0.9693 - GypsyLTR 1204601
s00003576:682825..684549 pass motif:TGGT TSD:ATGTA 682820..682824 684550..684554 IN:683010..684364 0.9462 ? unknown NA 2145677
s00003590:124843..130198 pass motif:TACA TSD:CTCAT 124838..124842 130199..130203 IN:125121..129921 0.9065 - GypsyLTR 3841969
s00003717:232257..237592 pass motif:TTTT TSD:TTGTT 232252..232256 237593..237597 IN:232443..237408 0.9514 + GypsyLTR 1934538
s00003947:179251..184709 pass motif:TGTA TSD:CTGGG 179246..179250 184710..184714 IN:179955..184005 0.9445 - GypsyLTR 2216757
s00004212:169574..175432 pass motif:TGTA TSD:TGATC 169569..169573 175433..175437 IN:170291..174715 0.9064 - GypsyLTR 3844193
s00004313:532336..536866 pass motif:TATA TSD:AAACA 532331..532335 536867..536871 IN:532786..536417 0.9443 ? unknown NA 2225177
Are there cases where it is expected to have entries in the file, but they don't end up in the *nmtf.LTRlib.fa file?
Thanks for your help! Lauren
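For a quick check that does not rely on eyeballing `cat` output, the pass list can be summarized with a few lines of Python. This is only a minimal sketch: it assumes the whitespace-separated layout shown in the listing above, and the function name `summarize_pass_list` is made up for illustration (it is not part of LTR_retriever). The sample rows are copied from the listing.

```python
from collections import Counter


def summarize_pass_list(text):
    """Return (number of data rows, Counter of terminal motifs).

    Assumes the layout shown in the thread: a '#' header line followed by
    whitespace-separated rows containing a 'motif:XXXX' field.
    """
    motifs = Counter()
    rows = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        rows += 1
        for field in line.split():
            if field.startswith("motif:"):
                motifs[field[len("motif:"):]] += 1
                break
    return rows, motifs


# Three rows copied from the *nmtf.pass.list listing in this thread.
sample = """#LTR_loc Category Motif TSD 5_TSD 3_TSD Internal Identity Strand SuperFamily TE_type Insertion_Time
s00002979:17199..23609 pass motif:TGTA TSD:CTCGT 17194..17198 23610..23614 IN:17699..23109 0.9620 ? unknown NA 1499864
s00003156:318552..323859 pass motif:TGAC TSD:ACAAC 318547..318551 323862..323866 IN:319136..323281 0.9465 + GypsyLTR 2136454
s00003397:107320..114677 pass motif:TGTA TSD:TATAT 107315..107319 114678..114682 IN:107742..114256 0.9378 - GypsyLTR 2497401
"""

rows, motifs = summarize_pass_list(sample)
print(rows, dict(motifs))
```

A non-zero row count here, combined with a missing *nmtf.LTRlib.fa, is the mismatch being asked about.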
Hi Lauren,
Can you paste the program's screen output here? And if rerunning the program is not too slow, please rerun it with the -v parameter so that we can check the intermediate files to further track down the cause.
Best, Shujun
Hi Shujun,
For sure -- here's the full log:
Parameters: -genome /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa -infinder /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.finder.scn -inharvest /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.harvest.scn -nonTGCA /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.harvest.nonTGCA.scn -threads 4 -noanno
Thu Aug 27 21:14:56 PDT 2020 Dependency checking: All passed!
Thu Aug 27 21:15:13 PDT 2020 LTR_retriever is starting from the Init step.
Thu Aug 27 21:15:25 PDT 2020 Start to convert inputs...
Total candidates: 2629
Total uniq candidates: 2557
Thu Aug 27 21:15:34 PDT 2020 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.
Thu Aug 27 21:17:29 PDT 2020 1040 clean candidates remained
Thu Aug 27 21:17:29 PDT 2020 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.
Thu Aug 27 21:19:15 PDT 2020 Intact LTR-RT found: 161
Thu Aug 27 21:19:20 PDT 2020 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.
Thu Aug 27 21:19:20 PDT 2020 54 truncated LTR-RTs found
Thu Aug 27 21:19:48 PDT 2020 21 truncated LTR sequences have added to the library
Thu Aug 27 21:19:48 PDT 2020 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
Total library sequences: 296
Thu Aug 27 21:23:10 PDT 2020 Retained clean sequence: 296
Thu Aug 27 21:23:10 PDT 2020 Sequence clustering for RC-genome-V4.500plus.seqtk_5.fa.ltrTE ...
Thu Aug 27 21:23:10 PDT 2020 Unique lib sequence: 296
Thu Aug 27 21:23:12 PDT 2020 Module 7: Start to analyze non-TGCA LTR-RT candidates...
Total non-TGCA candidates: 5823
Thu Aug 27 21:23:12 PDT 2020 Start to remove non-TGCA candidates that are >=60% identical to TGCA LTRs...
Thu Aug 27 21:25:08 PDT 2020 Total uniq non-TGCA candidates: 3880
Thu Aug 27 21:25:08 PDT 2020 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.
Thu Aug 27 21:25:12 PDT 2020 3719 clean non-TGCA candidates remained
Thu Aug 27 21:25:12 PDT 2020 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.
Thu Aug 27 21:31:47 PDT 2020 Intact non-TGCA LTR-RT found: 13
Thu Aug 27 21:31:51 PDT 2020 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.
Thu Aug 27 21:31:52 PDT 2020 37 truncated LTR-RTs found
Thu Aug 27 21:32:15 PDT 2020 58 truncated LTR sequences have added to the library
Thu Aug 27 21:32:15 PDT 2020 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
Total library sequences: 87
Thu Aug 27 21:33:18 PDT 2020 Retained clean sequence: 87
Thu Aug 27 21:33:19 PDT 2020 Module 6: Start to remove nested insertions in internal regions...
Thu Aug 27 21:34:26 PDT 2020 Raw internal region size (bit): 779494
Clean internal region size (bit): 692038
Thu Aug 27 21:34:26 PDT 2020 Sequence number of the redundant LTR-RT library: 600
The redundant LTR-RT library size (bit): 1066159
Thu Aug 27 21:34:26 PDT 2020 Module 8: Start to make non-redundant library...
Thu Aug 27 21:34:27 PDT 2020 Final LTR-RT library entries: 363
Final LTR-RT library size (bit): 808187
Thu Aug 27 21:34:27 PDT 2020 Total intact LTR-RTs found: 173
Total intact non-TGCA LTR-RTs found: 12
Thu Aug 27 21:34:31 PDT 2020 All analyses were finished!
##############################
####### Result files #########
##############################
Table output for intact LTR-RTs (detailed info)
RC-genome-V4.500plus.seqtk_5.fa.pass.list (All LTR-RTs)
RC-genome-V4.500plus.seqtk_5.fa.nmtf.pass.list (Non-TGCA LTR-RTs)
RC-genome-V4.500plus.seqtk_5.fa.pass.list.gff3 (GFF3 format for intact LTR-RTs)
LTR-RT library
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa (All non-redundant LTR-RTs)
RC-genome-V4.500plus.seqtk_5.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)
I'll also launch another run with -v!
Thanks, Lauren
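The mismatch between the "Result files" section of the log and what is actually on disk can be checked mechanically. The sketch below is an illustration only: the helper names `listed_result_files` and `missing_result_files` are hypothetical, and the parsing assumes the log format pasted above (a file name followed by a parenthesized description). It recreates the situation from this thread in a temporary directory, where only the two LTRlib files exist.

```python
import os
import re
import tempfile


def listed_result_files(log_text):
    """Collect file names from the 'Result files' section of the log."""
    files = []
    in_results = False
    for line in log_text.splitlines():
        if "Result files" in line:
            in_results = True
            continue
        if in_results:
            # A file line is one token followed by a parenthesized note.
            match = re.match(r"\s*(\S+)\s+\(.+\)\s*$", line)
            if match:
                files.append(match.group(1))
    return files


def missing_result_files(log_text, directory="."):
    """Return the listed files that do not exist in `directory`."""
    return [name for name in listed_result_files(log_text)
            if not os.path.isfile(os.path.join(directory, name))]


# Tail of the log pasted in this thread.
log_tail = """\
####### Result files #########
LTR-RT library
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa (All non-redundant LTR-RTs)
RC-genome-V4.500plus.seqtk_5.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)
"""

# Simulate the reported state: only the two LTRlib files are present.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa",
                 "RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa"):
        open(os.path.join(workdir, name), "w").close()
    missing = missing_result_files(log_tail, workdir)

print(missing)
```

In a real run, `missing_result_files(open(logfile).read(), ".")` in the working directory would serve the same purpose, where `logfile` is wherever the screen output was saved.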
At a glance, the log file looks good to me. I will take a closer look at each step. The number of intact LTR elements seems a little low to me. Did you use the hard-masked genome or the soft-masked/unmasked one for LTRharvest and LTR_FINDER?
It could be partly because it's not a very contiguous assembly? The N50 is ~150 kb, and I split the file into partitions so it would run faster. The input genome is unmasked -- this is one of my steps to create a custom repeat library for my genome assembly so I can mask it before gene annotation.
Splitting the genome is suboptimal because the filtering step needs a bigger sample size to be effective. You can run the whole genome with more threads instead; the parallelism is quite efficient.
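As a rough illustration of an unpartitioned rerun, the command line could be assembled as below. This is a sketch under assumptions: it uses only flags that appear in the Parameters line of the log above plus the -v flag requested earlier, the input file names follow the `<genome>.finder.scn` / `<genome>.harvest.scn` pattern seen in that log, and `whole_genome.fa` and the thread count of 32 are placeholders, not recommendations.

```python
import shlex


def build_ltr_retriever_cmd(genome, threads, verbose=False):
    """Build an LTR_retriever argument list for a whole-genome run.

    Input .scn names are derived from the genome path, mirroring the
    naming used in the log in this thread (an assumption for other setups).
    """
    cmd = [
        "LTR_retriever",
        "-genome", genome,
        "-infinder", genome + ".finder.scn",
        "-inharvest", genome + ".harvest.scn",
        "-nonTGCA", genome + ".harvest.nonTGCA.scn",
        "-threads", str(threads),
        "-noanno",
    ]
    if verbose:
        cmd.append("-v")  # requested earlier for inspecting intermediate files
    return cmd


# Placeholder genome name and thread count, for illustration only.
cmd = build_ltr_retriever_cmd("whole_genome.fa", threads=32, verbose=True)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell or a job script; passing `cmd` to `subprocess.run` would launch it directly.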
So I did it because I'm running other tools as well (LTR finder, RepeatModeler), and the genome I'm working with is quite large (~6GB). Do you think that the parallelism would scale to a genome of that size??
Yes, it scales well. Check out the wheat issue for benchmarks. You may also try EDTA, which integrates many good tools.
Ok cool - I'll give it a try without partitioning (it was too slow previously but I see there have been significant improvements since I last tried!). And thanks for the suggestion about EDTA -- I think another member of our group tried it but found that one of the components (I think TIR-learner) was quite slow, so that's why I haven't tried it myself yet. Thanks for your suggestions!
A recent update should have made TIR-Learner much faster. Please try it out if you get a chance. Thanks!
Shujun
Hello,
I'm running LTR_retriever v2.9.0 (installed via conda), and based on the logs I'm expecting to see these output files in my working directory:
However, I'm only seeing two of those files:
Specified parameters:
Any idea why that fasta file isn't being generated? Or am I looking in the wrong place?
Thanks so much! Lauren