oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

File *nmtf.LTRlib.fa not made #81

Closed lcoombe closed 3 years ago

lcoombe commented 3 years ago

Hello,

I'm running LTR retriever v2.9.0 (installed via conda), and based on the logs I'm expecting to see these output files in my working directory:

LTR-RT library
        RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
        RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa (All non-redundant LTR-RTs)
        RC-genome-V4.500plus.seqtk_5.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

However, I'm only seeing two of those files:

[lcoombe]$ ls *LTRlib*fa
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa  RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa

Specified parameters:

Parameters: -genome RC-genome-V4.500plus.seqtk_5.fa -infinder RC-genome-V4.500plus.seqtk_5.fa.finder.scn -inharvest RC-genome-V4.500plus.seqtk_5.fa.harvest.scn -nonTGCA RC-genome-V4.500plus.seqtk_5.fa.harvest.nonTGCA.scn -threads 4 -noanno

Any idea why that fasta file isn't being generated? Or am I looking in the wrong place?

Thanks so much! Lauren
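
A quick way to see which of the three library files were actually written is sketched below. The function name `check_ltr_libs` is made up for illustration; the file suffixes are the ones listed in the log above, and the sketch assumes the outputs sit next to each other under the genome file name.

```shell
# Hypothetical helper: report which of the three library files named in the
# log exist and are non-empty for a given genome file name.
check_ltr_libs() {
    local genome=$1 suffix f
    for suffix in LTRlib.redundant.fa LTRlib.fa nmtf.LTRlib.fa; do
        f="$genome.$suffix"
        if [ -s "$f" ]; then
            echo "ok      $f"      # file exists and has content
        elif [ -e "$f" ]; then
            echo "empty   $f"      # file exists but is zero bytes
        else
            echo "missing $f"      # file was never written
        fi
    done
}
```

Calling `check_ltr_libs RC-genome-V4.500plus.seqtk_5.fa` in the working directory distinguishes a file that was never created from one that was created but left empty.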

oushujun commented 3 years ago

Hi Lauren,

It's likely that the program did not identify any non-TGCA LTR elements in the genome. Please check if there are any entries in the *nmtf.pass.list file.

Best, Shujun
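
One way to run that check from the shell is sketched below. The helper name `count_pass_entries` is hypothetical, and the sketch assumes the pass list is plain text whose header line starts with `#`.

```shell
# Hypothetical helper: count data rows in a pass list, skipping blank lines
# and any '#'-prefixed header line (an assumption about the file layout).
count_pass_entries() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_pass_entries RC-genome-V4.500plus.seqtk_5.fa.nmtf.pass.list` prints `0` when the list holds only a header or nothing at all.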

lcoombe commented 3 years ago

Hi Shujun,

Thanks for the prompt response!

I took a look at that file, but it looks like there are entries:

[lcoombe@hpce706 ltr_retriever-RC-genome-V4.500plus.seqtk_5.fa]$ ls *LTRlib*fa
RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa  RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa
[lcoombe@hpce706 ltr_retriever-RC-genome-V4.500plus.seqtk_5.fa]$ cat *nmtf.pass.list
#LTR_loc    Category    Motif   TSD 5_TSD 3_TSD Internal    Identity    Strand  SuperFamily TE_type Insertion_Time
s00002979:17199..23609  pass    motif:TGTA  TSD:CTCGT   17194..17198    23610..23614    IN:17699..23109 0.9620  ?   unknown NA  1499864
s00003068:6229694..6231445  pass    motif:TGCT  TSD:ATAAT   6229689..6229693    6231446..6231450    IN:6230017..6231122 0.969unknown    NA  1212196
s00003156:318552..323859    pass    motif:TGAC  TSD:ACAAC   318547..318551  323862..323866  IN:319136..323281   0.9465  +   Gypsy   LTR 2136454
s00003321:283843..285140    pass    motif:TATA  TSD:ATAGC   283838..283842  285141..285145  IN:284023..284970   0.9649  ?   unknown NA  1382119
s00003397:107320..114677    pass    motif:TGTA  TSD:TATAT   107315..107319  114678..114682  IN:107742..114256   0.9378  -   Gypsy   LTR 2497401
s00003422:254140..258317    pass    motif:TGTG  TSD:CTCTG   254135..254139  258318..258322  IN:254302..258155   0.9693  -   Gypsy   LTR 1204601
s00003576:682825..684549    pass    motif:TGGT  TSD:ATGTA   682820..682824  684550..684554  IN:683010..684364   0.9462  ?   unknown NA  2145677
s00003590:124843..130198    pass    motif:TACA  TSD:CTCAT   124838..124842  130199..130203  IN:125121..129921   0.9065  -   Gypsy   LTR 3841969
s00003717:232257..237592    pass    motif:TTTT  TSD:TTGTT   232252..232256  237593..237597  IN:232443..237408   0.9514  +   Gypsy   LTR 1934538
s00003947:179251..184709    pass    motif:TGTA  TSD:CTGGG   179246..179250  184710..184714  IN:179955..184005   0.9445  -   Gypsy   LTR 2216757
s00004212:169574..175432    pass    motif:TGTA  TSD:TGATC   169569..169573  175433..175437  IN:170291..174715   0.9064  -   Gypsy   LTR 3844193
s00004313:532336..536866    pass    motif:TATA  TSD:AAACA   532331..532335  536867..536871  IN:532786..536417   0.9443  ?   unknown NA  2225177

Are there cases where it is expected to have entries in the file, but they don't end up in the *nmtf.LTRlib.fa file?

Thanks for your help! Lauren
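
To confirm that every row in the list really carries a non-TGCA terminal motif, the motif column can be tallied. This is only a sketch: `tally_motifs` is a made-up name, and it assumes whitespace-separated rows that each contain one field of the form `motif:XXXX`, as in the listing above.

```shell
# Hypothetical helper: tally terminal motifs in a pass list.
# Assumes whitespace-separated rows with one field of the form "motif:XXXX".
tally_motifs() {
    awk '!/^#/ {
             # Scan every field so the tally does not depend on column position.
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^motif:/) { sub(/^motif:/, "", $i); count[$i]++ }
         }
         END { for (m in count) print count[m], m }' "$1" |
        sort -k1,1nr -k2,2      # most frequent motif first, then alphabetical
}
```

Applied to the twelve rows above, this would report TGTA four times and TATA twice, with each of the remaining six motifs appearing once.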

oushujun commented 3 years ago

Hi Lauren,

Can you paste the program screen output here? And if rerunning the program is not too slow, please rerun it with the -v parameter so that we can check the intermediate files to further track down the cause.

Best, Shujun
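
A sketch of what the rerun could look like, reusing the parameters reported earlier in this thread with `-v` appended. The function `build_rerun_cmd` is hypothetical and only prints the command, so it can be inspected before launching; it assumes the `.finder.scn`, `.harvest.scn`, and `.harvest.nonTGCA.scn` inputs are named after the genome file as above.

```shell
# Hypothetical helper: print (not run) the LTR_retriever command from this
# thread with -v appended. Input file names are assumed to follow the
# <genome>.finder.scn / .harvest.scn / .harvest.nonTGCA.scn pattern.
build_rerun_cmd() {
    local genome=$1 threads=${2:-4}
    printf 'LTR_retriever -genome %s -infinder %s.finder.scn -inharvest %s.harvest.scn -nonTGCA %s.harvest.nonTGCA.scn -threads %s -noanno -v\n' \
        "$genome" "$genome" "$genome" "$genome" "$threads"
}
```

Running `build_rerun_cmd RC-genome-V4.500plus.seqtk_5.fa` prints the full command line; the optional second argument changes the thread count.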

lcoombe commented 3 years ago

Hi Shujun,

For sure -- here's the full log:

Parameters: -genome /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa -infinder /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.finder.scn -inharvest /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.harvest.scn -nonTGCA /projects/bullfrog_assembly_scratch/genome/annotation/version4/repeat-masking/custom-repeat-library/RC-genome-V4.500plus.seqtk_5.fa.harvest.nonTGCA.scn -threads 4 -noanno

Thu Aug 27 21:14:56 PDT 2020    Dependency checking: All passed!
Thu Aug 27 21:15:13 PDT 2020    LTR_retriever is starting from the Init step.
Thu Aug 27 21:15:25 PDT 2020    Start to convert inputs...
                Total candidates: 2629
                Total uniq candidates: 2557

Thu Aug 27 21:15:34 PDT 2020    Module 1: Start to clean up candidates...
                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                Sequences containing tandem repeats will be discarded.

Thu Aug 27 21:17:29 PDT 2020    1040 clean candidates remained

Thu Aug 27 21:17:29 PDT 2020    Modules 2-5: Start to analyze the structure of candidates...
                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Aug 27 21:19:15 PDT 2020    Intact LTR-RT found: 161

Thu Aug 27 21:19:20 PDT 2020    Module 6: Start to analyze truncated LTR-RTs...
                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                Use -notrunc if you don't want to keep them.

Thu Aug 27 21:19:20 PDT 2020    54 truncated LTR-RTs found
Thu Aug 27 21:19:48 PDT 2020    21 truncated LTR sequences have added to the library

Thu Aug 27 21:19:48 PDT 2020    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                Total library sequences: 296
Thu Aug 27 21:23:10 PDT 2020    Retained clean sequence: 296

Thu Aug 27 21:23:10 PDT 2020    Sequence clustering for RC-genome-V4.500plus.seqtk_5.fa.ltrTE ...
Thu Aug 27 21:23:10 PDT 2020    Unique lib sequence: 296

Thu Aug 27 21:23:12 PDT 2020    Module 7: Start to analyze non-TGCA LTR-RT candidates...
                Total non-TGCA candidates: 5823
Thu Aug 27 21:23:12 PDT 2020    Start to remove non-TGCA candidates that are >=60% identical to TGCA LTRs...
Thu Aug 27 21:25:08 PDT 2020    Total uniq non-TGCA candidates: 3880

Thu Aug 27 21:25:08 PDT 2020    Module 1: Start to clean up candidates...
                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                Sequences containing tandem repeats will be discarded.

Thu Aug 27 21:25:12 PDT 2020    3719 clean non-TGCA candidates remained

Thu Aug 27 21:25:12 PDT 2020    Modules 2-5: Start to analyze the structure of candidates...
                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Aug 27 21:31:47 PDT 2020    Intact non-TGCA LTR-RT found: 13

Thu Aug 27 21:31:51 PDT 2020    Module 6: Start to analyze truncated LTR-RTs...
                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                Use -notrunc if you don't want to keep them.

Thu Aug 27 21:31:52 PDT 2020    37 truncated LTR-RTs found
Thu Aug 27 21:32:15 PDT 2020    58 truncated LTR sequences have added to the library

Thu Aug 27 21:32:15 PDT 2020    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                Total library sequences: 87
Thu Aug 27 21:33:18 PDT 2020    Retained clean sequence: 87

Thu Aug 27 21:33:19 PDT 2020    Module 6: Start to remove nested insertions in internal regions...
Thu Aug 27 21:34:26 PDT 2020    Raw internal region size (bit): 779494
                Clean internal region size (bit): 692038

Thu Aug 27 21:34:26 PDT 2020    Sequence number of the redundant LTR-RT library: 600
                The redundant LTR-RT library size (bit): 1066159

Thu Aug 27 21:34:26 PDT 2020    Module 8: Start to make non-redundant library...

Thu Aug 27 21:34:27 PDT 2020    Final LTR-RT library entries: 363
                Final LTR-RT library size (bit): 808187

Thu Aug 27 21:34:27 PDT 2020    Total intact LTR-RTs found: 173
                Total intact non-TGCA LTR-RTs found: 12

Thu Aug 27 21:34:31 PDT 2020    All analyses were finished!

##############################
####### Result files #########
##############################

Table output for intact LTR-RTs (detailed info)
    RC-genome-V4.500plus.seqtk_5.fa.pass.list (All LTR-RTs)
    RC-genome-V4.500plus.seqtk_5.fa.nmtf.pass.list (Non-TGCA LTR-RTs)
    RC-genome-V4.500plus.seqtk_5.fa.pass.list.gff3 (GFF3 format for intact LTR-RTs)

LTR-RT library
    RC-genome-V4.500plus.seqtk_5.fa.LTRlib.redundant.fa (All LTR-RTs with redundancy)
    RC-genome-V4.500plus.seqtk_5.fa.LTRlib.fa (All non-redundant LTR-RTs)
    RC-genome-V4.500plus.seqtk_5.fa.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

I'll also launch another run with the -v!

Thanks, Lauren

oushujun commented 3 years ago

At a glance, the log file looks good to me. I will take a closer look at each step. The number of intact LTR elements seems a little low to me. Did you use the hard-masked genome or the soft-masked/unmasked one for LTRharvest and LTR_FINDER?
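
For comparing runs, the candidate and intact-element counts can be pulled out of a saved copy of the screen output. The sketch below assumes the log was saved to a file and that the message wording matches the log pasted in this thread; `summarize_ltr_log` is a made-up name.

```shell
# Hypothetical helper: print only the count lines from a saved LTR_retriever
# screen log. The patterns mirror the message wording seen in this thread.
summarize_ltr_log() {
    grep -E 'Total (uniq )?(non-TGCA )?candidates:|Intact (non-TGCA )?LTR-RT found:|Total intact (non-TGCA )?LTR-RTs found:' "$1"
}
```

Saving the output of a partitioned run and a whole-genome run to separate files and passing each to `summarize_ltr_log` makes the counts easy to set side by side.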

lcoombe commented 3 years ago

It could partly be because the assembly is not very contiguous. The N50 is ~150kb, and I split the file into partitions so it ran faster. The input genome is unmasked -- this is one of my steps to create a custom repeat library for my genome assembly so I can mask it before gene annotation.

oushujun commented 3 years ago

Splitting the genome is suboptimal because the filtering step needs a bigger sample size to be effective. You can run the whole genome with more threads instead; the parallelism is quite efficient.

lcoombe commented 3 years ago

So I did it because I'm running other tools as well (LTR finder, RepeatModeler), and the genome I'm working with is quite large (~6GB). Do you think that the parallelism would scale to a genome of that size??

oushujun commented 3 years ago

Yes, it scales well. Check out the wheat issue for benchmarks. You may also try EDTA, which integrates many good tools.

lcoombe commented 3 years ago

Ok cool - I'll give it a try without partitioning (it was too slow previously but I see there have been significant improvements since I last tried!). And thanks for the suggestion about EDTA -- I think another member of our group tried it but found that one of the components (I think TIR-learner) was quite slow, so that's why I haven't tried it myself yet. Thanks for your suggestions!

oushujun commented 3 years ago

A recent update should have made TIR-Learner much faster. Please try it out if you get a chance. Thanks!

Shujun
