oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0
179 stars 40 forks source link

Short sequences in LTRlib.fa #7

Closed sahammond closed 6 years ago

sahammond commented 7 years ago

Hi Shujun,

I was wondering if it's expected that there are many short sequences in ~.LTRlib.fa? Of the 8424 entries, 81 are 11 bp long, and 542 are less than 100 bp long.

Also, is ~.LTRlib.fa the file that I can use as a custom RepeatMasker library for other related genome assemblies?

Thanks, Austin

oushujun commented 7 years ago

Hi Austin,

That looks not right. The minimum length of a library entry should be 100bp. Please check the STDOUT to see any errors. To answer your second question, it dependents. The LTRlib.fa can be used as a custom RM library for the genome it was constructed from, but for a sister species, it dependents on how close they are. LTR evolves very fast, if the two species diverged more than 4 MY, a lot of LTRs are not shared. If you have both genomes, why don't you construct libraries for both of them?

Regards, Shujun

sahammond commented 7 years ago

Hi Shujun,

The log looks clean, aside from these five warnings:

Wed Jul 26 09:45:15 PDT 2017 Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

Use of uninitialized value $chr_pre in hash element at /projects/btl/shammond/git/LTR_retriever/bin/call_seq_by_list.pl line 81. Warning: [blastn] lcl|Query_1 Pg-02-2358717:7..13490[2]: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options Warning: [blastn] lcl|Query_1 Pg-02-2399049:12..11834[2]: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options Warning: [blastn] lcl|Query_1 Pg-02-2475365:2871..10563[2]: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options Warning: [blastn] lcl|Query_1 Pg-02-2692868:12..5422[2]: Warning: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options Sun Aug 6 11:15:32 PDT 2017 Intact non-TGCA LTR found: 94

Could they be responsible for the short sequences? Would it be valid to exclude any sequences < 100 bp long?

I have a newer genome assembly from the same individual, for which I'd like to use the RM library generated from a previous assembly. I also have 2 genomes from related species, but I'll generate libraries for them separately. The genomes I'm working with are ~20 Gbp, and LTR_retriever took about 3 weeks to output the LTRlib (48 cores); it's still repeat masking batches of scaffolds.

Thanks, Austin

oushujun commented 7 years ago

Hi Austin,

Looks like you are using different genome versions. Please make sure the genome you provided to LTR_retriever is the same version to that you used to generate the raw LTR candidates. If two different versions of genomes were provided, the coordinates specified in the raw candidate file are not guiding the program to the right positions, which could be risky that some gene sequences were used as LTR candidates. But for a genome this large, such risk could be low. You may try to remove those short sequences and see how the library performs (which will also risk your time). I recommend that you use a smaller subset of your genome, say 500mb, to see if the end product also likes this. Another thing you can do to accelerate your run: you can split the genome file into smaller parts, say 4GB each, then run everything independently (including LTRharvest), and combine libraries at the end for the whole-genome RM masking. The reason is that LTR_retriever is a bit limited by PERL's IO bottleneck, which would not increase CPU usage efficiency even you provide a lot of cores to it. I've tested it in our server and gained an actual CPU usage ~800%. So a better practice with 48 CPUs is to run 5 LTR_retriever runs at the same time. More details: With split runs, you should add -noanno (not annotating whole-genome LTRs at the end of LTR_retriever) for each LTR_retriever task. After all split runs finished, use CD-HIT to reduce redundancy of sub-libraries: cat *LTRlib.fa > all.LTRlib.fa cd-hit-est -i all.LTRlib.fa -o all.LTRlib.fa.uniq -c 0.8 -G 0.8 -s 0.9 -aL 0.9 -aS 0.9 -M 0 -T 48 Then use RM to perform whole-genome annotation: nohup RepeatMasker -e ncbi -q -pa 48 -no_is -norna -nolow -div 40 -lib all.LTRlib.fa.uniq -cutoff 225 whole-genome.fa &

Hope this helps with your future run. Let me know if you have more questions.

Regards, Shujun

sahammond commented 7 years ago

Hi Shujun,

Thanks very much for your detailed advice. I'll split my genome as you've recommended and post an update once it's finished.

Cheers, Austin

sahammond commented 7 years ago

Hi Shujun,

Unfortunately, even with making sure that each split piece had its own candidate predictions for LTR_retriever to assess, I still have many short LTR sequences in LTRlib.fa:

Switched to perl-5.26.0.

##########################

LTR_retriever v1.3

##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00 -infinder /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.finder.scn -inharvest /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.harvest.scn -nonTGCA /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.harvest.nonTGCA.scn -threads 8 -noanno

Previous LTR_retriever results found, backed up to LTRretriever-pre08-25-17_1716

Fri Aug 25 17:16:50 PDT 2017 The longest sequence ID in the genome contains 605 characters, which is longer than the limit (15) Trying to reformat seq IDs... Attempt 1... Fri Aug 25 17:17:46 PDT 2017 Seq ID conversion successful!

Fri Aug 25 17:17:46 PDT 2017 Start to convert inputs... Total candidates: 21146 Total uniq candidates: 20386

Fri Aug 25 17:19:06 PDT 2017 Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Fri Aug 25 17:31:36 PDT 2017 11977 clean candidates remained

Fri Aug 25 17:31:36 PDT 2017 Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

Fri Aug 25 20:26:19 PDT 2017 Intact LTR found: 1713

Fri Aug 25 20:29:46 PDT 2017 Start to analyze truncated LTRs... Truncated LTRs without the intact version will be retained in the LTR library. Use -notrunc if you don't want to keep them.

Fri Aug 25 20:29:46 PDT 2017 948 truncated LTRs found Fri Aug 25 20:36:25 PDT 2017 273 truncated LTR sequences have added to the library

Fri Aug 25 20:36:25 PDT 2017 Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 2297 Fri Aug 25 21:17:56 PDT 2017 Retained clean sequence: 2296

Fri Aug 25 21:17:56 PDT 2017 Sequence clustering for /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.ltrTE ... Fri Aug 25 21:17:56 PDT 2017 Unique lib sequence: 2260

Fri Aug 25 21:20:03 PDT 2017 Start to analyze non-TGCA LTR candidates... Total non-TGCA candidates: 45472 Fri Aug 25 21:20:03 PDT 2017 Start to remove non-TGCA candidates that are >=60% identical to TGCA LTRs... Fri Aug 25 22:20:33 PDT 2017 Total uniq non-TGCA candidates: 20059

Fri Aug 25 22:20:34 PDT 2017 Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Fri Aug 25 22:23:24 PDT 2017 13166 clean non-TGCA candidates remained

Fri Aug 25 22:23:24 PDT 2017 Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

Sat Aug 26 03:03:06 PDT 2017 Intact non-TGCA LTR found: 18

Sat Aug 26 03:03:29 PDT 2017 Start to analyze truncated LTRs... Truncated LTRs without the intact version will be retained in the LTR library. Use -notrunc if you don't want to keep them.

Sat Aug 26 03:03:29 PDT 2017 777 truncated LTRs found Sat Aug 26 03:05:34 PDT 2017 1367 truncated LTR sequences have added to the library

Sat Aug 26 03:05:34 PDT 2017 Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 1403 Sat Aug 26 03:18:08 PDT 2017 Retained clean sequence: 1403

Sat Aug 26 03:18:11 PDT 2017 Start to remove nested insertions in internal regions... Sat Aug 26 03:21:21 PDT 2017 Raw internal region size (bit): 7058461 Clean internal region size (bit): 2689986

Sat Aug 26 03:21:21 PDT 2017 Start to make non-redundant library...

Sat Aug 26 03:22:21 PDT 2017 Final LTR library entries: 2963 Final LTR library size (bit): 5057379

Sat Aug 26 03:22:21 PDT 2017 Total intact LTRs found: 1674 Total intact non-TGCA LTRs found: 11

rm: cannot remove dummy060817.fa*': No such file or directory rm: cannot remove/projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.cat.gz': No such file or directory rm: cannot remove /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.LTRlib.fa.n*': No such file or directory rm: cannot remove/projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.nmtf': No such file or directory Sat Aug 26 03:22:42 PDT 2017 All analyses were finished!

############################## ####### Result files ######### ##############################

Table output for intact LTRs (detailed info) /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.pass.list (All LTRs) /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.nmtf.pass.list (Non-TGCA LTRs)

LTR library /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.LTRlib.fa (All non-redundant LTRs) /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.nmtf.LTRlib.fa (Non-TGCA LTRs)

GFF3 file for intact LTRs /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00.mod.pass.list.gff3

Sequence stats for WS77111v2.00.mod.LTRlib.fa:

n n:0 L50 min N80 N50 N20 E-size max sum name
2963 2963 654 11 1343 2303 4276 2744 11144 4824649 WS77111v2.00.mod.LTRlib.fa

Is it possible that the short sequences are truncated LTRs?

Thanks, Austin

oushujun commented 7 years ago

Hi Austin,

Thank you for providing your report. I didn't find any serious errors despite a cleanup bug in the code that I have fixed. My biggest guess is that the short seqs are produced during non-LTR sequence purging. I will take a further look at the code and get back to you later. Right now I think you can just remove those shorter than 100bp and move on your analyses. Sorry for the inconvenience.

Good luck!

Shujun

On Aug 30, 2017 1:23 PM, "Austin Hammond" notifications@github.com wrote:

Hi Shujun,

Unfortunately, even with making sure that each split piece had its own candidate predictions for LTR_retriever to assess, I still have many short LTR sequences in LTRlib.fa:

Switched to perl-5.26.0.

########################## LTR_retriever v1.3

##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome /projects/spruceup/pglauca/WS77111/annotation/repeat-elements/version2/WS77111v2.00 -infinder /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.finder.scn -inharvest /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.harvest.scn -nonTGCA /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.harvest.nonTGCA.scn -threads 8 -noanno

Previous LTR_retriever results found, backed up to LTRretriever-pre08-25-17_1716

Fri Aug 25 17:16:50 PDT 2017 The longest sequence ID in the genome contains 605 characters, which is longer than the limit (15) Trying to reformat seq IDs... Attempt 1... Fri Aug 25 17:17:46 PDT 2017 Seq ID conversion successful!

Fri Aug 25 17:17:46 PDT 2017 Start to convert inputs... Total candidates: 21146 Total uniq candidates: 20386

Fri Aug 25 17:19:06 PDT 2017 Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Fri Aug 25 17:31:36 PDT 2017 11977 clean candidates remained

Fri Aug 25 17:31:36 PDT 2017 Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

Fri Aug 25 20:26:19 PDT 2017 Intact LTR found: 1713

Fri Aug 25 20:29:46 PDT 2017 Start to analyze truncated LTRs... Truncated LTRs without the intact version will be retained in the LTR library. Use -notrunc if you don't want to keep them.

Fri Aug 25 20:29:46 PDT 2017 948 truncated LTRs found Fri Aug 25 20:36:25 PDT 2017 273 truncated LTR sequences have added to the library

Fri Aug 25 20:36:25 PDT 2017 Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 2297 Fri Aug 25 21:17:56 PDT 2017 Retained clean sequence: 2296

Fri Aug 25 21:17:56 PDT 2017 Sequence clustering for /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.ltrTE ... Fri Aug 25 21:17:56 PDT 2017 Unique lib sequence: 2260

Fri Aug 25 21:20:03 PDT 2017 Start to analyze non-TGCA LTR candidates... Total non-TGCA candidates: 45472 Fri Aug 25 21:20:03 PDT 2017 Start to remove non-TGCA candidates that are

=60% identical to TGCA LTRs... Fri Aug 25 22:20:33 PDT 2017 Total uniq non-TGCA candidates: 20059

Fri Aug 25 22:20:34 PDT 2017 Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Fri Aug 25 22:23:24 PDT 2017 13166 clean non-TGCA candidates remained

Fri Aug 25 22:23:24 PDT 2017 Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

Sat Aug 26 03:03:06 PDT 2017 Intact non-TGCA LTR found: 18

Sat Aug 26 03:03:29 PDT 2017 Start to analyze truncated LTRs... Truncated LTRs without the intact version will be retained in the LTR library. Use -notrunc if you don't want to keep them.

Sat Aug 26 03:03:29 PDT 2017 777 truncated LTRs found Sat Aug 26 03:05:34 PDT 2017 1367 truncated LTR sequences have added to the library

Sat Aug 26 03:05:34 PDT 2017 Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 1403 Sat Aug 26 03:18:08 PDT 2017 Retained clean sequence: 1403

Sat Aug 26 03:18:11 PDT 2017 Start to remove nested insertions in internal regions... Sat Aug 26 03:21:21 PDT 2017 Raw internal region size (bit): 7058461 Clean internal region size (bit): 2689986

Sat Aug 26 03:21:21 PDT 2017 Start to make non-redundant library...

Sat Aug 26 03:22:21 PDT 2017 Final LTR library entries: 2963 Final LTR library size (bit): 5057379

Sat Aug 26 03:22:21 PDT 2017 Total intact LTRs found: 1674 Total intact non-TGCA LTRs found: 11

rm: cannot remove dummy060817.fa': No such file or directory rm: cannot remove/projects/spruceup/pglauca/WS77111/annotation/ repeat-elements/version2/WS77111v2.00.mod.cat.gz': No such file or directory rm: cannot remove /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.LTRlib.fa.n': No such file or directory rm: cannot remove/projects/spruceup/pglauca/WS77111/annotation/ repeat-elements/version2/WS77111v2.00.mod.nmtf': No such file or directory Sat Aug 26 03:22:42 PDT 2017 All analyses were finished!

############################## ####### Result files ######### ##############################

Table output for intact LTRs (detailed info) /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.pass.list (All LTRs) /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.nmtf.pass.list (Non-TGCA LTRs)

LTR library /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.LTRlib.fa (All non-redundant LTRs) /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.nmtf.LTRlib.fa (Non-TGCA LTRs)

GFF3 file for intact LTRs /projects/spruceup/pglauca/WS77111/annotation/repeat- elements/version2/WS77111v2.00.mod.pass.list.gff3

Sequence stats for WS77111v2.00.mod.LTRlib.fa: n n:0 L50 min N80 N50 N20 E-size max sum name 2963 2963 654 11 1343 2303 4276 2744 11144 4824649 WS77111v2.00.mod.LTRlib.fa

Is it possible that the short sequences are truncated LTRs?

Thanks, Austin

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/7#issuecomment-326060943, or mute the thread https://github.com/notifications/unsubscribe-auth/AFt-NA2FYm41I0MXmACUTZMoX7zWKhT3ks5sdZqmgaJpZM4O1GmY .

sahammond commented 7 years ago

Thanks Shujun! I'll go ahead and remove the shorter sequences and use the rest.

Cheers, Austin

oushujun commented 7 years ago

Hi Austin,

I have checked the code and yes, there was a bug in the version you used. Some candidates with a very short length of the internal region were slipped through the length filter. These candidates should be removed because they are mostly not LTR elements. I have fixed the bug in the latest version and updated last week. You don't have to rerun all analyses to save time, but please remove those short sequences (INT in usual, let me know if they are not) as well as the related LTR sequence. You can search the coordinate of the INT seq in the pass.list file to find out the coordinate of the related LTR region. I am very sorry for the inconvenience. Please don't hesitate to let me know if there is anything I can help with.

Cheers, Shujun

sahammond commented 6 years ago

Hi Shujun,

Thanks for following up. I omitted sequences that were < 100 bp long in the combined post-cdhit LTRlib.fa file (all.LTRlib.fa.uniq, as noted in one of your previous posts), then supplied this fasta as part of a library to RepeatMasker. Is this approach sufficient? I'm not currently analyzing the LTR sequences themselves, rather I'm using the whole/fragmented retrotransposons primarily for genome masking.

Cheers, Austin

oushujun commented 6 years ago

Hi Austin,

Yes, this should be enough. I will close this thread then. Thank you for using LTR_retriever. Good luck!

Cheers, Shujun