Closed baozg closed 5 years ago
Hi @baozg ,
For LTR_FINDER_parallel
you should add the -harvest_out
parameter, then concatenate the output with the LTRharvest
output as the -inharvest
input for LTR_retriever
.
Hope this helps!
Best, Shujun
Hi Shujun,
I try the LTR_FINDER_parallel
harvest out, this run did get more intact LTR-RTs
(from 867 to 1360), but the LAI increase a little (form 5.22 to 6.70 ).
Here is the new command I use:
perl LTR_FINDER_parallel -harvest_out -seq genome.fa -threads 24
cat genome.fa.finder.combine.scn ltr.harvest.scn > genome.all.ltr.scn
LTR_retriever -genome genome.fa -inharvest genome.all.ltr.scn -threads 24
Here is the log for LTR-retriever
.
Sat Jul 13 00:09:53 CST 2019 Dependency checking: All passed!
Sat Jul 13 00:10:19 CST 2019 LTR_retriever is starting from the Init step.
Sat Jul 13 00:10:22 CST 2019 Start to convert inputs...
Total candidates: 6613
Total uniq candidates: 6104
Sat Jul 13 00:10:26 CST 2019 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.
Sat Jul 13 00:14:43 CST 2019 5236 clean candidates remained
Sat Jul 13 00:14:43 CST 2019 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.
Sat Jul 13 00:24:16 CST 2019 Intact LTR-RT found: 1493
Sat Jul 13 00:24:42 CST 2019 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.
Sat Jul 13 00:24:42 CST 2019 592 truncated LTR-RTs found
Sat Jul 13 00:25:46 CST 2019 117 truncated LTR sequences have added to the library
Sat Jul 13 00:25:46 CST 2019 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
Total library sequences: 1187
Sat Jul 13 00:29:00 CST 2019 Retained clean sequence: 1186
Sat Jul 13 00:29:00 CST 2019 Sequence clustering for castorbean.fa.ltrTE ...
Sat Jul 13 00:29:00 CST 2019 Unique lib sequence: 1171
Sat Jul 13 00:29:31 CST 2019 Module 6: Start to remove nested insertions in internal regions...
Sat Jul 13 00:31:44 CST 2019 Raw internal region size (bit): 3536462
Clean internal region size (bit): 2067109
Sat Jul 13 00:31:44 CST 2019 Sequence number of the redundant LTR-RT library: 4302
The redundant LTR-RT library size (bit): 9824575
Sat Jul 13 00:31:44 CST 2019 Module 8: Start to make non-redundant library...
Sat Jul 13 00:32:00 CST 2019 Final LTR-RT library entries: 1043
Final LTR-RT library size (bit): 2443739
Sat Jul 13 00:32:00 CST 2019 Total intact LTR-RTs found: 1360
Total intact non-TGCA LTR-RTs found: 88
######################################
### LTR Assembly Index (LAI) beta3.1 ###
######################################
Developer: Shujun Ou
Please cite:
Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: h
ttps://doi.org/10.1093/nar/gky730
Parameters: -genome genome.fa -intact genome.fa.pass.list -all genome.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/
Sat Jul 13 01:02:16 CST 2019 Dependency checking: Passed!
Sat Jul 13 01:02:16 CST 2019 Calculation of LAI will be based on the whole genome.
Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeolo
gues will overcorrect raw LAI scores.
Sat Jul 13 01:02:16 CST 2019 Estimate the identity of LTR sequences in the genome: quick mode
Sat Jul 13 06:32:46 CST 2019 The identity of LTR sequences: 93.9257712264989%
Sat Jul 13 06:32:46 CST 2019 Calculate LAI:
Done!
Sat Jul 13 06:32:54 CST 2019 Result file: castorbean.fa.out.LAI
You may use either raw_LAI or LAI for intraspecific comparison
but please use ONLY LAI for interspecific comparison
Here is the LAI result
Chr From To Intact Total raw_LAI LAI
whole_genome 1 336252733 0.0270 0.4163 6.49 6.70
Hi @baozg ,
Thanks for the update. The programs run correctly based on the information you provide. Having high-coverage Pacbio data does not guarantee a high continuous assembly. Next you may want to plot out the age distribution of intact LTRs in the .pass.list and try to figure why you get lower LAI. For high-quality genomes, young LTRs are the most abundant and their age tend to be <0.2 MY and decrease exponentially.
Best, Shujun
Hi Shujun,
Thanks for the reply. I found that Falcon Unzip result have 1529 intact LTR, LAI 8.92,HiC scaffolding indeed introduce some error in LTR region which BUSCO cannot find. I should add the LAI check for each assembly. Thanks for LTR_FINDER_parallel and LTR_retreiver.
Hello,
Thanks for your feedback! I never deal with HiC data before, but many sources indicate that HiC is messing up TE assemblies especially those highly repetitive LTR regions. Please let me know if you have more questions, good luck!
Best, Shujun
On Fri, Jul 19, 2019, 8:59 AM Zhigui Bao notifications@github.com wrote:
Hi Shujun,
Thanks for the reply. I found that Falcon Unzip result have 1529 intact LTR, LAI 8.92,HiC scaffolding indeed introduce some error in LTR region which BUSCO cannot find. I should add the LAI check for each assembly. Thanks for LTR_FINDER_parallel and LTR_retreiver.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/48?email_source=notifications&email_token=ABNX4NCJYITRG3A2HJRLHBDQAHCFVA5CNFSM4IBJGBSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2LWR7Q#issuecomment-513239294, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNX4NDLYYEPTMGRXW37JV3QAHCFVANCNFSM4IBJGBSA .
Hi, Shujun,
I also use a genetic map to anchroing the scaffold(LG assembly,20000+marker order by Lep-Map). The strange things is the raw_LAI (6.49 vs 6.48)and Intact LTR(1489 vs 1493)of LG assembly seems same with the HiC assembly(after scaffolding, gapfilling by PBjelly and polish by Arrow and Pilon), but the LAI was higher(8.43 vs 6.70). Here is the log.
Sun Jul 28 02:57:38 CST 2019 Dependency checking: All passed!
Sun Jul 28 02:58:05 CST 2019 LTR_retriever is starting from the Init step.
Sun Jul 28 02:58:08 CST 2019 Start to convert inputs...
Total candidates: 6610
Total uniq candidates: 6093
Sun Jul 28 02:58:12 CST 2019 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
Sequences containing tandem repeats will be discarded.
Sun Jul 28 03:02:20 CST 2019 5252 clean candidates remained
Sun Jul 28 03:02:20 CST 2019 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.
Sun Jul 28 03:11:26 CST 2019 Intact LTR-RT found: 1489
Sun Jul 28 03:11:52 CST 2019 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
Use -notrunc if you don't want to keep them.
Sun Jul 28 03:11:52 CST 2019 602 truncated LTR-RTs found
Sun Jul 28 03:12:59 CST 2019 130 truncated LTR sequences have added to the library
Sun Jul 28 03:12:59 CST 2019 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
Total library sequences: 1190
Sun Jul 28 03:16:10 CST 2019 Retained clean sequence: 1188
Sun Jul 28 03:16:10 CST 2019 Sequence clustering for LG.fa.ltrTE ...
Sun Jul 28 03:16:10 CST 2019 Unique lib sequence: 1173
Sun Jul 28 03:16:40 CST 2019 Module 6: Start to remove nested insertions in internal regions...
Sun Jul 28 03:19:08 CST 2019 Raw internal region size (bit): 3532479
Clean internal region size (bit): 2123745
Sun Jul 28 03:19:09 CST 2019 Sequence number of the redundant LTR-RT library: 4296
The redundant LTR-RT library size (bit): 9814884
Sun Jul 28 03:19:09 CST 2019 Module 8: Start to make non-redundant library...
Sun Jul 28 03:19:24 CST 2019 Final LTR-RT library entries: 1048
Final LTR-RT library size (bit): 2497089
Sun Jul 28 03:19:25 CST 2019 Total intact LTR-RTs found: 1353
Total intact non-TGCA LTR-RTs found: 88
Sun Jul 28 03:19:27 CST 2019 Start to annotate whole-genome LTR-RTs...
Use -noanno if you don't want whole-genome LTR-RT annotation.
######################################
### LTR Assembly Index (LAI) beta3.1 ###
######################################
Developer: Shujun Ou
Please cite:
Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730
Parameters: -genome LG.fa -intact LG.fa.pass.list -all LG.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/
Sun Jul 28 03:47:52 CST 2019 Dependency checking: Passed!
Sun Jul 28 03:47:52 CST 2019 Calculation of LAI will be based on the whole genome.
Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.
Sun Jul 28 03:47:52 CST 2019 Estimate the identity of LTR sequences in the genome: quick mode
Sun Jul 28 08:48:05 CST 2019 The identity of LTR sequences: 93.3083019677034%
Sun Jul 28 08:48:05 CST 2019 Calculate LAI:
Done!
Chr From To Intact Total raw_LAI LAI
whole_genome 1 333678842 0.0270 0.4175 6.48 8.43
Scaffolding won't help to assemble LTRs or any other TEs because that's just the ordering of scaffolds with Ns in between. The similar raw LAIs tell you that.
The reason you are seeing higher LAI of the LG assembly is because you don't have arrow polishing for it, so that the assembly still contains a fair amount of sequencing errors which are transformed to a higher LTR diveristy and enentually made the LAI higher. This is not right. LAI cannot distinguish whether a mutation is biological true or just a sequencing error. You need to make sure the genome is relatively free of sequencing error then calculate LAI.
To find out if you still have sequencing errors, simply plot out the age distribution of intact LTRs in the pass.list file with intervals of 0.2 MY. Most LTRs have age of 0. If you see the peak is shifted to older age, then you may have some uncorrected sequencing errors.
On Sun, Jul 28, 2019, 8:18 PM Zhigui Bao notifications@github.com wrote:
Hi, Shujun,
I also use a genetic map to anchroing the scaffold(LG assembly,20000+marker order by Lep-Map). The strange things is the raw_LAI (6.49 vs 6.48)and Intact LTR(1489 vs 1493)of LG assembly seems same with the HiC assembly(after scaffolding, gapfilling by PBjelly and polish by Arrow and Pilon), but the LAI was higher(8.43 vs 6.70). Here is the log.
Sun Jul 28 02:57:38 CST 2019 Dependency checking: All passed!
Sun Jul 28 02:58:05 CST 2019 LTR_retriever is starting from the Init step.
Sun Jul 28 02:58:08 CST 2019 Start to convert inputs...
Total candidates: 6610 Total uniq candidates: 6093
Sun Jul 28 02:58:12 CST 2019 Module 1: Start to clean up candidates...
Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.
Sun Jul 28 03:02:20 CST 2019 5252 clean candidates remained
Sun Jul 28 03:02:20 CST 2019 Modules 2-5: Start to analyze the structure of candidates...
The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.
Sun Jul 28 03:11:26 CST 2019 Intact LTR-RT found: 1489
Sun Jul 28 03:11:52 CST 2019 Module 6: Start to analyze truncated LTR-RTs...
Truncated LTR-RTs without the intact version will be retained in the LTR-RT library. Use -notrunc if you don't want to keep them.
Sun Jul 28 03:11:52 CST 2019 602 truncated LTR-RTs found Sun Jul 28 03:12:59 CST 2019 130 truncated LTR sequences have added to the library
Sun Jul 28 03:12:59 CST 2019 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 1190 Sun Jul 28 03:16:10 CST 2019 Retained clean sequence: 1188
Sun Jul 28 03:16:10 CST 2019 Sequence clustering for LG.fa.ltrTE ... Sun Jul 28 03:16:10 CST 2019 Unique lib sequence: 1173
Sun Jul 28 03:16:40 CST 2019 Module 6: Start to remove nested insertions in internal regions... Sun Jul 28 03:19:08 CST 2019 Raw internal region size (bit): 3532479 Clean internal region size (bit): 2123745
Sun Jul 28 03:19:09 CST 2019 Sequence number of the redundant LTR-RT library: 4296 The redundant LTR-RT library size (bit): 9814884
Sun Jul 28 03:19:09 CST 2019 Module 8: Start to make non-redundant library...
Sun Jul 28 03:19:24 CST 2019 Final LTR-RT library entries: 1048 Final LTR-RT library size (bit): 2497089
Sun Jul 28 03:19:25 CST 2019 Total intact LTR-RTs found: 1353 Total intact non-TGCA LTR-RTs found: 88
Sun Jul 28 03:19:27 CST 2019 Start to annotate whole-genome LTR-RTs... Use -noanno if you don't want whole-genome LTR-RT annotation.
######################################
LTR Assembly Index (LAI) beta3.1
######################################
Developer: Shujun Ou
Please cite:
Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730
Parameters: -genome LG.fa -intact LG.fa.pass.list -all LG.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/
Sun Jul 28 03:47:52 CST 2019 Dependency checking: Passed!
Sun Jul 28 03:47:52 CST 2019 Calculation of LAI will be based on the whole genome.
Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.
Sun Jul 28 03:47:52 CST 2019 Estimate the identity of LTR sequences in the genome: quick mode
Sun Jul 28 08:48:05 CST 2019 The identity of LTR sequences: 93.3083019677034%
Sun Jul 28 08:48:05 CST 2019 Calculate LAI:
Done!
Chr From To Intact Total raw_LAI LAI
whole_genome 1 333678842 0.0270 0.4175 6.48 8.43
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/48?email_source=notifications&email_token=ABNX4NAY4BKKHTLLJY5ENA3QBZAODA5CNFSM4IBJGBSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD27K5IQ#issuecomment-515813026, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNX4NGE3RYHS23YGXW6HL3QBZAODANCNFSM4IBJGBSA .
Hi Shujun,
I assembly a 335M diploid plant genome by Falcon-Unzip with 100x PacBio Sequel data, contig N50 9M. After using HiC data, I get chromosome-scale length genome with scaffold N50 32M.
I use LTR_retiever to find the LTR in this genome,41.19% of genome are LTR sequence,but the LAI is just 5.22 for this assembly.
It is very strange that a very contigous genome assembly only have LAI 5.22. What‘s the cause of low LAI ?Could be the assembly error?
Here is the full command I use.
Here is the full log of the LTR_retriever
But the final result of LAI is very low, here is the result.