oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0
193 stars 40 forks source link

Low LAI for a plant assembly with 100x PacBio data #48

Closed baozg closed 5 years ago

baozg commented 5 years ago

Hi Shujun,

I assembly a 335M diploid plant genome by Falcon-Unzip with 100x PacBio Sequel data, contig N50 9M. After using HiC data, I get chromosome-scale length genome with scaffold N50 32M.
I use LTR_retiever to find the LTR in this genome,41.19% of genome are LTR sequence,but the LAI is just 5.22 for this assembly.

It is very strange that a very contigous genome assembly only have LAI 5.22. What‘s the cause of low LAI ?Could be the assembly error?

Here is the full command I use.

# LTR_FINDER_parallel (it's super fast, Thanks for sharing, use the default parametets in perl script)
perl LTR_FINDER_parallel -seq genome.fa -threads 24 

# LTRharvest
gt suffixerator -db genome.fa -indexname genome.fa -tis -suf -lcp -des -ssp -sds -dna
gt ltrharvest -index genome.fa -similar 85 -vic 10 -seed 20 -seqids yes -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motifTGCA -motifmis 1 > ltr.harvest.scn

# LTR_retriever 
LTR_retriever -genome genome.fa -infinder genome.fa.finder.combine.scn -inharvest ltr.harvest.scn -threads 24

Here is the full log of the LTR_retriever

Thu Jul 11 14:21:09 CST 2019    Dependency checking: All passed!
Thu Jul 11 14:21:37 CST 2019    LTR_retriever is starting from the Init step.
Thu Jul 11 14:21:40 CST 2019    Start to convert inputs...
                                Total candidates: 4637
                                Total uniq candidates: 4637

Thu Jul 11 14:21:45 CST 2019    Module 1: Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Thu Jul 11 14:25:34 CST 2019    3862 clean candidates remained

Thu Jul 11 14:25:34 CST 2019    Modules 2-5: Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Jul 11 14:31:17 CST 2019    Intact LTR-RT found: 867

Thu Jul 11 14:31:32 CST 2019    Module 6: Start to analyze truncated LTR-RTs...
                                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                                Use -notrunc if you don't want to keep them.

Thu Jul 11 14:31:32 CST 2019    489 truncated LTR-RTs found
Thu Jul 11 14:32:26 CST 2019    189 truncated LTR sequences have added to the library

Thu Jul 11 14:32:26 CST 2019    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 908
Thu Jul 11 14:35:24 CST 2019    Retained clean sequence: 907

Thu Jul 11 14:35:24 CST 2019    Sequence clustering for genome.fa.ltrTE ...
Thu Jul 11 14:35:24 CST 2019    Unique lib sequence: 880

Thu Jul 11 14:35:42 CST 2019    Module 6: Start to remove nested insertions in internal regions...
Thu Jul 11 14:37:29 CST 2019    Raw internal region size (bit): 2410618
                                Clean internal region size (bit): 1600078

Thu Jul 11 14:37:29 CST 2019    Sequence number of the redundant LTR-RT library: 2790
                                The redundant LTR-RT library size (bit): 6059706

Thu Jul 11 14:37:29 CST 2019    Module 8: Start to make non-redundant library...

Thu Jul 11 14:37:41 CST 2019    Final LTR-RT library entries: 796
                                Final LTR-RT library size (bit): 1876515

Thu Jul 11 14:37:41 CST 2019    Total intact LTR-RTs found: 867
                                Total intact non-TGCA LTR-RTs found: 87

Thu Jul 11 14:37:43 CST 2019    Start to annotate whole-genome LTR-RTs...
                                Use -noanno if you don't want whole-genome LTR-RT annotation.
######################################
### LTR Assembly Index (LAI) beta3.1 ###
######################################

Developer: Shujun Ou

Please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730

Parameters: -genome genome.fa -intact genome.fa.pass.list -all genome.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/

Thu Jul 11 15:11:12 CST 2019    Dependency checking: Passed!
Thu Jul 11 15:11:12 CST 2019    Calculation of LAI will be based on the whole genome.
                                Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.
Thu Jul 11 15:11:12 CST 2019    Estimate the identity of LTR sequences in the genome: quick mode
Thu Jul 11 20:47:46 CST 2019    The identity of LTR sequences: 93.6033346761206%
Thu Jul 11 20:47:46 CST 2019    Calculate LAI:

                                                Done!

Thu Jul 11 20:47:54 CST 2019    Result file: genome.fa.out.LAI

                                You may use either raw_LAI or LAI for intraspecific comparison

But the final result of LAI is very low, here is the result.

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       336252733       0.0169  0.4119  4.10    5.22
oushujun commented 5 years ago

Hi @baozg ,

For LTR_FINDER_parallel you should add the -harvest_out parameter, then concatenate the output with the LTRharvest output as the -inharvest input for LTR_retriever.

Hope this helps!

Best, Shujun

baozg commented 5 years ago

Hi Shujun,

I try the LTR_FINDER_parallel harvest out, this run did get more intact LTR-RTs (from 867 to 1360), but the LAI increase a little (form 5.22 to 6.70 ).

Here is the new command I use:

perl LTR_FINDER_parallel -harvest_out -seq genome.fa -threads 24

cat genome.fa.finder.combine.scn ltr.harvest.scn > genome.all.ltr.scn

LTR_retriever -genome genome.fa -inharvest genome.all.ltr.scn -threads 24

Here is the log for LTR-retriever.

Sat Jul 13 00:09:53 CST 2019    Dependency checking: All passed!
Sat Jul 13 00:10:19 CST 2019    LTR_retriever is starting from the Init step.
Sat Jul 13 00:10:22 CST 2019    Start to convert inputs...
                                Total candidates: 6613
                                Total uniq candidates: 6104

Sat Jul 13 00:10:26 CST 2019    Module 1: Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Sat Jul 13 00:14:43 CST 2019    5236 clean candidates remained

Sat Jul 13 00:14:43 CST 2019    Modules 2-5: Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Sat Jul 13 00:24:16 CST 2019    Intact LTR-RT found: 1493

Sat Jul 13 00:24:42 CST 2019    Module 6: Start to analyze truncated LTR-RTs...
                                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                                Use -notrunc if you don't want to keep them.

Sat Jul 13 00:24:42 CST 2019    592 truncated LTR-RTs found
Sat Jul 13 00:25:46 CST 2019    117 truncated LTR sequences have added to the library

Sat Jul 13 00:25:46 CST 2019    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1187
Sat Jul 13 00:29:00 CST 2019    Retained clean sequence: 1186
Sat Jul 13 00:29:00 CST 2019    Sequence clustering for castorbean.fa.ltrTE ...
Sat Jul 13 00:29:00 CST 2019    Unique lib sequence: 1171

Sat Jul 13 00:29:31 CST 2019    Module 6: Start to remove nested insertions in internal regions...
Sat Jul 13 00:31:44 CST 2019    Raw internal region size (bit): 3536462
                                Clean internal region size (bit): 2067109

Sat Jul 13 00:31:44 CST 2019    Sequence number of the redundant LTR-RT library: 4302
                                The redundant LTR-RT library size (bit): 9824575

Sat Jul 13 00:31:44 CST 2019    Module 8: Start to make non-redundant library...

Sat Jul 13 00:32:00 CST 2019    Final LTR-RT library entries: 1043
                                Final LTR-RT library size (bit): 2443739

Sat Jul 13 00:32:00 CST 2019    Total intact LTR-RTs found: 1360
                                Total intact non-TGCA LTR-RTs found: 88

######################################
### LTR Assembly Index (LAI) beta3.1 ###
######################################

Developer: Shujun Ou

Please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: h
ttps://doi.org/10.1093/nar/gky730

Parameters: -genome genome.fa -intact genome.fa.pass.list -all genome.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/

Sat Jul 13 01:02:16 CST 2019    Dependency checking: Passed!
Sat Jul 13 01:02:16 CST 2019    Calculation of LAI will be based on the whole genome.
                                Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeolo
gues will overcorrect raw LAI scores.
Sat Jul 13 01:02:16 CST 2019    Estimate the identity of LTR sequences in the genome: quick mode
Sat Jul 13 06:32:46 CST 2019    The identity of LTR sequences: 93.9257712264989%
Sat Jul 13 06:32:46 CST 2019    Calculate LAI:

                                                Done!

Sat Jul 13 06:32:54 CST 2019    Result file: castorbean.fa.out.LAI

                                You may use either raw_LAI or LAI for intraspecific comparison
                                but please use ONLY LAI for interspecific comparison

Here is the LAI result

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       336252733       0.0270  0.4163  6.49    6.70
oushujun commented 5 years ago

Hi @baozg ,

Thanks for the update. The programs run correctly based on the information you provide. Having high-coverage Pacbio data does not guarantee a high continuous assembly. Next you may want to plot out the age distribution of intact LTRs in the .pass.list and try to figure why you get lower LAI. For high-quality genomes, young LTRs are the most abundant and their age tend to be <0.2 MY and decrease exponentially.

Best, Shujun

baozg commented 5 years ago

Hi Shujun,

Thanks for the reply. I found that Falcon Unzip result have 1529 intact LTR, LAI 8.92,HiC scaffolding indeed introduce some error in LTR region which BUSCO cannot find. I should add the LAI check for each assembly. Thanks for LTR_FINDER_parallel and LTR_retreiver.

oushujun commented 5 years ago

Hello,

Thanks for your feedback! I never deal with HiC data before, but many sources indicate that HiC is messing up TE assemblies especially those highly repetitive LTR regions. Please let me know if you have more questions, good luck!

Best, Shujun

On Fri, Jul 19, 2019, 8:59 AM Zhigui Bao notifications@github.com wrote:

Hi Shujun,

Thanks for the reply. I found that Falcon Unzip result have 1529 intact LTR, LAI 8.92,HiC scaffolding indeed introduce some error in LTR region which BUSCO cannot find. I should add the LAI check for each assembly. Thanks for LTR_FINDER_parallel and LTR_retreiver.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/48?email_source=notifications&email_token=ABNX4NCJYITRG3A2HJRLHBDQAHCFVA5CNFSM4IBJGBSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2LWR7Q#issuecomment-513239294, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNX4NDLYYEPTMGRXW37JV3QAHCFVANCNFSM4IBJGBSA .

baozg commented 5 years ago

Hi, Shujun,

I also use a genetic map to anchroing the scaffold(LG assembly,20000+marker order by Lep-Map). The strange things is the raw_LAI (6.49 vs 6.48)and Intact LTR(1489 vs 1493)of LG assembly seems same with the HiC assembly(after scaffolding, gapfilling by PBjelly and polish by Arrow and Pilon), but the LAI was higher(8.43 vs 6.70). Here is the log.

Sun Jul 28 02:57:38 CST 2019    Dependency checking: All passed!
Sun Jul 28 02:58:05 CST 2019    LTR_retriever is starting from the Init step.
Sun Jul 28 02:58:08 CST 2019    Start to convert inputs...
                                Total candidates: 6610
                                Total uniq candidates: 6093

Sun Jul 28 02:58:12 CST 2019    Module 1: Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Sun Jul 28 03:02:20 CST 2019    5252 clean candidates remained

Sun Jul 28 03:02:20 CST 2019    Modules 2-5: Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Sun Jul 28 03:11:26 CST 2019    Intact LTR-RT found: 1489

Sun Jul 28 03:11:52 CST 2019    Module 6: Start to analyze truncated LTR-RTs...
                                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                                Use -notrunc if you don't want to keep them.

Sun Jul 28 03:11:52 CST 2019    602 truncated LTR-RTs found
Sun Jul 28 03:12:59 CST 2019    130 truncated LTR sequences have added to the library

Sun Jul 28 03:12:59 CST 2019    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1190
Sun Jul 28 03:16:10 CST 2019    Retained clean sequence: 1188

Sun Jul 28 03:16:10 CST 2019    Sequence clustering for LG.fa.ltrTE ...
Sun Jul 28 03:16:10 CST 2019    Unique lib sequence: 1173

Sun Jul 28 03:16:40 CST 2019    Module 6: Start to remove nested insertions in internal regions...
Sun Jul 28 03:19:08 CST 2019    Raw internal region size (bit): 3532479
                                Clean internal region size (bit): 2123745

Sun Jul 28 03:19:09 CST 2019    Sequence number of the redundant LTR-RT library: 4296
                                The redundant LTR-RT library size (bit): 9814884

Sun Jul 28 03:19:09 CST 2019    Module 8: Start to make non-redundant library...

Sun Jul 28 03:19:24 CST 2019    Final LTR-RT library entries: 1048
                                Final LTR-RT library size (bit): 2497089

Sun Jul 28 03:19:25 CST 2019    Total intact LTR-RTs found: 1353
                                Total intact non-TGCA LTR-RTs found: 88

Sun Jul 28 03:19:27 CST 2019    Start to annotate whole-genome LTR-RTs...
                                Use -noanno if you don't want whole-genome LTR-RT annotation.

######################################
### LTR Assembly Index (LAI) beta3.1 ###
######################################

Developer: Shujun Ou

Please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730

Parameters: -genome LG.fa -intact LG.fa.pass.list -all LG.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/

Sun Jul 28 03:47:52 CST 2019    Dependency checking: Passed!
Sun Jul 28 03:47:52 CST 2019    Calculation of LAI will be based on the whole genome.
                                Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.
Sun Jul 28 03:47:52 CST 2019    Estimate the identity of LTR sequences in the genome: quick mode
Sun Jul 28 08:48:05 CST 2019    The identity of LTR sequences: 93.3083019677034%
Sun Jul 28 08:48:05 CST 2019    Calculate LAI:

                                                Done!
Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       333678842       0.0270  0.4175  6.48    8.43
oushujun commented 5 years ago

Scaffolding won't help to assemble LTRs or any other TEs because that's just the ordering of scaffolds with Ns in between. The similar raw LAIs tell you that.

The reason you are seeing higher LAI of the LG assembly is because you don't have arrow polishing for it, so that the assembly still contains a fair amount of sequencing errors which are transformed to a higher LTR diveristy and enentually made the LAI higher. This is not right. LAI cannot distinguish whether a mutation is biological true or just a sequencing error. You need to make sure the genome is relatively free of sequencing error then calculate LAI.

To find out if you still have sequencing errors, simply plot out the age distribution of intact LTRs in the pass.list file with intervals of 0.2 MY. Most LTRs have age of 0. If you see the peak is shifted to older age, then you may have some uncorrected sequencing errors.

On Sun, Jul 28, 2019, 8:18 PM Zhigui Bao notifications@github.com wrote:

Hi, Shujun,

I also use a genetic map to anchroing the scaffold(LG assembly,20000+marker order by Lep-Map). The strange things is the raw_LAI (6.49 vs 6.48)and Intact LTR(1489 vs 1493)of LG assembly seems same with the HiC assembly(after scaffolding, gapfilling by PBjelly and polish by Arrow and Pilon), but the LAI was higher(8.43 vs 6.70). Here is the log.

Sun Jul 28 02:57:38 CST 2019 Dependency checking: All passed!

Sun Jul 28 02:58:05 CST 2019 LTR_retriever is starting from the Init step.

Sun Jul 28 02:58:08 CST 2019 Start to convert inputs...

                            Total candidates: 6610

                            Total uniq candidates: 6093

Sun Jul 28 02:58:12 CST 2019 Module 1: Start to clean up candidates...

                            Sequences with 10 missing bp or 0.8 missing data rate will be discarded.

                            Sequences containing tandem repeats will be discarded.

Sun Jul 28 03:02:20 CST 2019 5252 clean candidates remained

Sun Jul 28 03:02:20 CST 2019 Modules 2-5: Start to analyze the structure of candidates...

                            The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Sun Jul 28 03:11:26 CST 2019 Intact LTR-RT found: 1489

Sun Jul 28 03:11:52 CST 2019 Module 6: Start to analyze truncated LTR-RTs...

                            Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.

                            Use -notrunc if you don't want to keep them.

Sun Jul 28 03:11:52 CST 2019 602 truncated LTR-RTs found Sun Jul 28 03:12:59 CST 2019 130 truncated LTR sequences have added to the library

Sun Jul 28 03:12:59 CST 2019 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 1190 Sun Jul 28 03:16:10 CST 2019 Retained clean sequence: 1188

Sun Jul 28 03:16:10 CST 2019 Sequence clustering for LG.fa.ltrTE ... Sun Jul 28 03:16:10 CST 2019 Unique lib sequence: 1173

Sun Jul 28 03:16:40 CST 2019 Module 6: Start to remove nested insertions in internal regions... Sun Jul 28 03:19:08 CST 2019 Raw internal region size (bit): 3532479 Clean internal region size (bit): 2123745

Sun Jul 28 03:19:09 CST 2019 Sequence number of the redundant LTR-RT library: 4296 The redundant LTR-RT library size (bit): 9814884

Sun Jul 28 03:19:09 CST 2019 Module 8: Start to make non-redundant library...

Sun Jul 28 03:19:24 CST 2019 Final LTR-RT library entries: 1048 Final LTR-RT library size (bit): 2497089

Sun Jul 28 03:19:25 CST 2019 Total intact LTR-RTs found: 1353 Total intact non-TGCA LTR-RTs found: 88

Sun Jul 28 03:19:27 CST 2019 Start to annotate whole-genome LTR-RTs... Use -noanno if you don't want whole-genome LTR-RT annotation.

######################################

LTR Assembly Index (LAI) beta3.1

######################################

Developer: Shujun Ou

Please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730

Parameters: -genome LG.fa -intact LG.fa.pass.list -all LG.fa.out -t 24 -q -blast /data/software/Anaconda3/envs/maker/bin/

Sun Jul 28 03:47:52 CST 2019 Dependency checking: Passed!

Sun Jul 28 03:47:52 CST 2019 Calculation of LAI will be based on the whole genome.

                            Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.

Sun Jul 28 03:47:52 CST 2019 Estimate the identity of LTR sequences in the genome: quick mode

Sun Jul 28 08:48:05 CST 2019 The identity of LTR sequences: 93.3083019677034%

Sun Jul 28 08:48:05 CST 2019 Calculate LAI:

                                            Done!

Chr From To Intact Total raw_LAI LAI

whole_genome 1 333678842 0.0270 0.4175 6.48 8.43

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/48?email_source=notifications&email_token=ABNX4NAY4BKKHTLLJY5ENA3QBZAODA5CNFSM4IBJGBSKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD27K5IQ#issuecomment-515813026, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNX4NGE3RYHS23YGXW6HL3QBZAODANCNFSM4IBJGBSA .