oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

invertebrate support #34

Closed lurebgi closed 5 years ago

lurebgi commented 5 years ago

Hi,

I was wondering if LTR_retriever supports invertebrate genomes. We have an amphioxus genome assembled from 60X PacBio sequencing; however, the LAI score is only 7.07. Moreover, all of the 206 LTRs in LTRlib.fa were classified as 'Unknown'. Does this look normal to you?

Thank you!

Luohao

oushujun commented 5 years ago

Hi Luohao,

I tried LTR_retriever on fruit fly, mouse, micro- and megabats, and human, and it worked much as it does in plants, although most of these species have much lower LTR content in their genomes. LAI requires at least 5% total LTR and 0.1% intact LTR sequence in the genome for accurate evaluation, so you may need to check these two values.
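As a rough sketch of the check described above, the two thresholds (5% total LTR, 0.1% intact LTR) can be tested with a small shell function. The function name `lai_prereq_ok` is hypothetical and not part of LTR_retriever, and the two percentages must be supplied by hand from your own annotation summary.

```shell
# Hypothetical helper, not part of LTR_retriever.
# Usage: lai_prereq_ok <total_LTR_percent> <intact_LTR_percent>
# Prints "ok" if both thresholds quoted in this thread are met
# (>= 5% total LTR and >= 0.1% intact LTR), otherwise "too_low".
lai_prereq_ok() {
    awk -v t="$1" -v i="$2" 'BEGIN {
        if (t + 0 >= 5 && i + 0 >= 0.1) { print "ok"; exit 0 }
        print "too_low"; exit 1
    }'
}

# Illustrative values only:
# lai_prereq_ok 0.9 0.05
```

If the function prints `too_low`, the LAI value should be read with caution rather than as evidence of a poor assembly.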

For classification of LTR superfamilies, LTR_retriever uses models trained on rice LTR classifications, so the same model may not be applicable to invertebrate genomes. However, the classification information is not a major factor in identifying LTR elements. You may need to do the classification yourself based on the identified LTR elements.

Best, Shujun

lurebgi commented 5 years ago

Hi Shujun,

Thanks for your email. In amphioxus it seems LTR content is less than 1%, so that might be the reason.

On another note, LTR_retriever annotated 25.27% LTRs (according to the .tbl file) in a tilapia genome, while the actual proportion should be about 4%. I wonder if it has a lot of false positives? The LAI score is also unexpectedly low for a PacBio assembly: 3.01. Below is the script I used. Would you have any suggestions for reducing false positives?

```
/apps/genometools/1.5.9/bin/gt suffixerator -db $genome -indexname gt_index/$g -suf -lcp -des -ssp -sds -dna
/apps/genometools/1.5.9/bin/gt ltrharvest -index gt_index/$g -maxlenltr 7000 -maxtsd 6 -mintsd 4 -seqids yes -vic 10 -similar 90 -seed 20 > $g.harvest.scn
/apps/genometools/1.5.9/bin/gt ltrharvest -index gt_index/$g -maxlenltr 7000 -maxtsd 6 -mintsd 4 -seqids yes -vic 10 -similar 90 -seed 20 -motif TGCA -motifmis 1 > $g.harvest.motif.scn

/scratch/luohao/software/LTR_Finder/source/ltr_finder -D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9 $genome > $g.finder.scn

perl /scratch/luohao/software/mgescan-1.1/mgescan/ltr/find_ltr.pl -seq=$genome -min-ltr=100 -max-ltr=7000 -min_iden=90

/scratch/luohao/software/LTR_retriever-2.0/LTR_retriever -genome $g -nonTGCA $g.harvest.scn -inharvest $g.harvest.motif.scn -infinder $g.finder.scn -threads=20
```

Thanks!

wangzhennan14 commented 5 years ago

Hi Luohao, where did you download mgescan-1.1? Can you give me the URL? I have downloaded three mgescan packages, but none of them worked.

Thank you very much! Zhennan

lurebgi commented 5 years ago

I did not actually use mgescan-1.1, as Shujun suggested in some of the threads.

oushujun commented 5 years ago

@wangzhennan14

For MGEScan_LTR please refer to #8 and #19. Let me know if you need further help, thanks!

Shujun

oushujun commented 5 years ago

@lurebgi

Sorry for the delayed response (somehow I thought I had replied).

Your commands look good, but I have no idea about the total LTR content of amphioxus. If you suspect a high proportion of false positives, you may manually curate a couple of them to verify (try NCBI BLAST and see what they are). If you do find some, please post example sequences here with 100 bp of flanking sequence on the upstream and downstream sides, which would help with debugging.
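One way to prepare such example sequences, sketched under assumptions: pad the candidate's coordinates by 100 bp and pass the resulting region to `samtools faidx`. The function `flank_region` is a hypothetical helper, coordinates are taken to be 1-based and inclusive, and only the start is clamped, so an end past the sequence length is not corrected here.

```shell
# Hypothetical helper: print a "chr:start-end" region string for a candidate
# element, padded by a flank on each side (default 100 bp).
# Usage: flank_region <chr> <start> <end> [flank]
# The body runs in a subshell so its variables do not leak to the caller.
flank_region() (
    chr=$1; start=$2; end=$3; flank=${4:-100}
    s=$(( start - flank ))
    e=$(( end + flank ))
    # Clamp the padded start at the first base of the sequence.
    if [ "$s" -lt 1 ]; then s=1; fi
    printf '%s:%d-%d\n' "$chr" "$s" "$e"
)

# Example use (assumes samtools is installed and $genome is the FASTA file;
# chr1 5000 12000 are made-up coordinates):
# samtools faidx "$genome" "$(flank_region chr1 5000 12000)" > candidate.fa
```

The extracted FASTA record can then be pasted into the issue or submitted to NCBI BLAST.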

If LTR content is too low, then LAI is not accurate. You may plot the regional LAI values in the *.LAI file to see if the distribution is uneven. Using long reads is not a guarantee of assembly quality, which also depends on many other factors.
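Before plotting, a quick numerical look at the regional values can already show whether they are uneven. The sketch below assumes the *.LAI file is a tab-separated table with one header row, an optional `whole_genome` summary row, and the LAI value in the last column; check this against your own file first. `lai_window_summary` is a hypothetical name.

```shell
# Hypothetical helper: summarize regional LAI values from a *.LAI file.
# Assumed layout (verify on your file): tab-separated, one header row,
# an optional "whole_genome" row, LAI value in the last column.
# Usage: lai_window_summary <file.LAI>
lai_window_summary() {
    awk -F'\t' '
        NR == 1 { next }                  # skip the header row
        $1 == "whole_genome" { next }     # skip the genome-wide row, if present
        {
            v = $NF + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v; n++
        }
        END {
            if (n == 0) { print "no windows"; exit 1 }
            printf "n=%d min=%.2f max=%.2f mean=%.2f\n", n, min, max, sum / n
        }' "$1"
}
```

A wide gap between the minimum and maximum suggests that some regions are assembled much better than others, which is worth following up with an actual plot.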

Shujun

lurebgi commented 5 years ago

Hi, thanks for getting back to me. Yes you replied before on the amphioxus issue. However, I am no longer interested in amphioxus LTR since there are not many anyway.

My second question (sorry for mixing up questions) was about a cichlid fish (tilapia), which should have about 4% LTR. If you are interested in the false positives, maybe you can download the genome from https://www.ncbi.nlm.nih.gov/assembly/GCF_001858045.2 and test your program? Sorry, but for now I am not going to analyze the LTR_retriever results for tilapia any further.

L

oushujun commented 5 years ago

@lurebgi I am curious how the 4% LTR in tilapia is estimated?

lurebgi commented 5 years ago

By RepeatMasker, using a library from Repbase plus a RepeatModeler library. This paper shows a similar result: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3723-5

oushujun commented 5 years ago

@lurebgi Repbase is a database of known TEs. LTR element sequences vary widely between species, so using LTR sequences from other species to identify tilapia LTR sequences is likely to give an underestimate. RepeatModeler is a general method for TE identification. It makes some attempt to classify TEs, but in our experience the classification is not accurate. RepeatModeler can work as a supplement after some good identifications, but Repbase is not a good approach for LTR finding.

lurebgi commented 5 years ago

Thanks for the explanation. However, according to https://www.nature.com/articles/nature13726, it is likely true that cichlid fish (including tilapia) have a relatively low content of LTRs. That said, it would be very interesting to note that LTR_retriever actually identified many unannotated LTRs in cichlids.

oushujun commented 5 years ago

@lurebgi Thanks for sharing the paper. I read the methods section. The TE annotations were based on RepeatModeler or RepeatScout, so this is kind of a loop. Since both methods are copy-number based, low-copy-number TEs will be missed. You may try to figure out what new elements are annotated by LTR_retriever. I'll be happy to see how it works or fails.