oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identifying LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

very large reduction in LTRs when using LTR_retriever #11

Closed by kaylahardwick 6 years ago

kaylahardwick commented 6 years ago

Hello! I am using LTR_retriever in conjunction with results from LTRharvest, and I have a general question about the LTR_retriever results. I have found that LTR_retriever greatly reduces the number of elements discovered, going from 1220 in the LTRharvest output file to a total of 41 in the LTRlib.fa and nmtf.LTRlib.fa files. When I run RepeatMasker with the LTRharvest output I get around 48% of total bases in my genome masked, whereas when I RepeatMask with the LTR_retriever output, just 6% of the bases are masked. I know that LTR_retriever is designed specifically to remove unreliable candidate sequences from the library, but with the drastic difference in results when I implement the program, I just want to make sure I'm doing everything correctly. We expect a high repeat content for our genome, but I know that LTRharvest can produce a lot of false positives.

Here are the commands I used for running LTRharvest and LTR_retriever:

$GENOMETOOLS suffixerator -db genome.fasta -indexname genome -tis -suf -lcp -des -ssp -sds -dna -memlimit 200GB

$GENOMETOOLS ltrharvest -index genome -gff3 genome.ltrharvest.gff3 -seqids yes -minlenltr 100 -maxlenltr 5000 -mindistltr 1000 -maxdistltr 20000 -similar 85 -mintsd 4 -motif tgca -motifmis 1 -overlaps best -outinner outinner/genome.ltrharvest.outinner.fasta -out genome.ltrharvest.fasta > genome.ltrharvest.out

LTR_retriever -genome genome.fasta -inharvest genome.ltrharvest.out -threads 20 1>ltrretriever.log 2>ltrretriever.err

Do the parameter values seem okay to you? What do you think about the differences in the RepeatMasker results between the LTRharvest and LTR_retriever libraries? I would appreciate any input you have. I am also happy to continue this conversation over email, but thought I would post it here initially in case it might be helpful for anyone else running the program.

Thanks so much!

Kayla
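For anyone wanting to reproduce the candidate-count comparison described above (1220 LTRharvest elements versus 41 library sequences), here is a minimal shell sketch. The function names are hypothetical, and the feature type `LTR_retrotransposon` is an assumption about how LTRharvest labels whole elements in its GFF3 output; check column 3 of your own file before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of LTRharvest or LTR_retriever.

# Count GFF3 features of one type (column 3), skipping comment lines.
# usage: count_gff3_type TYPE FILE
count_gff3_type() {
    awk -F'\t' -v want="$1" '!/^#/ && $3 == want { n++ } END { print n + 0 }' "$2"
}

# Count sequences in a FASTA file by counting header lines.
# usage: count_fasta_records FILE
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Example with the file names used in this thread:
#   count_gff3_type LTR_retrotransposon genome.ltrharvest.gff3
#   count_fasta_records genome.fasta.LTRlib.fa
```

Note that the two numbers are not directly comparable: the first counts candidate elements in the genome, while the second counts non-redundant library sequences.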

oushujun commented 6 years ago

Hello Kayla,

Thank you for the descriptions. Your parameters seem OK to me. Please check and try the following things:

  1. How many intact LTR-RTs did LTR_retriever find? They are all listed in the genome.fasta.pass.file.

  2. Do you see sequences with the naming pattern "LTR#LTR" in the genome.fasta.LTRlib.fa file? If not, I am sorry, this is a bug I recently identified. It is fixed in the current version, so please update your LTR_retriever and try again.

  3. If the library is very small (i.e., less than 5 Mb), try using the redundant version as the library for RepeatMasker: genome.fasta.LTRlib.redundant.fa. You may gain slightly higher masking.

  4. You can add the LTR_finder output as one more input source for LTR_retriever. In practice, we get higher sensitivity with combined inputs. Please refer to the manual for the parameters to run LTR_finder.

  5. In some cases, if the genome assembly quality is very low, only a few intact LTR-RTs can be confidently identified, which results in a small and incomplete LTR library. We developed a new metric to evaluate the assembly of repeat sequences, called the LTR Assembly Index (LAI). Can you check the genome.fasta.out.LAI file and look at the LAI on the second line (whole_genome)? LAI < 5 indicates draft quality.

Please let me know if you have more questions.

Best, Shujun
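The checks in items 1, 2, 3 and 5 above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the pass file is assumed to hold one entry per line with optional `#` comment lines, the LAI file is assumed to be tab-delimited with a header row containing a column named `LAI`, and "5 Mb" is interpreted here as total bases in the library. Inspect your own output files before trusting any of these layouts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the file layouts are assumptions, not documented formats.

# Item 1: count entries, ignoring blank lines and lines starting with '#'.
count_entries() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Item 2: count FASTA headers that contain a literal string such as "LTR#LTR".
# usage: count_headers_matching STRING FASTA
count_headers_matching() {
    awk -v pat="$1" '/^>/ && index($0, pat) { n++ } END { print n + 0 }' "$2"
}

# Item 3: total number of bases in a FASTA file (header lines excluded).
fasta_total_bp() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Item 5: print the LAI value from the row whose first column is "whole_genome".
# The column is located by its header name (assumed to be exactly "LAI").
whole_genome_lai() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "LAI") col = i; next }
        $1 == "whole_genome" && col { print $col; exit }
    ' "$1"
}

# Example with the file names used in this thread:
#   count_entries genome.fasta.pass.file
#   count_headers_matching 'LTR#LTR' genome.fasta.LTRlib.fa
#   fasta_total_bp genome.fasta.LTRlib.fa
#   whole_genome_lai genome.fasta.out.LAI
```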

kaylahardwick commented 6 years ago

Thanks for your response! The genome.fasta.pass.file has just 5 entries in it. It looks like there are 20 entries with the naming pattern "LTR#LTR" in the genome.fasta.LTRlib.fa. The whole genome LAI is listed as 0.1, and there are only 5 entries in the file where the "Intact" column is not equal to 0. Our current draft assembly does have very low contiguity, with an N50 value around 3 kb. So the most likely issue is that our assembly is too fragmented in its current state to use with LTR_retriever?

oushujun commented 6 years ago

Hello Kayla,

I am sorry, but this is likely the case. Contigs of about 3 kb are too short to span a typical LTR element, which has a mean length of about 5 kb. You may try LTR_finder, which to my knowledge has the second-highest performance and is relatively flexible. Let me know if you need further help.

Best, Shujun
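To gauge in advance whether an assembly is contiguous enough for intact elements to be found, one rough approach is to compute the N50 and the share of bases sitting in contigs at least as long as a typical element. The sketch below assumes the 5 kb mean length mentioned above as the cutoff; the function names are hypothetical and this is not part of LTR_retriever.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a quick contiguity check.

# Print the length of every sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { gsub(/[ \t\r]/, ""); n += length($0) }
         END { if (seen) print n }' "$1"
}

# Read lengths on stdin and print the N50: the length L such that contigs
# of length >= L hold at least half of all bases.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Read lengths on stdin and print the percentage of bases found in
# contigs of at least MIN_LEN. usage: pct_bases_in_long_contigs MIN_LEN
pct_bases_in_long_contigs() {
    awk -v min="$1" '
        { total += $1; if ($1 >= min) kept += $1 }
        END { if (total) printf "%.1f\n", 100 * kept / total; else print "0.0" }'
}

# Example with the file name used in this thread:
#   fasta_lengths genome.fasta | n50
#   fasta_lengths genome.fasta | pct_bases_in_long_contigs 5000
```

A low percentage from the second command suggests that few contigs can contain a complete element, which is consistent with the very small pass list reported here.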