oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

Blastn warning: subject sequence contains no data, and final retained clean sequence: 0 #4

Closed: rimjhimroy closed this issue 7 years ago

rimjhimroy commented 7 years ago

I downloaded the latest master version of LTR_retriever today and am trying to run it on my plant genome sequence with:

LTR_retriever -genome genome.fasta -inharvest genome.ltrharvest.scn -infinder genome.finder.scn -notrunc -threads 4 -v

It runs with several warnings and finally reports "Retained clean sequence: 0".

Here are all the files it created:

-rw------- 1 rimjhim rimjhim 336732337 Jun  7 16:39 genome.fasta
-rw------- 1 rimjhim rimjhim   1221771 Jun  7 19:42 genome.fasta.retriever.scn
-rw------- 1 rimjhim rimjhim   1100599 Jun  7 19:42 genome.fasta.retriever.scn.list
-rw------- 1 rimjhim rimjhim    531646 Jun  7 19:42 genome.fasta.retriever.scn.full
-rw------- 1 rimjhim rimjhim  90860431 Jun  7 19:42 genome.fasta.ltrTE.fa
-rw------- 1 rimjhim rimjhim     88066 Jun  7 19:48 genome.fasta.ltrTE.fa.cleanup
-rw------- 1 rimjhim rimjhim  77634137 Jun  7 19:48 genome.fasta.ltrTE.stg1
-rw------- 1 rimjhim rimjhim    441393 Jun  7 19:48 genome.fasta.retriever.scn.extend
-rw------- 1 rimjhim rimjhim  78564160 Jun  7 19:48 genome.fasta.retriever.scn.extend.fa
-rw------- 1 rimjhim rimjhim 159268484 Jun  7 19:50 genome.fasta.retriever.scn.extend.fa.aa
-rw------- 1 rimjhim rimjhim   4871577 Jun  7 19:51 genome.fasta.retriever.scn.extend.fa.aa.tbl
-rw------- 1 rimjhim rimjhim  12960211 Jun  7 19:51 genome.fasta.retriever.scn.extend.fa.aa.scn
-rw------- 1 rimjhim rimjhim    761839 Jun  7 19:51 genome.fasta.retriever.scn.extend.fa.aa.anno
-rw------- 1 rimjhim rimjhim   2850208 Jun  7 20:14 genome.fasta.defalse
-rw------- 1 rimjhim rimjhim   1775029 Jun  7 20:14 genome.fasta.retriever.scn.adj
-rw------- 1 rimjhim rimjhim    354229 Jun  7 20:14 genome.fasta.ltrTE.pass.list
-rw------- 1 rimjhim rimjhim  18232267 Jun  7 20:14 genome.fasta.ltrTE.pass
-rw------- 1 rimjhim rimjhim    399189 Jun  7 20:16 genome.fasta.ltrTE.pass.clust.clstr
-rw------- 1 rimjhim rimjhim   9189982 Jun  7 20:16 genome.fasta.ltrTE.stg2
-rw------- 1 rimjhim rimjhim     41920 Jun  7 20:16 genome.fasta.ltrTE.trunc.list
-rw------- 1 rimjhim rimjhim     94880 Jun  7 20:16 genome.fasta.retriever.scn.adj.list
-rw------- 1 rimjhim rimjhim   7174609 Jun  7 20:16 genome.fasta.ltrTE.trunc
-rw------- 1 rimjhim rimjhim     74082 Jun  7 20:16 genome.fasta.ltrTE.veryfalse.list
-rw------- 1 rimjhim rimjhim     25416 Jun  7 20:16 genome.fasta.ltrTE.veryfalse
-rw------- 1 rimjhim rimjhim   4941076 Jun  7 20:16 genome.fasta.ltrTE.veryfalse.fa
-rw------- 1 rimjhim rimjhim  14131058 Jun  7 20:16 genome.fasta.ltrTE.mask.lib
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:16 genome.fasta.ltrTE.trunc.cln
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:16 genome.fasta.ltrTE.trunc.masked.cleanup
-rw------- 1 rimjhim rimjhim   9189982 Jun  7 20:16 genome.fasta.ltrTE.stg3.cln
-rw------- 1 rimjhim rimjhim    927563 Jun  7 20:19 genome.fasta.ltrTE.stg3.line.out
-rw------- 1 rimjhim rimjhim    865686 Jun  7 20:22 genome.fasta.ltrTE.stg3.dna.out
-rw------- 1 rimjhim rimjhim   1793249 Jun  7 20:22 genome.fasta.ltrTE.stg3.otherTE.out
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE.stg3.cln.exclude.list
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE.stg3.cln.clean
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE.stg3.plantP.out
-rw------- 1 rimjhim rimjhim        47 Jun  7 20:22 genome.fasta.ltrTE.stg3.cln.clean.exclude.list
-rw------- 1 rimjhim rimjhim         0 Jun  7 20:22 genome.fasta.ltrTE
-rw------- 1 rimjhim rimjhim      7822 Jun  7 20:22 genome.fasta.ltrTE.pass.nmtf.list
-rw------- 1 rimjhim rimjhim      7822 Jun  7 20:22 genome.fasta.nmtf.pass.list
-rw------- 1 rimjhim rimjhim    354229 Jun  7 20:22 genome.fasta.pass.list
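
The listing above already shows several zero-byte outputs (for example genome.fasta.ltrTE.trunc.cln and genome.fasta.ltrTE), which hints at the step where the pipeline stopped producing data. A minimal sketch for listing such files automatically follows; the helper name `empty_outputs` is hypothetical and not part of LTR_retriever.

```python
from pathlib import Path


def empty_outputs(workdir, prefix):
    """Return the sorted names of zero-byte files in workdir that start with prefix."""
    return sorted(
        p.name
        for p in Path(workdir).iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.stat().st_size == 0
    )
```

Calling `empty_outputs(".", "genome.fasta")` in the run directory would return the empty intermediate files. Comparing their timestamps with the log is then a manual step.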

And this is the output I got.

##########################
### LTR_retriever v1.2 ###
##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome genome.fasta -inharvest genome.ltrharvest.scn -infinder genome.finder.scn -threads 4 -v

Previous LTR_retriever results found, backed up to LTRretriever-pre06-07-17_1942

Mit Jun  7 19:42:12 CEST 2017   Start to convert inputs...
                Total candidates: 12077
                Total uniq candidates: 11267

Mit Jun  7 19:42:16 CEST 2017   Start to clean up candidates...
                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                Sequences containing tandem repeats will be discarded.

Mit Jun  7 19:48:24 CEST 2017   9370 clean candidates remained

Mit Jun  7 19:48:24 CEST 2017   Start to analyze the structure of candidates...
                The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr5:12326794..12333995|chr5:12326844..12333945: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr6:21651306..21669448|chr6:21651356..21669398: Subject sequence contians no data
Warning: [blastn] Subject_1 chr6:22457268..22468094|chr6:22457318..22468044: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr6:33073266..33087676|chr6:33073316..33087626: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr7:20294844..20300297|chr7:20294894..20300247: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr7:25108611..25136833[1]: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr7:36106845..36124957|chr7:36106895..36124907: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 chr8:4633806..4645912|chr8:4633856..4645862: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
BLAST engine error: Warning: Sequence contains no data 
Warning: [blastn] Subject_1 scaffold286:18245..31457|scaffold286:18295..31407: Subject sequence contians no data
BLAST engine error: Warning: Sequence contains no data 
Mit Jun  7 20:14:11 CEST 2017   Intact LTR found: 2670

Mit Jun  7 20:16:48 CEST 2017   Start to analyze truncated LTRs...
                Truncated LTRs without the intact version will be retained in the LTR library.
                Use -notrunc if you don't want to keep them.

Mit Jun  7 20:16:48 CEST 2017   884 truncated LTRs found
ERROR: No such file or directory at /home/rimjhim/Softwares/LTR_retriever-master/bin/cleanup.pl line 50.
Mit Jun  7 20:16:54 CEST 2017   0 truncated LTR sequences have added to the library

Mit Jun  7 20:16:54 CEST 2017   Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                Total library sequences: 1865
sort: multi-character tab ‘$\t’
ERROR: LOC list is empty.
Warning: [blastx] Query is Empty!
Mit Jun  7 20:22:07 CEST 2017   Retained clean sequence: 0

ERROR: 2670 intact LTRs have found, but the pre-library file genome.fasta.ltrTE is empty.
Something is wrong at this point. Please report the bug to https://github.com/oushujun/LTR_retriever/issues
Program halt!

I am not sure why it is complaining that subject sequence contains no data and why the file genome.fasta.ltrTE is empty. Any suggestions?
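
To see which candidates triggered the blastn warnings, the IDs can be pulled out of the log text. This is only a sketch: it assumes the run's output was captured as text, and the pattern is based solely on the warning lines shown above (the function name `flagged_ids` is hypothetical).

```python
import re

# Matches lines such as:
# Warning: [blastn] Subject_1 chr7:25108611..25136833[1]: Subject sequence contians no data
# Only the fixed prefix and the words "Subject sequence" are matched, so the
# misspelling in the tail of the log message does not matter.
PATTERN = re.compile(r"Warning: \[blastn\] Subject_\d+ (\S+): Subject sequence")


def flagged_ids(log_text):
    """Return the candidate IDs named in blastn warnings, in order, without duplicates."""
    ids = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids
```

The resulting list can then be compared against the candidate FASTA file to check whether those records are really empty.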

oushujun commented 7 years ago

Hi,

There is an open issue reporting the same bug, and I am trying to reproduce it. Please tell me which versions of the dependent programs you are using. Thank you!

Regards, Shujun

rimjhimroy commented 7 years ago

Hi, thank you very much for your reply. Sorry, I should have commented on the open issue, but I thought the warning about the subject sequence was a bit different in my case. The versions of the dependent programs are:

BLAST+: 2.2.31+
HMMER: 3.1b2
CDHIT: 4.6
RepeatMasker: 4.0.5

oushujun commented 7 years ago

Hi,

Thanks for the information. There are two things I need you to check. Run less genome.fasta.retriever.scn.extend.fa, then search for the following sequences:

chr5:12326794..12333995|chr5:12326844..12333945
chr6:21651306..21669448|chr6:21651356..21669398
chr7:25108611..25136833[1]

Please see whether these sequences are really empty or look just like the others.
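
As an alternative to searching by hand in less, the sequence check could be scripted. The sketch below is not part of LTR_retriever; it assumes that the first whitespace-delimited token of each FASTA header is the same ID that blastn prints in its warning, and the name `sequence_lengths` is hypothetical.

```python
def sequence_lengths(fasta_path, wanted_ids):
    """Map each wanted ID found in the FASTA file to its number of sequence characters.

    IDs that never appear as a header are simply absent from the result.
    """
    wanted = set(wanted_ids)
    lengths = {}
    current = None  # ID of the record being read, or None if it is not wanted
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                tokens = line[1:].split()
                name = tokens[0] if tokens else ""
                current = name if name in wanted else None
                if current is not None:
                    lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line.strip())
    return lengths
```

A length of 0 for one of the three IDs would mean the record is present but empty. A missing key would mean no header matched under the assumption above.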

Please also check with your sysadmin whether RepeatMasker was installed with HMMER as the primary search engine. If so, you should use rmblast instead. You can do a test run with:

RepeatMasker -q -pa 4 -no_is -norna -nolow -div 40 -lib genome.fasta.ltrTE.mask.lib -cutoff 225 genome.fasta.ltrTE.trunc

Thank you!

Regards, Shujun

oushujun commented 7 years ago

Hi,

Good observations! Yes, LTR_retriever uses an extended version of the LTRharvest output format. The beginning of each line holds the coordinate information. LTR_retriever uses this information to generate extra details about the candidate and appends them to the end of the line. Lines that remain in the LTRharvest-like format probably did not pass the initial screening, so no further analysis was done on them. I kept them for later checking.

Best, Shujun

On Thu, Jun 8, 2017 at 5:19 AM, rimjhimroy notifications@github.com wrote:

I have also noticed that many sequences in the file genome.fasta.retriever.scn.adj are still in an LTRharvest-like format, such as:

10 4550 4541 10 1974 1965 2553 4550 1998 97.90 12
129 12354 12226 129 2120 1992 10359 12354 1996 98.00 408

There are also a few lines in the format:

10069 31145 21077 10069 11357 1289 29870 31145 1276 0.947 scaffold_316 scaffold_316 - CATA 10065..10068, 31146..31149 TG,TG,CA,CA
10090 18546 8457 10090 10517 428 18110 18546 437 0.922 scaffold_446 scaffold_446 + AGTG 10086..10089, 18547..18550 TG,TG,CA,CA

Whereas other lines are in the format:

49618769 49624184 5416 49618769 49619172 404 49623781 49624184 404 0.9926 chr8 chr8 - GAGAA 49618764..49618768 49624185..49624189 TGCA LTR Gypsy 286029
1 10261 10261 1 763 763 9500 10261 762 0.9382 scaffold_586 scaffold_586 - GGTT -3..0 10262..10265 TGCA LTR Gypsy 2480588

I am not sure the program can handle such different row types in this file.
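
To get an overview of how many rows of each shape a file like genome.fasta.retriever.scn.adj contains, the rows can be grouped by their number of whitespace-separated fields. In the examples quoted above, the three shapes have 11, 17 and 20 fields. These counts come only from those examples, not from a format specification, and the helper name `row_shapes` is hypothetical.

```python
from collections import Counter


def row_shapes(lines):
    """Count rows by their number of whitespace-separated fields.

    Blank lines and lines starting with '#' are skipped.
    """
    shapes = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        shapes[len(line.split())] += 1
    return shapes
```

Passing an open file handle, for example `row_shapes(open("genome.fasta.retriever.scn.adj"))`, would show how many candidates stayed in the shorter LTRharvest-like form.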


oushujun commented 7 years ago

Hi,

I have fixed several bugs and pushed the fixes as the latest version (v1.3). Please update your program and rerun the analysis, and let me know if you run into further issues.

Thanks, Shujun Ou