oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identifying LTR retrotransposons. The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

BLAST engine error and uninitialized values #3

Closed by angwg 7 years ago

angwg commented 7 years ago

I've been trying to run LTR_retriever on my data with the following command:

LTR_retriever -genome c.fa -inharvest c.harvest.scn

I'm using a freshly cloned copy of LTR_retriever (as of 6/4/2017) and have ensured that all files and folders have read/write permissions.

While the program runs to completion, several errors show up and no LTRs are found.

The following errors occur multiple times, and the number of occurrences differs from run to run.

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: Warning: Sequence contains no data

Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
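
Because the number of these messages varies between runs, it can help to tally them per run and compare the counts. The following is a minimal sketch and not part of LTR_retriever: the function name `tally_messages` and the category labels are made up for illustration, the patterns cover only the four message types quoted above, and it assumes each run's terminal output was saved to a text file.

```python
import re
import sys
from collections import Counter

# Patterns copied from the messages quoted above; other lines are ignored.
# The category labels on the left are hypothetical names for this sketch.
PATTERNS = {
    "blast_exception": re.compile(r"^BLAST engine error: NCBI C\+\+ Exception"),
    "blast_warning": re.compile(r"^BLAST engine error: Warning"),
    "perl_uninitialized": re.compile(r"^Use of uninitialized value"),
    "perl_non_numeric": re.compile(r"^Argument .* isn't numeric"),
}


def tally_messages(log_text):
    """Count how many lines of a saved log fall into each message category."""
    counts = Counter()
    for line in log_text.splitlines():
        stripped = line.strip()
        for name, pattern in PATTERNS.items():
            if pattern.search(stripped):
                counts[name] += 1
                break  # a line belongs to at most one category
    return counts


if __name__ == "__main__":
    # Usage (assumed file names): python tally.py run1.log run2.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            print(path, dict(tally_messages(handle.read())))
```

Comparing the per-category counts of two runs shows whether the BLAST messages and the Perl warnings vary together or independently.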

It is strange that the program finds no LTRs, since a significant portion of this genome is composed of transposons. I'm not sure whether something went wrong or whether there really is nothing there. I've also run LTR_retriever with additional inputs, but the result is the same. Are there any sample datasets I could use to confirm that the program is working properly?

Here's the entire output that the program gives:

##########################
### LTR_retriever v1.1 ###
##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome c.fa -inharvest c.harvest.scn

Tue Jun  6 12:45:12 EDT 2017    The longest sequence ID in the genome contains 117 characters, which is longer than the limit (15)
                                Trying to reformat seq IDs...
                                Attempt 1...
Tue Jun  6 12:45:15 EDT 2017    Seq ID conversion successful!

Tue Jun  6 12:45:15 EDT 2017    Start to convert inputs...
                                Total candidates: 5597
                                Total uniq candidates: 5597

Tue Jun  6 12:45:19 EDT 2017    Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Tue Jun  6 12:49:30 EDT 2017    4654 clean candidates remained

Tue Jun  6 12:49:30 EDT 2017    Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: Warning: Sequence contains no data
BLAST engine error: Warning: Sequence contains no data
BLAST engine error: Warning: Sequence contains no data
Tue Jun  6 13:00:13 EDT 2017    Intact LTR found: 1400

Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Tue Jun  6 13:00:58 EDT 2017    Start to analyze truncated LTRs...
                                Truncated LTRs without the intact version will be retained in the LTR library.
                                Use -notrunc if you don't want to keep them.

Tue Jun  6 13:00:58 EDT 2017    710 truncated LTRs found
Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Tue Jun  6 13:04:47 EDT 2017    123 truncated LTR sequences have added to the library

Tue Jun  6 13:04:47 EDT 2017    Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1092
sort: multi-character tab ‘$\t’
ERROR: LOC list is empty.
Tue Jun  6 13:10:39 EDT 2017    Retained clean sequence: 0

Tue Jun  6 13:10:39 EDT 2017    Sequence clustering for c.fa.mod.ltrTE ...
ERROR: c.fa.mod.ltrTE is empty, please check the last file
Tue Jun  6 13:10:39 EDT 2017    Unique lib sequence: 0

Use of uninitialized value $rLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Use of uninitialized value $lLTR_length in numeric ge (>=) at /usr/local/LTR_retriever/bin/get_range.pl line 125, <TBL> line 5115.
Argument "" isn't numeric in addition (+) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Use of uninitialized value $rLTR_start in subtraction (-) at /usr/local/LTR_retriever/bin/get_range.pl line 132, <TBL> line 5115.
Tue Jun  6 13:10:40 EDT 2017    No LTR was found in your data.

rm: cannot remove ‘c.fa.mod.cat.gz’: No such file or directory
rm: cannot remove ‘c.fa.mod.LTRlib.clust.clstr’: No such file or directory
rm: cannot remove ‘c.fa.mod.LTRlib’: No such file or directory
rm: cannot remove ‘c.fa.mod.LTRlib.fa.n*’: No such file or directory
rm: cannot remove ‘c.fa.mod.nmtf’: No such file or directory
Tue Jun  6 13:10:40 EDT 2017    All analyses were finished!

##############################
####### Result files #########
##############################

Table output for intact LTRs (detailed info)
        c.fa.mod.pass.list (All LTRs)
        c.fa.mod.nmtf.pass.list (Non-TGCA LTRs)

LTR library
        c.fa.mod.LTRlib.fa (All non-redundant LTRs)
        c.fa.mod.nmtf.LTRlib.fa (Non-TGCA LTRs)

GFF3 file for intact LTRs
        c.fa.mod.pass.list.gff3

Whole-genome LTR annotation (GFF)
        c.fa.mod.LTRanno.gff

oushujun commented 7 years ago

Dear George,

Something is still wrong in recognizing the sequence IDs. Could you please send me several lines from the file "c.harvest.scn" and several lines from "c.fa.mod.ltrTE.stg1"? For the second file, you may need to rerun LTR_retriever with the -v parameter added. A listing of all the files would also help me find out which step went wrong; simply pasting the stdout of the following command will do: ls -l c.fa*

Thanks, Shujun
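
When reading an `ls -l c.fa*` listing like the one requested above, zero-byte files are a quick hint at the step where output stopped being produced. Below is a minimal sketch of that check. The helper name `find_empty_files` is hypothetical, and the sketch only reports file sizes; it does not know which LTR_retriever step writes which file.

```python
import glob
import os


def find_empty_files(pattern):
    """Return the sorted paths matching `pattern` that are zero bytes long."""
    return sorted(
        path
        for path in glob.glob(pattern)
        if os.path.isfile(path) and os.path.getsize(path) == 0
    )


if __name__ == "__main__":
    # Same glob as the `ls -l c.fa*` command, run in the working directory.
    for path in find_empty_files("c.fa*"):
        print(path)
```

Sorting the empty files by modification time afterwards would indicate which one was written first.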

angwg commented 7 years ago

Hi Shujun, here's the file information and truncated versions of those files. I hope this helps!

File information:

-rwxrwxrwx 1 root     root      599447795 Jun  6 11:40 c.fa
-rwxrwxrwx 1 root     root         863394 Jun  6 11:40 c.fa.des
-rwxrwxrwx 1 root     root      148163792 Jun  6 11:40 c.fa.esq
-rwxrwxrwx 1 root     root      592654030 Jun  6 11:40 c.fa.lcp
-rwxrwxrwx 1 root     root      369982096 Jun  6 11:40 c.fa.llv
-rwxrwxrwx 1 root     root         258291 Jun  6 11:40 c.fa.md5
-rw-rw-r-- 1 mcampbel mcampbel  598678341 Jun  6 15:45 c.fa.mod
-rw-rw-r-- 1 mcampbel mcampbel    1318462 Jun  6 16:01 c.fa.mod.defalse
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE.clust
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE.clust.clstr
-rw-rw-r-- 1 mcampbel mcampbel   36584656 Jun  6 15:45 c.fa.mod.ltrTE.fa
-rw-rw-r-- 1 mcampbel mcampbel      61962 Jun  6 15:49 c.fa.mod.ltrTE.fa.cleanup
-rw-rw-r-- 1 mcampbel mcampbel    4349201 Jun  6 16:02 c.fa.mod.ltrTE.mask.lib
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE.nmtf
-rw-rw-r-- 1 mcampbel mcampbel    8513337 Jun  6 16:01 c.fa.mod.ltrTE.pass
-rw-rw-r-- 1 mcampbel mcampbel     236799 Jun  6 16:02 c.fa.mod.ltrTE.pass.clust.clstr
-rw-rw-r-- 1 mcampbel mcampbel     169060 Jun  6 16:01 c.fa.mod.ltrTE.pass.list
-rw-rw-r-- 1 mcampbel mcampbel      10545 Jun  6 16:12 c.fa.mod.ltrTE.pass.nmtf.list
-rw-rw-r-- 1 mcampbel mcampbel   31292503 Jun  6 15:49 c.fa.mod.ltrTE.stg1
-rw-rw-r-- 1 mcampbel mcampbel    3469593 Jun  6 16:02 c.fa.mod.ltrTE.stg2
-rw-rw-r-- 1 mcampbel mcampbel    3668719 Jun  6 16:06 c.fa.mod.ltrTE.stg3.cln
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE.stg3.cln.clean
-rw-rw-r-- 1 mcampbel mcampbel         47 Jun  6 16:12 c.fa.mod.ltrTE.stg3.cln.clean.exclude.list
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE.stg3.cln.exclude.list
-rw-rw-r-- 1 mcampbel mcampbel     382770 Jun  6 16:12 c.fa.mod.ltrTE.stg3.dna.out
-rw-rw-r-- 1 mcampbel mcampbel     431120 Jun  6 16:09 c.fa.mod.ltrTE.stg3.line.out
-rw-rw-r-- 1 mcampbel mcampbel     813890 Jun  6 16:12 c.fa.mod.ltrTE.stg3.otherTE.out
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.ltrTE.stg3.plantP.out
-rw-rw-r-- 1 mcampbel mcampbel    4091070 Jun  6 16:02 c.fa.mod.ltrTE.trunc
-rw-rw-r-- 1 mcampbel mcampbel   21259403 Jun  6 16:06 c.fa.mod.ltrTE.trunc.cat
-rw-rw-r-- 1 mcampbel mcampbel     199126 Jun  6 16:06 c.fa.mod.ltrTE.trunc.cln
-rw-rw-r-- 1 mcampbel mcampbel      36070 Jun  6 16:02 c.fa.mod.ltrTE.trunc.list
-rw-rw-r-- 1 mcampbel mcampbel    4171210 Jun  6 16:06 c.fa.mod.ltrTE.trunc.masked
-rw-rw-r-- 1 mcampbel mcampbel      68116 Jun  6 16:06 c.fa.mod.ltrTE.trunc.masked.cleanup
-rw-rw-r-- 1 mcampbel mcampbel     443306 Jun  6 16:06 c.fa.mod.ltrTE.trunc.ori.out
-rw-rw-r-- 1 mcampbel mcampbel     487052 Jun  6 16:06 c.fa.mod.ltrTE.trunc.out
-rw-rw-r-- 1 mcampbel mcampbel       1990 Jun  6 16:06 c.fa.mod.ltrTE.trunc.tbl
-rw-rw-r-- 1 mcampbel mcampbel       4290 Jun  6 16:02 c.fa.mod.ltrTE.veryfalse
-rw-rw-r-- 1 mcampbel mcampbel     879608 Jun  6 16:02 c.fa.mod.ltrTE.veryfalse.fa
-rw-rw-r-- 1 mcampbel mcampbel      10570 Jun  6 16:02 c.fa.mod.ltrTE.veryfalse.list
-rw-rw-r-- 1 mcampbel mcampbel      10545 Jun  6 16:12 c.fa.mod.nmtf.pass.list
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.nmtf.prelib
-rw-rw-r-- 1 mcampbel mcampbel     169060 Jun  6 16:12 c.fa.mod.pass.list
-rw-rw-r-- 1 mcampbel mcampbel          0 Jun  6 16:12 c.fa.mod.prelib
-rw-rw-r-- 1 mcampbel mcampbel     773693 Jun  6 16:12 c.fa.mod.retriever.all.scn.adj
-rw-rw-r-- 1 mcampbel mcampbel      14056 Jun  6 16:12 c.fa.mod.retriever.all.scn.adj.list
-rw-rw-r-- 1 mcampbel mcampbel     398533 Jun  6 15:45 c.fa.mod.retriever.scn
-rw-rw-r-- 1 mcampbel mcampbel     773693 Jun  6 16:01 c.fa.mod.retriever.scn.adj
-rw-rw-r-- 1 mcampbel mcampbel      77124 Jun  6 16:02 c.fa.mod.retriever.scn.adj.list
-rw-rw-r-- 1 mcampbel mcampbel     235826 Jun  6 15:49 c.fa.mod.retriever.scn.extend
-rw-rw-r-- 1 mcampbel mcampbel   31753991 Jun  6 15:49 c.fa.mod.retriever.scn.extend.fa
-rw-rw-r-- 1 mcampbel mcampbel   64637030 Jun  6 15:50 c.fa.mod.retriever.scn.extend.fa.aa
-rw-rw-r-- 1 mcampbel mcampbel     370434 Jun  6 15:51 c.fa.mod.retriever.scn.extend.fa.aa.anno
-rw-rw-r-- 1 mcampbel mcampbel    6767136 Jun  6 15:51 c.fa.mod.retriever.scn.extend.fa.aa.scn
-rw-rw-r-- 1 mcampbel mcampbel    2534574 Jun  6 15:51 c.fa.mod.retriever.scn.extend.fa.aa.tbl
-rw-rw-r-- 1 mcampbel mcampbel     282640 Jun  6 15:45 c.fa.mod.retriever.scn.full
-rw-rw-r-- 1 mcampbel mcampbel     577980 Jun  6 15:45 c.fa.mod.retriever.scn.list
-rwxrwxrwx 1 root     root            482 Jun  6 11:40 c.fa.prj
-rwxrwxrwx 1 root     root          62608 Jun  6 11:40 c.fa.sds
-rwxrwxrwx 1 root     root          31312 Jun  6 11:40 c.fa.ssp
-rwxrwxrwx 1 root     root     4741232240 Jun  6 11:40 c.fa.suf

c.harvest.scn and c.fa.mod.ltrTE.stg1

c.fa.mod.ltrTE.stg1.txt c.harvest.scn.txt

oushujun commented 7 years ago

Dear George,

Thank you for attaching the files along with the file information. From my end, the file format looks OK. The core program identified 1400 intact LTRs but also complained that some candidate sequences were missing. There are two possible reasons for this error:

  1. It could be due to a mismatch of sequence names after the program's conversion. Although LTR_retriever tries to convert long sequence names, it is better to shorten the sequence names in the genome yourself to 15 characters or fewer, without spaces or special characters.
  2. It could also be caused by requesting more CPUs than are available. Please make sure you have reserved enough CPUs to cover what you specified with -threads.
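
The renaming suggested in the first point can be sketched as follows. This is an illustration under assumptions, not the conversion LTR_retriever performs internally: the function name `shorten_fasta_ids`, the `seq` prefix, and the sequential numbering scheme are all hypothetical. The sketch streams the genome line by line and writes a tab-separated table so the short names can be traced back to the original headers.

```python
import re

MAX_ID_LENGTH = 15  # the limit reported in the LTR_retriever log
SAFE_ID = re.compile(r"^[A-Za-z0-9_]+$")  # no spaces or special characters


def shorten_fasta_ids(in_path, out_path, map_path, prefix="seq"):
    """Rewrite FASTA headers as prefix1, prefix2, ... and return the count.

    The numbering scheme is an assumption made for this sketch. The table at
    `map_path` holds one "new_id<TAB>original header" line per sequence.
    """
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst, open(map_path, "w") as table:
        for line in src:
            if line.startswith(">"):
                count += 1
                new_id = f"{prefix}{count}"
                if len(new_id) > MAX_ID_LENGTH or not SAFE_ID.match(new_id):
                    raise ValueError(f"generated ID {new_id!r} breaks the naming rule")
                table.write(f"{new_id}\t{line[1:].rstrip()}\n")
                dst.write(f">{new_id}\n")
            else:
                dst.write(line)  # sequence lines are copied unchanged
    return count
```

A call such as `shorten_fasta_ids("c.fa", "c.short.fa", "c.idmap.tsv")` would use made-up output names. Any candidate file derived from the old names would presumably need to be regenerated from the renamed genome so that both inputs agree.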

Going from 1400 intact LTRs to an empty library is a separate bug, triggered when no LINE or DNA TE coding sequences are detected. I have fixed it and pushed the fix to GitHub. Please try the new code.

Best, Shujun
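
The CPU point above can be checked before launching a run. The sketch below is a rough illustration with hypothetical helper names (`available_cpus`, `threads_fit`). It reports the CPUs the current process is allowed to use, which on a shared machine or under a job scheduler can be fewer than the machine has, but it does not account for CPUs that other jobs are already keeping busy.

```python
import os


def available_cpus():
    """Number of CPUs this process may run on.

    On Linux, os.sched_getaffinity respects the CPU affinity mask of the
    process; elsewhere the machine-wide count from os.cpu_count is used.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def threads_fit(requested, available=None):
    """True if `requested` threads do not exceed the usable CPU count."""
    if available is None:
        available = available_cpus()
    return 1 <= requested <= available


if __name__ == "__main__":
    requested = 4  # the value that would be passed to -threads
    print(f"usable CPUs: {available_cpus()}, requested: {requested}, "
          f"fits: {threads_fit(requested)}")
```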

angwg commented 7 years ago

Hi Shujun, I downloaded the latest release and I'm getting a similar error. In a separate run, I also used a script to trim the sequence IDs in the genome c.fa to below 15 characters before running LTR_harvest on it, but that did not seem to change anything. I also checked the CPU usage on my machine, and there are enough CPUs available to run with the default settings (4 threads).

Here's the shell output.

##########################
### LTR_retriever v1.2 ###
##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome c.fa -inharvest c.harvest.scn

Wed Jun  7 14:38:45 EDT 2017    The longest sequence ID in the genome contains 117 characters, which is longer than the limit (15)
                                Trying to reformat seq IDs...
                                Attempt 1...
Wed Jun  7 14:38:49 EDT 2017    Seq ID conversion successful!

Wed Jun  7 14:38:49 EDT 2017    Start to convert inputs...
                                Total candidates: 5597
                                Total uniq candidates: 5597

Wed Jun  7 14:38:53 EDT 2017    Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Wed Jun  7 14:43:03 EDT 2017    4654 clean candidates remained

Wed Jun  7 14:43:03 EDT 2017    Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: Warning: Sequence contains no data
Wed Jun  7 14:54:00 EDT 2017    Intact LTR found: 1399

Wed Jun  7 14:54:46 EDT 2017    Start to analyze truncated LTRs...
                                Truncated LTRs without the intact version will be retained in the LTR library.
                                Use -notrunc if you don't want to keep them.

Wed Jun  7 14:54:46 EDT 2017    711 truncated LTRs found
Wed Jun  7 14:58:39 EDT 2017    126 truncated LTR sequences have added to the library

Wed Jun  7 14:58:39 EDT 2017    Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1093
sort: multi-character tab ‘$\t’
ERROR: LOC list is empty.
Wed Jun  7 15:04:31 EDT 2017    Retained clean sequence: 0

ERROR: 1399 intact LTRs have found, but the pre-library file c.fa.mod.ltrTE is empty.
Something is wrong at this point. Please report the bug to https://github.com/oushujun/LTR_retriever/issues
Program halt!

Shell output for the pretrimmed genome:

##########################
### LTR_retriever v1.2 ###
##########################

Contributors: Shujun Ou, Ning Jiang

Parameters: -genome ctrim.fa -inharvest c.harvest.scn

Wed Jun  7 15:12:22 EDT 2017    Start to convert inputs...
                                Total candidates: 5597
                                Total uniq candidates: 5597

Wed Jun  7 15:12:27 EDT 2017    Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Wed Jun  7 15:16:38 EDT 2017    4654 clean candidates remained

Wed Jun  7 15:16:38 EDT 2017    Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and family will be identified in this step.

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: NCBI C++ Exception:
    "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_491_130.14.22.10_9052_1363121432/c++/src/algo/blast/api/blast_setup.hpp", line 189: Error: Sequence contains no data

BLAST engine error: Warning: Sequence contains no data
Wed Jun  7 15:28:24 EDT 2017    Intact LTR found: 1401

Wed Jun  7 15:29:09 EDT 2017    Start to analyze truncated LTRs...
                                Truncated LTRs without the intact version will be retained in the LTR library.
                                Use -notrunc if you don't want to keep them.

Wed Jun  7 15:29:10 EDT 2017    710 truncated LTRs found
Wed Jun  7 15:33:00 EDT 2017    125 truncated LTR sequences have added to the library

Wed Jun  7 15:33:00 EDT 2017    Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1094
sort: multi-character tab ‘$\t’
ERROR: LOC list is empty.
Wed Jun  7 15:38:56 EDT 2017    Retained clean sequence: 0

ERROR: 1401 intact LTRs have found, but the pre-library file ctrim.fa.ltrTE is empty.
Something is wrong at this point. Please report the bug to https://github.com/oushujun/LTR_retriever/issues
Program halt!

oushujun commented 7 years ago

Dear George,

I am sorry for the trouble you are experiencing. Could you please send me the files "c.fa.mod.retriever.scn.adj" and "c.fa.mod.defalse" so that I can further determine what went wrong? You can either attach them to this thread or send them to my email. Thank you!

Best, Shujun