sestaton / Transposome

A toolkit for annotation of transposable element families from unassembled sequence reads
http://sestaton.github.io/Transposome
MIT License

Input allvall_blast.bln file read as empty by PairFinder #32

Closed eagerowl120 closed 8 years ago

eagerowl120 commented 8 years ago

Hello, I am analysing 100,000 paired reads from plants on an HP workstation running the most recent version of Ubuntu. Although I can generate an allvall_blast.bln file after approximately four hours, I cannot subsequently generate the output files from PairFinder. We originally thought this was a RAM issue and switched to using the 0 option for in_memory, but we had the same result. Any suggestions about how to resolve this would be greatly welcomed. I am attaching my configuration file (renamed to .txt so it could be uploaded) and have included below both the terminal output and a sample of our output allvall_blast.bln file.

transposome_config.txt

Terminal output:

michael@michael-HP-xw6600-Workstation:/media/michael/DATA/Transposome_workspace$ transposome --analysis findpairs --config transposome_config.yml --blastdb ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln
INFO - ======== Transposome version: 0.09.8 (started at: 29-11-2015 18:15:24) ========
INFO - Configuration - Log file for monitoring progress and errors: t_log.txt
INFO - Configuration - Sequence file: ICC16207.P7_ACAGTG.001.interlaced_100k.fasta
INFO - Configuration - Sequence format: fasta
INFO - Configuration - Sequence number for each BLAST process: 100000
INFO - Configuration - Number of CPUs per thread: 2
INFO - Configuration - Number of threads: 3
INFO - Configuration - Output directory: transposome_results_out
INFO - Configuration - In-memory analysis: 0
INFO - Configuration - Percent identity for matches: 90
INFO - Configuration - Fraction coverage for pairwise matches: 0.55
INFO - Configuration - Merge threshold for clusters: 500
INFO - Configuration - Minimum cluster size for annotation: 100
INFO - Configuration - BLAST e-value threshold for annotation: 10
INFO - Configuration - Repeat database for annotation: plantDatabase.fasta
INFO - Configuration - Log file for clustering/merging results: t_cluster_report.txt
ERROR - There seems to be no content in the input file. Check the blast results and try again. Exiting.

A sample of my allvall_blast.bln output:

HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 125 56 HISEQ2500:144:C5N2CANXX:4:1105:6542:86366 125 56 125 87.32 205 4e-12 -
HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 73 2 HISEQ2500:144:C5N2CANXX:4:1105:6542:86366 125 8 79 89.04 213 1e-15 -
HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 92 1 HISEQ2500:144:C5N2CANXX:4:2316:5931:81662 125 19 110 95.65 317 4e-37 -
HISEQ2500:144:C5N2CANXX:4:1105:6542:86366 125 8 95 HISEQ2500:144:C5N2CANXX:4:2316:5931:81662 125 38 125 88.76 262 2e-20 +
HISEQ2500:144:C5N2CANXX:4:1105:6542:86366 125 38 125 HISEQ2500:144:C5N2CANXX:4:1309:5221:23094 125 2 89 86.36 254 1e-15 +
HISEQ2500:144:C5N2CANXX:4:2316:5931:81662 125 1 79 HISEQ2500:144:C5N2CANXX:4:1309:5221:23094 125 35 113 86.25 213 1e-12 +
HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 118 1 HISEQ2500:144:C5N2CANXX:4:2207:1639:44485 125 1 118 93.28 404 4e-43 -
HISEQ2500:144:C5N2CANXX:4:2316:5931:81662 125 1 117 HISEQ2500:144:C5N2CANXX:4:2207:1639:44485 125 9 125 91.53 385 1e-37 +
HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 1 118 HISEQ2500:144:C5N2CANXX:4:1313:3868:5494 125 8 125 89.92 373 1e-33 +
HISEQ2500:144:C5N2CANXX:4:1105:6542:86366 125 125 75 HISEQ2500:144:C5N2CANXX:4:1313:3868:5494 125 63 113 90.38 149 3e-10 -
HISEQ2500:144:C5N2CANXX:4:2316:5931:81662 125 117 3 HISEQ2500:144:C5N2CANXX:4:1313:3868:5494 125 1 115 93.91 400 1e-43 -
HISEQ2500:144:C5N2CANXX:4:2207:1639:44485 125 125 1 HISEQ2500:144:C5N2CANXX:4:1313:3868:5494 125 1 125 89.60 395 2e-35 -
HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 1 79 HISEQ2500:144:C5N2CANXX:4:1211:16752:73066 125 22 100 91.25 232 3e-22 +
HISEQ2500:144:C5N2CANXX:4:1105:6542:86366 125 88 4 HISEQ2500:144:C5N2CANXX:4:1211:16752:73066 125 14 98 89.53 249 5e-21 -
HISEQ2500:144:C5N2CANXX:4:2316:5931:81662 125 118 32 HISEQ2500:144:C5N2CANXX:4:1211:16752:73066 125 14 100 93.10 267 2e-29 -
HISEQ2500:144:C5N2CANXX:4:2207:1639:44485 125 125 40 HISEQ2500:144:C5N2CANXX:4:1211:16752:73066 125 15 100 91.95 260 2e-26 -
HISEQ2500:144:C5N2CANXX:4:1305:13667:30146 125 73 1 HISEQ2500:144:C5N2CANXX:4:2215:12889:76209 125 6
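
Each complete record in the sample has 12 whitespace-separated fields, while the last line shown is cut off. As a rough sanity check (it would not by itself explain the "no content" error), the number of lines that deviate from that layout could be counted as sketched here. The helper name `count_bad_lines` is hypothetical, and the 12-field layout is an assumption read off the sample rather than taken from the Transposome documentation.

```shell
# count_bad_lines FILE [N]
# Hypothetical helper: print how many lines in FILE do not have exactly
# N whitespace-separated fields (N defaults to 12, as seen in the sample).
count_bad_lines() {
    awk -v n="${2:-12}" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Example call on the real file (slow for a ~50 GB file):
#   count_bad_lines ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln
```

A non-zero count would only suggest a truncated or malformed file; it says nothing about whether PairFinder can find the file in the first place.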

sestaton commented 8 years ago

Just to be certain, can you confirm that you are looking at the same blast files? You show a sample of the file "allvall_blast.bln" but the command line shows "ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln" being used.
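
One way to confirm this from the shell, as a minimal sketch: check that the exact path passed to --blastdb exists and is non-empty when seen from the directory where transposome is run. The helper name `check_blast_file` is made up for illustration and is not part of Transposome.

```shell
# check_blast_file FILE
# Hypothetical helper: report whether FILE is missing, empty, or has content.
check_blast_file() {
    f="$1"
    if [ ! -e "$f" ]; then
        echo "missing: $f"
        return 1
    elif [ ! -s "$f" ]; then
        echo "empty: $f"
        return 1
    fi
    # wc -c prints the size in bytes; tr strips padding some systems add
    echo "ok: $f ($(wc -c < "$f" | tr -d ' ') bytes)"
}

# Example call, run from the same working directory as the transposome command:
#   check_blast_file ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln
```

If this prints "missing" or "empty" for the name used on the command line, the file being inspected is not the one the program is reading.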

eagerowl120 commented 8 years ago

Hi Evan,

Sorry for the confusion. The output from my blast analysis was ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln and I used that as my input for the PairFinder analysis. The sample I posted is from that file.

sestaton commented 8 years ago

Do you have a way to share the sequences or blast file? I'd like to run it with the same conditions and see what happens. Or, if you tell me the size of the files I can create a place to put the data.

eagerowl120 commented 8 years ago

Hi Evan,

The blast file is 52.9 GB. That would be great; please let me know what you need me to do.

sestaton commented 8 years ago

You can transfer the file with the following method:

sftp eagerowl120@45.55.25.151

When prompted for the password, use issue32. Then, upload the file with put file where file is the filename. I'll try to test it right away.

sestaton commented 8 years ago

Also, let me know when you get the file uploaded because this is a paid cloud image I'm using for this issue, which is not so cheap, so I don't want to keep it around too long. Thanks.

eagerowl120 commented 8 years ago

Hi Evan, I was able to set up the sftp but the password did not work, permission was denied.

Michael Lough-Stevens Intended M.S. Biology, Villanova 2013-2015 B.A. Integrative Biology, UC Berkeley 2011

sestaton commented 8 years ago

My bad, it should work now.

eagerowl120 commented 8 years ago

Hi Evan,

It's uploading now, and should be done in approximately 6 hours.

Michael

sestaton commented 8 years ago

I see. It would be better to compress the file before uploading. Please stop the transfer, compress the file locally (with bzip2 file.bln), and upload the resulting .bz2 file. It will take a long time to compress but a fraction of the time to transfer, so it will probably take less time overall and be easier to work with.
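
As a rough sketch of that step: `bzip2 -k` keeps the original file next to the compressed copy, and `bzip2 -t` checks the archive before it is uploaded. The wrapper name `compress_keep` is hypothetical, and keeping the original is an extra precaution (it needs enough free disk space for both files), not something required above.

```shell
# compress_keep FILE
# Hypothetical wrapper: compress FILE to FILE.bz2, keep the original (-k),
# then test the integrity of the compressed copy (-t).
compress_keep() {
    bzip2 -k "$1" && bzip2 -t "$1.bz2"
}

# Example call:
#   compress_keep ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln
```

The resulting .bz2 file can then be sent with put inside the sftp session, as described earlier.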

eagerowl120 commented 8 years ago

I stopped the transfer, compressing now and then I'll send it to you

eagerowl120 commented 8 years ago

Hi Evan,

I have just uploaded my .bz2 file to your cloud. Thank you very much!

Michael

sestaton commented 8 years ago

Hi Michael,

The only explanation I can think of is a problem with your command line. That error is definitely from the PairFinder module, but I don't get the same error with your blast file. What I did was follow the installation instructions for Ubuntu (I'm using 15.04):

apt-get install -y build-essential lib32z1 git ncbi-blast+ curl
curl -L cpanmin.us | perl - git://github.com/sestaton/Transposome.git

Get the config file:

curl -L tr.im/transposomeconfig > transposome_config.yml 

Edit the file to set "in_memory: 0" as below:

## For more information about this file, see: 
## https://github.com/sestaton/Transposome/wiki/Specifications-and-example-usage.
blast_input:
  - sequence_file:      t_reads.fas
  - sequence_format:    fasta
  - thread:             2
  - output_directory:   transposome_results_out
clustering_options:
  - in_memory:          0
  - percent_identity:   90
  - fraction_coverage:  0.55
annotation_input:
  - repeat_database:    repeats.fas
annotation_options:
  - cluster_size:       100
output:
  - run_log_file:       t_log.txt
  - cluster_log_file:   t_cluster_report.txt

Note that the repeat database and sequence file are not used in this step, so I just created files with those names containing a couple of lines of made-up data. Otherwise, you will get an error if a field is blank or the file is empty.
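
A minimal sketch of creating such placeholder files: the file names match the config above, while the helper name `make_placeholder_fasta`, the sequence IDs, and the bases are made up for illustration.

```shell
# make_placeholder_fasta FILE ID
# Hypothetical helper: write a one-record FASTA file with made-up sequence
# data, so the config field points at a file that exists and is not empty.
make_placeholder_fasta() {
    printf '>%s\nACGTACGTACGT\n' "$2" > "$1"
}

# Create the two files named in the config, in the current directory
make_placeholder_fasta t_reads.fas read1
make_placeholder_fasta repeats.fas repeat1
```

These files only satisfy the config checks for the findpairs step; real data is still needed for the full analysis.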

Next, I ran the command:

nohup time transposome --analysis findpairs --config transposome_config.yml --blastdb ICC16207.P7_ACAGTG.001.interlaced_100k_allvall_blast.bln.bz2 2>&1 > transp.out &

And here is output so far:

INFO - ======== Transposome version: 0.09.8 (started at: 01-12-2015 17:37:10) ========
INFO - Configuration - Log file for monitoring progress and errors: t_log.txt
INFO - Configuration - Sequence file:                               t_reads.fas
INFO - Configuration - Sequence format:                             fasta
INFO - Configuration - Sequence number for each BLAST process:      50000
INFO - Configuration - Number of CPUs per thread:                   1
INFO - Configuration - Number of threads:                           2
INFO - Configuration - Output directory:                            transposome_results_out
INFO - Configuration - In-memory analysis:                          0
INFO - Configuration - Percent identity for matches:                90
INFO - Configuration - Fraction coverage for pairwise matches:      0.55
INFO - Configuration - Merge threshold for clusters:                0.001
INFO - Configuration - Minimum cluster size for annotation:         100
INFO - Configuration - BLAST e-value threshold for annotation:      10
INFO - Configuration - Repeat database for annotation:              repeats.fas
INFO - Configuration - Log file for clustering/merging results:     t_cluster_report.txt
INFO - Transposome::PairFinder::parse_blast started at:   01-12-2015 17:37:10.

As you can see, there is no error and the log indicates that the parse_blast method is running (which I can also see with 'top'). I'll let it run for a little while, but I don't see any reason it will fail as your command did. Your failure happened when the class object was created, long before the parsing started, as you can see in your log output (there is no 'parse_blast started' message). Can you double-check your files and commands and try again?
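
To tell the two cases apart in a captured log such as transp.out, a small check along these lines might help. It is only a sketch: `log_status` is a hypothetical helper, and it relies solely on the message strings visible in the two logs in this thread.

```shell
# log_status FILE
# Hypothetical helper: classify a captured Transposome log using the
# message strings shown in this thread.
log_status() {
    if grep -q 'ERROR' "$1"; then
        echo "error"          # e.g. the "no content in the input file" failure
    elif grep -q 'parse_blast started' "$1"; then
        echo "parsing"        # PairFinder reached the parsing stage
    else
        echo "not started"    # only configuration lines so far
    fi
}

# Example call:
#   log_status transp.out
```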

Thanks, Evan

sestaton commented 8 years ago

Any updates? I'd like to close this old issue since it seems like it might have been resolved.

sestaton commented 8 years ago

Closing now. Please comment if there is any new information.