jts / nanocorrect

Experimental pipeline for correcting nanopore reads
MIT License

corrected.fasta empty #4

Closed macmanes closed 9 years ago

macmanes commented 9 years ago

Demoing the pipeline with lambda phage burn-in data. I get some warnings but am not sure what to do. I'm guessing the DALIGNER step has failed, but I'm not certain.

>make -f pipeline.make INPUT=../2d.fasta NAME=lambda
# a lot of screen output, no errors that made me think it didn't work properly

>python nanocorrect.py lambda 1000:1020 > corrected.fasta
...
...Wrote 2 sequences to CLUSTAL file clustal-1016.out...
...Read 1 sequences from sequence file poa.input.1017.fa...
*** WARNING: bundling ended prematurely after 1 bundles.
No sequences fit inside this last bundle.
A total of 1 sequences incuding consensus were bundled.

corrected.fasta is empty

>more clustal-1019.out

CLUSTAL W (1.74) multiple sequence alignment

poabaseread                         GTCTGTTGTGATATATTCCGGCGTGCTTGGGTGTTAACCTGGCGGCATAC
CONSENS0                            GTCTGTTGTGATATATTCCGGCGTGCTTGGGTGTTAACCTGGCGGCATAC
....

>more poa.input.1019.fa  #THIS CONTAINS ONLY 1 READ, WHICH I ASSUME IS THE ISSUE.
>poabaseread
GTCTGTTGTGATATATTCCGGCGTGCTTGGGTGTTAACCTGGCGGCATACTCGCGCGGGTTTTTCGCTATTTATGAAAATTTCCCGGTTTACGGCGTTTCCGTTCTTCTTTGCGTCAGACTTAATGTTTTATTTAAAATACCTGGACGAAAAGAAGGAAACGACAGTAGCTGAAATAGCGAGCTTTTGGCTCTGTCGTTTCCTTTCTGATTTGTCCTTGCGAATGAACAATGGAATCA

macmanes commented 9 years ago

On closer inspection of the pipeline.make screen output, there is a segfault buried in there, which seems to be related to DALIGNER (https://github.com/thegenemyers/DALIGNER):

HPCcommands.txt: line 2: 64830 Segmentation fault      (core dumped) daligner -d -t5 lambda.1 lambda.1
LAcat lambda > lambda.las
rm lambda.*.las

cc: @thegenemyers
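
Because the segfault was buried in a long stretch of make output, it may help to save the output to a file and scan it for crash messages. The following is a minimal sketch, not part of nanocorrect; the function name `find_crash_lines` and the list of crash phrases are assumptions chosen for illustration.

```python
import re

# Phrases that commonly appear in shell output when a subprocess crashes.
# This list is an illustrative assumption, not something nanocorrect defines.
CRASH_PATTERN = re.compile(r"Segmentation fault|core dumped|Aborted")


def find_crash_lines(log_text):
    """Return (line_number, line) pairs for log lines that look like a crash."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if CRASH_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

To use it, the make output could be captured to a file (for example by redirecting both stdout and stderr to a hypothetical `make.log`), read as text, and passed to `find_crash_lines`. An empty result does not prove that every step succeeded, only that none of the listed phrases appeared.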

jts commented 9 years ago

Can you provide example data to reproduce the problem?

macmanes commented 9 years ago

Sure, this is lambda burn-in data. It assembles into 1 contig with 97% sequence similarity to the lambda genome using wgs v8.3.

https://unh.box.com/2d-fasta-gz

macmanes commented 9 years ago

Jared, not sure if this is helpful, but I pulled this out of dmesg:

daligner[5428]: segfault at 7fff74000000 ip 00007fe18c8ab9bc sp 00007fff7595b6d8 error 4 in libc-2.19.so[7fe18c828000+1bb000]

macmanes commented 9 years ago

Any progress on this issue?

jts commented 9 years ago

Gene released a new version of DALIGNER, did you try it on your data? If you still have the problem I'll grab your demo data and check it out myself.

macmanes commented 9 years ago

Awesome, I didn't know. I'll check and report back.

kthlnktng commented 9 years ago

I had a similar issue with nanocorrect.py (also running lambda burn-in data), yielding these errors:

*** WARNING: bundling ended prematurely after 2 bundles.
No sequences fit inside this last bundle.
A total of 1 sequences incuding consensus were bundled.

The same error is repeated for every iterative alignment. I do get output in the corrected sequences fasta file, however. This warning looks to be deriving from POA, but is it something to be concerned about if I get output?

I don't have the same seg fault problem with DALIGNER as @macmanes, though. This is the output I got for the make:

$ make -f /home/apps/nanocorrect/nanocorrect-20150421/nanocorrect-overlap.make INPUT=pass_2D.fasta NAME=pass_2D
fasta2DB pass_2D pass_2D.pp.fasta
DBsplit -s50 pass_2D
DBdust pass_2D
HPCdaligner -t5 -mdust pass_2D > HPCcommands.txt
/bin/bash HPCcommands.txt
LAcat pass_2D > pass_2D.las
rm pass_2D.*.las

thegenemyers commented 9 years ago

This is not an error message from my code. -- Gene

jts commented 9 years ago

That is a message from POA and I believe it indicates only one input sequence was provided. This can happen if no overlaps for a read are found, or if there was a problem parsing the DALIGNER output.

I just downloaded @macmanes data and it is running as expected on my machine.
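
One way to check whether the POA warning comes from inputs with a single sequence is to count the FASTA records in each POA input file. This is a rough sketch under the assumption that the intermediate files follow the `poa.input.<N>.fa` naming seen earlier in this thread and are still on disk; the helper names are hypothetical and not part of nanocorrect.

```python
import glob


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def single_read_inputs(pattern="poa.input.*.fa"):
    """Return the POA input files that contain at most one sequence."""
    return sorted(
        path for path in glob.glob(pattern) if count_fasta_records(path) <= 1
    )
```

If `single_read_inputs()` returns most or all of the files, that would be consistent with no overlaps being found (or parsed) for those reads, which points back at the overlap step rather than at POA itself.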

kthlnktng commented 9 years ago

I assume this may be a versioning problem on my end. Can you tell me which versions of the dependencies you are using?

jts commented 9 years ago

DAZZ_DB: 8cb2f29c4011a2c2
daligner: 549da77b91395dd
nanocorrect: 9fbba13e
poaV2: http://downloads.sourceforge.net/project/poamsa/poamsa/2.0/poaV2.tar.gz

It might also be an install problem; nanocorrect is fragile right now (but we're working on it).

bforde commented 9 years ago

I have also encountered an issue with producing a corrected reads file. I am running your pipeline on our own test dataset. When nanocorrect.py is called I get the following error:

python nanocorrect/makerange.py raw.reads.fasta | parallel -v --progress -P 30 'python nanocorrect/nanocorrect.py raw.reads {} > raw.reads.{}.corrected.fasta'
/bin/bash -c python\ nanocorrect/nanocorrect.py\ raw.reads\ {}\ >\ raw.reads.{}.corrected.fasta

Computers / CPU cores / Max jobs to run
1:local / 40 / 1

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
Traceback (most recent call last):
  File "nanocorrect/nanocorrect.py", line 177, in <module>
    (start, end) = [ int(x) for x in read_range.split(':') ]
ValueError: invalid literal for int() with base 10: '{}'
local:0/1/100%/0.0s
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
make: *** [raw.reads.corrected.fasta] Error 1

Are the reads not being properly indexed?

Running makerange.py does produce the expected output:

1:50 51:100 101:150 151:200 201:250 ... 3251:3300 3301:3350 3351:3400 3401:3450

Brian

jts commented 9 years ago

It looks like the token that parallel should replace, '{}', is not being replaced. This is probably due to quoting. Not sure why it works on our system but not yours. What OS and version of GNU parallel are you using?
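
For illustration, a range argument could be validated before conversion so that an unsubstituted '{}' produces a readable message instead of a bare ValueError from int(). This is only a sketch of the idea; `parse_read_range` is a hypothetical helper and is not claimed to match what nanocorrect.py actually does.

```python
def parse_read_range(read_range):
    """Parse a 'start:end' string into a pair of ints, failing with a clear message."""
    # An unreplaced placeholder means the job launcher never substituted its token.
    if "{" in read_range or "}" in read_range:
        raise ValueError(
            "range argument %r still contains a placeholder; "
            "the '{}' token was not substituted by parallel" % read_range
        )
    parts = read_range.split(":")
    if len(parts) != 2:
        raise ValueError("expected a range of the form start:end, got %r" % read_range)
    try:
        start, end = int(parts[0]), int(parts[1])
    except ValueError:
        raise ValueError("range bounds must be integers, got %r" % read_range)
    if start > end:
        raise ValueError("range start exceeds end in %r" % read_range)
    return start, end
```

With this check, `parse_read_range("1000:1020")` gives the pair `(1000, 1020)`, while `parse_read_range("{}")` raises an error that names the placeholder problem directly.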

nickloman commented 9 years ago

There does seem to be a lot of variation between GNU parallel installations, particularly as Ubuntu sometimes uses 'tollef' mode by default (set in an options file) and there is a different program also called parallel in 'moreutils'. What I would suggest is downloading the latest GNU parallel and building from source, and then ensuring it is first in the PATH (or calling it directly). Alternatively, try adding '--gnu' to the parallel arguments. If this fails then you have the moreutils version, which you don't want.
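
A quick way to see which parallel is first in the PATH is to ask it for its version. The sketch below assumes that GNU parallel's `--version` output begins with the words "GNU parallel" and that the moreutils program does not print such a banner; both are assumptions to verify on your own system, and the function names are hypothetical.

```python
import shutil
import subprocess


def looks_like_gnu_parallel(version_output):
    """True if the text resembles a GNU parallel version banner (assumed format)."""
    stripped = version_output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return first_line.startswith("GNU parallel")


def describe_parallel():
    """Report which 'parallel' is first on PATH and whether it looks like GNU parallel."""
    path = shutil.which("parallel")
    if path is None:
        return "no 'parallel' found on PATH"
    try:
        result = subprocess.run(
            [path, "--version"],
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        return "%s: could not run --version (%s)" % (path, error)
    kind = "GNU parallel" if looks_like_gnu_parallel(result.stdout) else "not GNU parallel?"
    return "%s: %s" % (path, kind)
```

Calling `print(describe_parallel())` shows the resolved path, which also makes it obvious when a system package is shadowing a copy built from source.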

jts commented 9 years ago

I've just made a few changes to catch these problems and provide a readable error message. If anyone is still having issues running nanocorrect please open a new issue.