tszalay / poreseq

Error correction and variant calling algorithm for nanopore sequencing

poreseq consensus error #2

Open tramaraj opened 9 years ago

tramaraj commented 9 years ago

Hi Authors -

I have been using poreseq to error correct and assemble minion data for a bacterial genome.

I have a fasta file generated from the fast5 files. I ran it through the first step, alignment, which ran fine and produced a bam file. I then ran consensus using the fasta file, the alignment bam file, and the directory that contains all of the fast5 files. When I run it I get the following error messages:

Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/df20651c-9ea2-4ea3-a2ec-5cc5544f662e_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/df20651c-9ea2-4ea3-a2ec-5cc5544f662e_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/d8c2748e-fe20-4ff2-bcbd-dc8fed6095ec_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/d8c2748e-fe20-4ff2-bcbd-dc8fed6095ec_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/56cf5b01-22d0-46bd-a636-e2a87c0bfd81_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/56cf5b01-22d0-46bd-a636-e2a87c0bfd81_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/b20a1955-51e4-4e4d-822d-bd65e6657739_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/b20a1955-51e4-4e4d-822d-bd65e6657739_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/111f2df0-b8fa-4125-b76c-63171b10c3b9_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/111f2df0-b8fa-4125-b76c-63171b10c3b9_basecall_2d_000_2d', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
Unable to open file (Unable to open file: name = '/home/projects/sphene/minion/agrobacterium/agro_wgs/ch453_file33_twodirections:/home/projects/sphene/minion/agrobacterium/agro_wgs/us_f7rbl02_agrowgslib4_3617_1_ch453_file33_strand.fast5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)

It keeps going...

Here is the command I used:

poreseq consensus -p /sw/compbio/poreseq/poreseq/defaults.conf /home/projects/sphene/minion/PBcR/agro/Agro.2dir.nolambda.minion.fasta /home/projects/sphene/minion/PoreSeq/agro/allaligns_poreseq_default.bam /home/projects/sphene/minion/Agrobacterium/Agro_WGS -o corrected.fasta

Please let me know if you have any thoughts on rectifying this problem.

Thank You. Thiru.

tszalay commented 9 years ago

Hi Thiru,

The code uses the header names in the BAM (from the fastas used in alignment) to locate the fast5 files; more specifically, the names should be the filenames of the .fast5 files. It would appear that this is not the case, so I would recommend recreating the fasta using "poreseq extract" and realigning, which should use the correct headers.
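One quick way to confirm a header mismatch before rerunning consensus is to compare the FASTA header names against the contents of the fast5 directory. The following is a minimal sketch, not part of poreseq: the helper names `fasta_headers` and `missing_fast5` are hypothetical, and because this thread does not say whether the header is expected to include the `.fast5` extension, the check accepts either form.

```python
import os


def fasta_headers(fasta_path):
    """Yield the first whitespace-delimited token of each FASTA header line."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def missing_fast5(fasta_path, fast5_dir):
    """Return header names that do not correspond to a file in fast5_dir.

    Assumption: a header matches if it equals a filename in the directory,
    with or without the .fast5 extension.
    """
    present = set(os.listdir(fast5_dir))
    return [
        name
        for name in fasta_headers(fasta_path)
        if name not in present and name + ".fast5" not in present
    ]
```

If `missing_fast5("reads.fasta", "/path/to/fast5")` returns a long list, the headers probably came from a different extraction tool, which matches the "no such file or directory" errors above.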

Let me know if that works!

-Tamas

tramaraj commented 9 years ago

Great!

Thanks, Tamas.

Will give it a try and let you know how it goes.

Thiru.

tramaraj commented 9 years ago

Hi Tamas -

Thanks for your suggestions.

The extract and align programs ran to completion in a very reasonable time frame. I then started the consensus program on Friday, and it is still running. That seems too long for a bacterial genome at approximately 30X coverage. Is this expected, and is there a way to speed up the process?

Please let me know if you have any suggestions/comments regarding the runtime issue.

Thank You. Thiru.

tszalay commented 9 years ago

Hi Thiru,

Doing self-correction on a bacterial genome is an extremely time-consuming procedure, and you will likely need access to a large cluster for it to complete in a reasonable amount of time. The readme points out that self-correcting all the fragments in lambda can take tens of hours, so a bacterial genome could easily take thousands of CPU hours. I would recommend first checking whether PBcR/Celera can assemble the uncorrected fragments without using poreseq, and then running the poreseq error correction on the assembled sequence to refine the accuracy. If that doesn't work, you might also want to look at Loman and Simpson's DALIGNER/POA-based self correction, which might be faster (but still might need a cluster).
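The jump from "tens of hours" to "thousands of CPU hours" can be checked with back-of-the-envelope arithmetic. This is only a rough sketch under loudly stated assumptions: the cost is taken to scale linearly with genome size at comparable coverage, 20 hours is an arbitrary stand-in for "tens of hours", and about 5.7 Mb is assumed for the Agrobacterium genome. None of these figures come from the poreseq README.

```python
LAMBDA_GENOME_BP = 48_502        # length of the lambda phage reference genome
BACTERIAL_GENOME_BP = 5_700_000  # assumption: roughly 5.7 Mb Agrobacterium genome
LAMBDA_CPU_HOURS = 20            # assumption: stand-in for "tens of hours"


def scaled_cpu_hours(lambda_hours, lambda_bp, target_bp):
    """Scale a runtime linearly with genome size (assumes similar coverage)."""
    return lambda_hours * target_bp / lambda_bp


estimate = scaled_cpu_hours(LAMBDA_CPU_HOURS, LAMBDA_GENOME_BP, BACTERIAL_GENOME_BP)
print(f"about {estimate:.0f} CPU hours")
```

With these inputs the estimate lands in the low thousands of CPU hours, which is consistent with the figure quoted above.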

Best, Tamas

tramaraj commented 9 years ago

Hi Tamas -

Thanks for your note!

If I follow your suggestion and assemble the uncorrected fragments using PBcR/Celera and then run the poreseq error correction on the assembled sequence, won't I run into the same header error we encountered earlier? It will fail when it tries to use the header names in the BAM (from the fastas used in alignment) to locate the fast5 files, as you mentioned earlier.

Please let me know what you think

Thanks again. Thiru.

tszalay commented 9 years ago

Hi Thiru,

When the error correction runs in poreseq, it takes one set of sequences that get error-corrected and another set of sequences that it uses for the correction (in the case of self-correction, both sets of sequences are the same). The sequences used for the correction (the reads) must be the correctly named FASTA sequences extracted from ONT reads, but the sequences being corrected can be anything, including output from PBcR.

For example (if pbcr_assemblies.fasta is your output from pbcr):

poreseq_align ./pbcr_assemblies.fasta ./allreads.fasta pbcr_aligns

(note that the pbcr file has to go first, so that it is treated as the reference in the BAM file)

poreseq consensus ./pbcr_assemblies.fasta ./pbcr_aligns.bam /media/run-33 -o pbcr_corrected.fasta [other flags]

(here again, ref is the pbcr-assembled file and it looks for the names of the sequences aligned to the reference in the given folder)
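If you script this, a small wrapper can keep the argument order straight, since the assembly must come first in both steps. The sketch below only builds the two command lines exactly as written in this comment; the function name `correction_commands` is hypothetical, and the assumption that the alignment step writes `<prefix>.bam` is inferred from the example above rather than from the poreseq documentation.

```python
import shlex


def correction_commands(assembly, reads, fast5_dir, prefix, output):
    """Build the align and consensus argument lists, assembly first in both."""
    align = ["poreseq_align", assembly, reads, prefix]
    # Assumption: the align step writes its output to <prefix>.bam
    consensus = [
        "poreseq", "consensus", assembly, prefix + ".bam", fast5_dir,
        "-o", output,
    ]
    return align, consensus


align, consensus = correction_commands(
    "./pbcr_assemblies.fasta", "./allreads.fasta", "/media/run-33",
    "pbcr_aligns", "pbcr_corrected.fasta",
)
for cmd in (align, consensus):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.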

I'll add these to the README soon. I should have made that clearer, sorry.

Best, -Tamas