rrwick / Porechop

adapter trimmer for Oxford Nanopore reads
GNU General Public License v3.0

error running in porechop #54

Open kanchanhole opened 6 years ago

kanchanhole commented 6 years ago

Hello,

I am trying to use Porechop for adapter trimming on Nanopore reads, and I consistently get the following error:

Error: input_reads.fastq could not be parsed - is it formatted correctly?

I double checked the file. It is fine.

I am using the following command:

porechop -i input_reads.fastq -o output_reads.fastq

Am I missing anything?

TIA

lfaller commented 6 years ago

I got the same error and was able to track it down using nanoplot (it provides a more detailed error message).

NanoPlot --fastq input.fastq.gz --loglength --outdir log_scaled

Traceback (most recent call last):
  File "/home/lina/.local/bin/NanoPlot", line 11, in <module>
    sys.exit(main())
  File "/home/lina/.local/lib/python2.7/site-packages/nanoplot/NanoPlot.py", line 46, in main
    datadf, lengthprefix, logBool, readlengthsPointer = getInput(args)
  File "/home/lina/.local/lib/python2.7/site-packages/nanoplot/NanoPlot.py", line 148, in getInput
    datadf = pd.concat([nanoget.processFastqPlain(inp) for inp in args.fastq], ignore_index=True)
  File "/home/lina/.local/lib/python2.7/site-packages/nanoget/nanoget.py", line 243, in processFastqPlain
    for record in SeqIO.parse(inputfastq, "fastq"):
  File "/home/lina/.local/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 611, in parse
    for r in i:
  File "/home/lina/.local/lib/python2.7/site-packages/Bio/SeqIO/QualityIO.py", line 1033, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/home/lina/.local/lib/python2.7/site-packages/Bio/SeqIO/QualityIO.py", line 954, in FastqGeneralIterator
    % (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs  for ef273085-907d-49c2-a718-a0ee3a1b71eb runid=184b3ffc1e177f8e044bf254b791cc506e6483ae sampleid=input_sample read=690 ch=402 start_time=2018-04-07T01:46:09Z (4556 and 10912).

I am not sure if you have the same underlying error, but in my case, the sequence and quality lengths ended up differing.
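
To check for this kind of mismatch without installing NanoPlot, here is a minimal sketch that scans a FASTQ file and reports every record whose sequence and quality lengths differ. It assumes an uncompressed file with exactly four lines per record; the function name `find_length_mismatches` is made up for illustration and is not part of Porechop, NanoPlot, or Biopython.

```python
import sys


def find_length_mismatches(path):
    """Return (record_number, header, seq_len, qual_len) for each bad record.

    Assumption: plain-text FASTQ with exactly four lines per record.
    """
    mismatches = []
    with open(path) as handle:
        record_number = 0
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            seq = handle.readline().rstrip("\r\n")
            handle.readline()  # '+' separator line, not inspected here
            qual = handle.readline().rstrip("\r\n")
            record_number += 1
            if len(seq) != len(qual):
                mismatches.append(
                    (record_number, header.strip(), len(seq), len(qual))
                )
    return mismatches


# Only runs when called as: python <script>.py reads.fastq
if __name__ == "__main__" and len(sys.argv) == 2:
    for number, header, seq_len, qual_len in find_length_mismatches(sys.argv[1]):
        print("record %d (%s): %d bases but %d quality values"
              % (number, header, seq_len, qual_len))
```

If the file wraps sequences over several lines, the four-line assumption breaks and every later record will likely be flagged, which is itself a useful hint about the formatting.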

EDIT: Here is the link to NanoPlot: https://github.com/wdecoster/NanoPlot

lfaller commented 6 years ago

Ok, I just ran into this error again but this time, my data seems to be formatted well enough for nanoplot to run.

@rrwick can you think of other formatting issues that could cause this error message?

lfaller commented 6 years ago

Looking into the code, NanoPlot reads FASTQ through Biopython (via nanoget, as the traceback above shows), whereas Porechop implements its own FASTQ reader.

I still don't know what was wonky about my fastq data but was able to use the following code to "sanitize" it so that porechop can read it:

# Inspired by: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc282

# Usage: python <this_script>.py input.fastq output.fastq

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = sys.argv[1]
output_file = sys.argv[2]

# Re-read each record with Biopython and write it back out as a plain
# four-line record with a bare "+" separator line.
with open(input_file) as in_handle, open(output_file, "w") as out_handle:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
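
Since it is still unclear what exactly was off in the original file, a quick layout check on the unsanitized input might help narrow it down. The sketch below flags lines that deviate from a strict four-line record layout (blank lines, a header not starting with `@`, a separator not starting with `+`). Whether Porechop's own reader actually rejects these cases is an assumption, not something confirmed in this thread, and `check_four_line_layout` is a hypothetical helper name.

```python
def check_four_line_layout(path, max_problems=10):
    """Return up to max_problems (line_number, message) layout problems.

    Assumption: a strict FASTQ has exactly four lines per record
    (header, sequence, separator, quality) and no blank lines.
    """
    problems = []
    with open(path) as handle:
        for index, line in enumerate(handle):
            line_number = index + 1
            text = line.rstrip("\r\n")
            position = index % 4  # 0 header, 1 sequence, 2 separator, 3 quality
            if text == "":
                problems.append((line_number, "blank line"))
            elif position == 0 and not text.startswith("@"):
                problems.append((line_number, "expected a header starting with '@'"))
            elif position == 2 and not text.startswith("+"):
                problems.append((line_number, "expected a separator starting with '+'"))
            if len(problems) >= max_problems:
                break  # later reports are usually knock-on effects
    return problems


# Example call with a placeholder file name:
# for line_number, message in check_four_line_layout("input_reads.fastq"):
#     print("line %d: %s" % (line_number, message))
```

Only the first reported line is really informative: once the layout shifts by one line, everything after it is out of step as well.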