fastq file has unrecognized type

ssnn-airr commented 7 years ago

Original report by Scott Christley (Bitbucket: [Scott Christley](https://bitbucket.org/Scott Christley), ).

I have a user that uploaded a set of paired-end read files and is trying to run pRESTO on them, but it fails during the assemble stage.

AssemblePairs.py align -1 Bb_R1.t4.fastq -2 Bb_R2.t4.fastq --coord illumina --rc tail --outname Bb_R1.t4
/work/01114/vdj/lonestar/production/presto-0.5.2/lib/python3.5/site-packages/presto-0.5.2-py3.5.egg/EGG-INFO/scripts/AssemblePairs.py:112: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  p_matrix[x, i:] = 1 - stats.binom.cdf(x - 1, k[i:], 0.25) - stats.binom.pmf(x, k[i:], 0.25) / 2.0
/work/01114/vdj/lonestar/production/presto-0.5.2/lib/python3.5/site-packages/presto-0.5.2-py3.5.egg/EGG-INFO/scripts/AssemblePairs.py:131: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  z_matrix[x, j:] = (x - k[j:]/4.0)/np.sqrt(3.0/16.0*k[j:])
       START> AssemblePairs
     COMMAND> align
       FILE1> Bb_R1.t4.fastq
       FILE2> Bb_R2.t4.fastq
  COORD_TYPE> illumina
       ALPHA> 1e-05
   MAX_ERROR> 0.3
     MIN_LEN> 8
     MAX_LEN> 1000
SCAN_REVERSE> False
       NPROC> 48

ERROR:  File Bb_R1.t4.fastq has an unrecognized type

technically I probably shouldn't be specifying --coord illumina, but I tried the other possibilities and none of them worked. I also tried running ConvertHeaders with all the different conversion methods but they all produce unrecognized type error. Here are the first few sequences in each file

R1:

@MIG UMI:TCGGCCAACAAA:8
CGCACGTACTAGCAGTGGTATCAACGCAGAGTTCGGTCCAATCAAATCTTGGGGGGAGCACAGACACAGTGCTGCCTGCCCCTTTGTGCCATGGGCTCCAGGCTGCTCTGTTGGGTGCTGCTTTG
+
.5.'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@MIG UMI:ACCTCACGGAGG:14
GGGGCGTACTAGCAGTGGTATCAACGCAGAGTACCTTCACGTGAGGTCTTGGGGGAGAGAAGGTGGTGTGAGGCCATCACGGAAGATGCTGCTGCTTCTGCTGCTTCTGGGGCCAGGCTCCGGGCT
+
..*2IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIEIIIIIIIIII2#
@MIG UMI:ATAAGCCCGAGA:9
CGATCGTACTAGCAGTGGTATCAACGCAGAGTATAATGCCCTGAGATCTTGGGGGAGAGTCCTGCTCCCCTTTCATCAATGCACAGATACAGAAGACCCCTCCGTCATGCAGCATCTGCCATGAG
+
+1%%IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIICIIII

R2:

@MIG UMI:TCGGCCAACAAA:6
GGGGGACTCGGCCCTTTATCTTTGCGCCAGCAGCTCTATAGCGGGGGGGACAGATACGCAGTATTTTGGCCCAGGCACCCGGCTGACAGTGCTCGAGGACCTGAACAAGGTGTAGCTAGAATAAG
+
IIII@7I@IIII@@I@I@@I7@@I@II7III@II7I@7@.IIIIIIIIIIIIIIIIII@IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII@@IIIIIIIIIII@II7IIIIIIIIIIII..%.
@MIG UMI:ACCTCACGGAGG:8
CGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTAGAGCGGGGGCCTATGGCTACACCTTCGGTTCGGGGACCAGGTTAACCGTTGTAGAGGACCTGAACAAGGTGTAGCTAGAAATAC
+
5IBII;BIIIII.I5I5IIIIBII;IIIIBIIIBIIBIBI;I5IIIIIIIIIIIBIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;II.IIIIIIIIIIII''''
@MIG UMI:ATAAGCCCGAGA:4
CCAGACATCTGTGTACTTCTGTGCCAGCAAGCCCTACGTACAGGATCCTGGAAACACCATATATTTTGGAGAGGGAAGTTGGCTCACTGTTGTAGAGGACCTGAACAACGTGTAGCTAGAACCAA
+
I;.I.II;I;IIIIIII.I;I;III;III;I;;I;;;I;;;;II;;;I;IIIIIIIIIIIIIII;IIII;I;IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;IIIIIIIIIIII.;.#

ssnn-airr commented 7 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

This is indeed a MIGEC header - specifically the consensus sequence output.

Mike is making a change in MIGEC v1.2.7 that should resolve the Bio.SeqIO.index incompatibility.

I've add a migec mode to ConvertHeaders in bc50b30 that should work with the new MIGEC version.

ssnn-airr commented 7 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

Sort of resolved. Added a converter in ConvertHeaders, but supporting malformed headers in a general sense would be too much effort.

ssnn-airr commented 7 years ago

Original comment by Scott Christley (Bitbucket: [Scott Christley](https://bitbucket.org/Scott Christley), ).

I didn't run with only those sequences, I just cut/paste a few sequences from the top of each file. I kinda assumed it was an issue with the read_id (got read id's on the brain today!), but I will try a run with just those few sequences. I've no idea what the format is, I could try to contact the user to ask, but we cannot rely on getting a response. FYI- this isn't a user reported error, I watch jobs and pro-actively check on errors. Users have a tendency to ignore errors and/or not bother to report them...

ssnn-airr commented 7 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

Weird. I'll look at it. It fails with just those top 4 sequences, yes? I suspect it's because the header format isn't supported. Is that MiGEC format?

It would need to use the UMI:TCGGCCAACAAA bit to pair the reads and doesn't know how to extract it. I guess we could also add an optional flag to ignore the headers and just blindly trust that the reads are paired in file order. Makes me nervous, but it's an option.

Oh, BTW... just noticed. --nproc 48 probably won't be appreciably faster than --nproc 20. Scaling is not the best with Python's multiprocessing library...

ssnn-airr commented 7 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

The warning is from an incompatibility between pRESTO v0.5.2 and newer versions of NumPy/SciPy. It should be fixed in v0.5.3. Can you update pRESTO?

The unrecognized type is probably because it thinks .t4.fastq is the file extension instead of .fastq. That shouldn't be happening, because it would mean that os.path.splitext() isn't well behaved. I'll take a look tomorrow.

ssnn-airr commented 7 years ago

Original comment by Scott Christley (Bitbucket: [Scott Christley](https://bitbucket.org/Scott Christley), ).

Yep, you can test with these files.

ssnn-airr commented 7 years ago

Original comment by Scott Christley (Bitbucket: [Scott Christley](https://bitbucket.org/Scott Christley), ).

Yeah, I'll see about upgrading pRESTO for the next VDJServer release.

I tried changing the file names, but same error:

AssemblePairs.py align -1 Bb_R1_t4.fastq -2 Bb_R2_t4.fastq --coord illumina --rc tail --outname Bb_R1_t4
       START> AssemblePairs
     COMMAND> align
       FILE1> Bb_R1_t4.fastq
       FILE2> Bb_R2_t4.fastq
  COORD_TYPE> illumina
       ALPHA> 1e-05
   MAX_ERROR> 0.3
     MIN_LEN> 8
     MAX_LEN> 1000
SCAN_REVERSE> False
       NPROC> 48

ERROR:  File Bb_R1_t4.fastq has an unrecognized type

ssnn-airr commented 7 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

Yeah, it's the header. The new presto has a more informative error message:

ERROR:  File bad_header/test_r1.fastq is invalid with exception Duplicate key 'MIG'

Hrm. Have to think about what to do with this.

ssnn-airr commented 7 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).

This is the problem:

from Bio import SeqIO
x = SeqIO.index('/home/jason/Downloads/test_r1.fastq', 'fastq')

ValueError: Duplicate key 'MIG'

immcantation / presto

fastq file has unrecognized type #54