merenlab / illumina-utils

A library and collection of scripts to work with Illumina paired-end data (for CASAVA 1.7+ pipeline).
GNU General Public License v2.0

iu-merge-pairs error #3

Closed by nikolay12 7 years ago

nikolay12 commented 7 years ago

Here is my case. I got my V1-V3 data sequenced by an external provider. They say they used the Illumina Casava pipeline version 1.8.3.

As a start I tried to generate a config file. I followed the steps listed at https://github.com/meren/illumina-utils:

I first generated a tab file listing all the sample names and the corresponding paired-end fastq files. Then I ran iu-gen-configs and was surprised that, instead of generating a single config file, it generated a config file for each sample.

Then I decided to merge the paired-end fastq files for each sample with iu-merge-pairs, using the --compute-qual-dicts option. When I ran it for the first sample it produced the following error:

Error: Your input FASTQ files do not seem to be generated by CASAVA 1.8. Please use --ignore-deflines parameter.

I added the parameter as requested. Then I got another error message:

$ iu-merge-pairs --compute-qual-dicts --ignore-deflines 16001_posD09_CCTAAGACACTGCATA.ini
Traceback (most recent call last):
  File "/usr/local/bin/iu-merge-pairs", line 770, in <module>
    sys.exit(merger.run())
  File "/usr/local/bin/iu-merge-pairs", line 398, in run
    tile_number = self.input_1.entry.tile_number
  File "/Library/Python/2.7/site-packages/IlluminaUtils/lib/fastqlib.py", line 82, in __getattr__
    return getattr(self, '_'.join(['process', key]))()
(...)
  File "/Library/Python/2.7/site-packages/IlluminaUtils/lib/fastqlib.py", line 82, in __getattr__
    return getattr(self, '_'.join(['process', key]))()
  File "/Library/Python/2.7/site-packages/IlluminaUtils/lib/fastqlib.py", line 73, in __getattr__
    if key in ['__str__']: 
RuntimeError: maximum recursion depth exceeded in cmp

I don't know what to do now. Can you, perhaps, advise?

meren commented 7 years ago

Hi,

Can you please provide some example files? The first 1,000 lines of R1 and R2 reads for one of your samples in that dataset would be the best (i.e., example-R1.fastq and example-R2.fastq).

Thanks,

nikolay12 commented 7 years ago

Thanks for your quick reply. I'm attaching the first 1000 lines. 16001_posD09_CCTAAGACACTGCATA_R1_1000.fastq.gz 16001_posD09_CCTAAGACACTGCATA_R2_1000.fastq.gz

meren commented 7 years ago

Hi,

I found what causes the error, and I will add a check for it in the next version when I have a chance. Clearly it should be illegal to use the --compute-qual-dicts flag when --ignore-deflines is used :( So your short-term solution is to not use --compute-qual-dicts. I am sorry for that.
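For anyone curious why this shows up as a recursion error rather than a plain missing-attribute error, here is a minimal sketch of the pattern visible in the traceback (line 82 of fastqlib.py). The `Entry` class and the helper function below are hypothetical stand-ins, not the real fastqlib code, and the sketch assumes that with --ignore-deflines the entry has neither a `tile_number` attribute nor a `process_tile_number` method.

```python
class Entry:
    """Hypothetical stand-in for a FASTQ entry; NOT the real fastqlib class."""

    def __getattr__(self, key):
        # __getattr__ runs only when normal attribute lookup fails.
        # It delegates to a method named process_<key>. If that method is
        # also missing, this getattr call re-enters __getattr__ with
        # "process_<key>", then "process_process_<key>", and so on.
        return getattr(self, "_".join(["process", key]))()


def lookup_fails_with_recursion(attr_name="tile_number"):
    """Return True if looking up attr_name exhausts the recursion limit."""
    try:
        getattr(Entry(), attr_name)
    except RecursionError:  # a RuntimeError subclass on Python 3.5+
        return True
    return False
```

On Python 2.7, as in the report above, the same situation surfaces as `RuntimeError: maximum recursion depth exceeded`.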

The reason you are forced to use --ignore-deflines with these files is that their headers are not what we expect to see with CASAVA 1.8+. This is what a header should look like:

@D4ZHLFP1:36:C10H4ACXX:8:2203:21201:39665 1:N:0:CCAT

and this is what yours look like:

@HWI-M04481:31:000000000-ARVP8:1:1101:7941:1899/1

If you look at the specification on the Illumina page, you will see that the format they describe matches the first one:

http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm
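If you want to check other files quickly, the difference between the two headers can be tested with a few lines of Python. This is only a rough sketch: the names `CASAVA_18_DEFLINE` and `looks_like_casava_18` are made up for illustration and are not part of illumina-utils, and the regular expression is my approximation of the layout (seven colon-separated fields, a space, then read:filter:control:index), not the check the library itself performs.

```python
import re

# Approximate CASAVA 1.8+ defline layout (an assumption for illustration):
# @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
CASAVA_18_DEFLINE = re.compile(
    r"^@[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+ [12]:[YN]:\d+:\S*$"
)


def looks_like_casava_18(defline):
    """Return True if the defline matches the approximate 1.8+ layout."""
    return CASAVA_18_DEFLINE.match(defline.strip()) is not None


expected = "@D4ZHLFP1:36:C10H4ACXX:8:2203:21201:39665 1:N:0:CCAT"
reported = "@HWI-M04481:31:000000000-ARVP8:1:1101:7941:1899/1"

print(looks_like_casava_18(expected))  # header in the expected format
print(looks_like_casava_18(reported))  # header ending in /1
```

The second header fails because it ends in `/1` where the expected format has a space followed by the `1:N:0:CCAT` part.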

Best,