iu-subsample-fastq not recognizing paired ends

luuuuuuuke commented 3 years ago

I am trying to subsample paired-end FASTQ reads with iu-subsample-fastq, but I am getting an error stating that the paired-end files have different lengths. However, when I grep for "@" in each paired-end FASTQ file, I get the same number of reads. Any ideas what's happening here? [Screenshot at 2020-10-19 13-43-39]

luuuuuuuke commented 3 years ago

forgot to mention, i'm using illumina-utils v2.8

meren commented 3 years ago

This is weird :(

Can you wc -l both files without grep?
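For context on why the line count is the safer check: "@" is also a valid quality character in FASTQ (Phred+33), so quality lines can start with it and a grep for "@" may overcount. A minimal sketch of both counts, assuming plain four-line FASTQ records (no wrapped sequences); the function names here are made up for illustration and are not part of illumina-utils:

```python
import gzip


def _open_text(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def count_fastq_reads(path):
    """Count reads as (number of lines) / 4, like `wc -l` divided by four."""
    with _open_text(path) as handle:
        num_lines = sum(1 for _ in handle)
    if num_lines % 4:
        raise ValueError(f"{path}: {num_lines} lines is not a multiple of 4")
    return num_lines // 4


def count_at_prefixed_lines(path):
    """Count lines starting with '@', roughly what grepping for '^@' reports."""
    with _open_text(path) as handle:
        return sum(1 for line in handle if line.startswith("@"))
```

If the two functions disagree for a file, the difference comes from quality lines that happen to begin with "@", and the line-based count is the one to trust when comparing R1 and R2.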

luuuuuuuke commented 3 years ago

Sure, see screenshot. It still looks like the same length :/ [Screenshot at 2020-10-19 14-42-10]

meren commented 3 years ago

I found the issue. The code compares file lengths, which are not actual line counts but very crude approximations based on file size. The approximation avoids the substantial wait needed to learn the actual number of reads (sounds stupid, but it is very useful for showing a progress bar without knowing the actual number of reads). So when the file sizes of R1 and R2 differ even slightly, those lines of code assume that the files contain different numbers of reads. @ekiefl couldn't have known that, and I missed those lines when we had the chance to correct the problem.
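To illustrate the failure mode, here is a rough sketch of a size-based estimate. This is an assumption about the general idea only, not the actual illumina-utils code: the function name, the sampling strategy, and the sample_records parameter are all hypothetical.

```python
import os


def estimate_num_reads(path, sample_records=100):
    """Estimate the read count from the file size and the first few records.

    Hypothetical helper: average the byte size of the first `sample_records`
    four-line records, then divide the total file size by that average.
    """
    sampled_bytes = 0
    sampled_records = 0
    with open(path, "rb") as handle:
        while sampled_records < sample_records:
            record = [handle.readline() for _ in range(4)]
            if not record[3]:  # end of file or incomplete record
                break
            sampled_bytes += sum(len(line) for line in record)
            sampled_records += 1
    if sampled_records == 0:
        return 0
    return round(os.path.getsize(path) * sampled_records / sampled_bytes)
```

An estimate like this is cheap enough to drive a progress bar, but two files holding the same number of reads can yield different values whenever their byte sizes differ, for example when some R2 reads are shorter than their R1 mates. Comparing such estimates with != is therefore not a reliable pairing check.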

But let's solve your problem for now, and I will keep this issue open to address it later. Please find the exact location of the program on your computer by running this:

which iu-subsample-fastq

then open that file in your text editor and remove these three lines from it before saving the file (they should be around line 58):

        if int(input_fastq_2.file_length) != num_input_reads:
            raise h.ConfigError("These aren't paired FASTQ files. The length of --r1 is {} but the length of --r2 is {}".\
                                 format(int(input_fastq_2.file_length), num_input_reads))

This should solve it :)

luuuuuuuke commented 3 years ago

Ok, those changes have been made and that error is resolved, but now it is giving a NameError (NameError: name 'input_fastq_2' is not defined). [Screenshot: inputfastqerror]

meren commented 3 years ago

The only way that could have happened is this: you unintentionally deleted the line where input_fastq_2 is defined, which sits just above the lines I suggested you delete:

input_fastq_2 = u.FastQSource(input_file_path_2)

You can see it in the original: https://github.com/merenlab/illumina-utils/blob/master/scripts/iu-subsample-fastq#L57

Sorry this has been a pain, Luke.

luuuuuuuke commented 3 years ago

Hi Meren--apologies for the late response. You were right about the accidental deletion of one line of code, which our admin has corrected. I re-ran the same command and have not gotten an error, but the program has been running for 15 days and hasn't finished. Do you have any suggestions for how to do this more efficiently? I don't believe the fastq files I am using (11.4Gb) are particularly large compared to other metagenomic datasets.

meren commented 3 years ago

> I re-ran the same command and have not gotten an error, but the program has been running for 15 days and hasn't finished.

o_O

This is very bad :) It shouldn't have taken more than a few minutes. I would like to look into this and release another version of illumina-utils probably.

Just to make sure I'm 100% on the right track: we want iu-subsample-fastq to run crazy fast, right?

luuuuuuuke commented 3 years ago

haha yes that is the goal!

meren commented 3 years ago

Hey @luuuuuuuke, apologies for this again.

If you run the following command in your environment, it should update your illumina-utils to v2.10:

pip install --upgrade illumina-utils

After that, iu-subsample-fastq should take significantly less time (minutes rather than days).
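If it is unclear whether the upgrade took effect, the installed version can be checked from Python with the standard library (Python 3.8 or later). A small sketch; installed_version is a helper name made up here, and it must be run in the same environment that provides the iu-* scripts:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version pip installed, or None if the package is not found.
    print(installed_version("illumina-utils"))
```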

Please let me know if something goes wrong.

luuuuuuuke commented 3 years ago

This worked perfectly in about 3 minutes--thank you!

meren commented 3 years ago

Excellent! :)

merenlab / illumina-utils

iu-subsample-fastq not recognizing paired ends #27