open2c / distiller-nf

A modular Hi-C mapping pipeline
MIT License

chunking and fastqc assumes gzipped fastqs #86

Open golobor opened 6 years ago

sergpolly commented 6 years ago

it happens here:


process chunk_fastqs {
    tag "library:${library} run:${run}"
    storeDir getIntermediateDir('fastq_chunks')

    input:
    set val(library), val(run),file(fastq1), file(fastq2) from LIB_RUN_FASTQS_FOR_CHUNK

    output:
    set library, run, 
        "${library}.${run}.*.1.fastq.gz", 
        "${library}.${run}.*.2.fastq.gz" into LIB_RUN_FASTQ_CHUNKED

    script:
    chunksize_lines = 4 * params['map'].chunksize

    """
    zcat ${fastq1} | split -l ${chunksize_lines} -d \
        --filter 'pbgzip -c -n ${task.cpus} > \$FILE.1.fastq.gz' - \
        ${library}.${run}.
    zcat ${fastq2} | split -l ${chunksize_lines} -d \
        --filter 'pbgzip -c -n ${task.cpus} > \$FILE.2.fastq.gz' - \
        ${library}.${run}.
    """
}

where we feed the inputs ${fastq1} and ${fastq2} through zcat without checking whether they are gzipped or not...

check the suffix somehow, or ... simply throw an error, if we don't want to deal with anything other than gzipped fastqs.
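A rough sketch of both options in bash, in case it helps the discussion. It checks the first two bytes of the file for the gzip magic number (1f 8b) rather than trusting the suffix. The function names `cat_fastq` and `require_gzipped` are made up for illustration and are not part of distiller-nf; the error message wording is also just a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of distiller-nf.

# Option 1: stream a FASTQ to stdout, decompressing only when the file
# starts with the gzip magic bytes 1f 8b; otherwise pass it through as-is.
cat_fastq() {
    local path="$1"
    local magic
    # Read the first two bytes and render them as plain hex, e.g. "1f8b".
    magic=$(head -c 2 "$path" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        gzip -dc "$path"
    else
        cat "$path"
    fi
}

# Option 2: refuse anything that is not gzipped, with a readable message
# on stderr and a non-zero return code.
require_gzipped() {
    local path="$1"
    local magic
    magic=$(head -c 2 "$path" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" != "1f8b" ]; then
        echo "ERROR: ${path} is not gzip-compressed; gzipped FASTQs are expected" >&2
        return 1
    fi
}
```

With something like this, `zcat ${fastq1}` in the process could become either `cat_fastq ${fastq1}` or `require_gzipped ${fastq1} && zcat ${fastq1}`. Note that inside a Nextflow `script:` block the shell `$` signs would need escaping as `\$`, the same way `\$FILE` is escaped in the existing code.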

Also todo: I'd like to see how nextflow's splitFastq (https://www.nextflow.io/docs/latest/operator.html#splitfastq) works - maybe it's timely to test it along with fixing this bug

sergpolly commented 6 years ago

it's actually a bug - @golobor, please add a label to draw attention to it

golobor commented 5 years ago

@sergpolly remind me again - how is this a bug?.. fastqs are way too big to be distributed uncompressed, it's safe to assume that they are gzipped. As for the splitFastq - it does work, but we do have more control using custom chunking processes, e.g. we do not duplicate data the way splitFastq does, which is a major factor for big projects.

sergpolly commented 5 years ago

@golobor it was someone in the lab, or elsewhere (maybe even myself), who tried to feed uncompressed fastqs into distiller - and the error or behaviour seemed rather cryptic at the time.

I understand that this is a ridiculous scenario but nonetheless. At least if we keep this issue someone might find out about it and we would not forget to mention it in the docs