naobservatory / mgs-workflow


EXTRACT_UNCONC_READS is too slow #10

Open willbradshaw opened 1 month ago

willbradshaw commented 1 month ago

The EXTRACT_UNCONC_READS process is often limiting for large runs. Need to rework it to be more efficient.

mikemc commented 2 weeks ago

Is this process simply taking a set of read ids specified by id_file and getting the corresponding reads from the fastq files in reads_unconc? If so, then I'm guessing we can do this much faster (and probably about as fast as we'll get) with seqtk subseq (https://github.com/lh3/seqtk) instead of Biopython. (This is the approach I used for pulling the reads mapped to specific taxa from their ids in the HV hits tsv.) This can be done on a local file or by streaming from S3, e.g.

    aws s3 cp ${base_uri}/READS_1.fastq.gz - | seqtk subseq - READS.LIST > OUT_1.fastq
    aws s3 cp ${base_uri}/READS_2.fastq.gz - | seqtk subseq - READS.LIST > OUT_2.fastq

If I'm understanding this correctly, I can try this out.
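For context on what seqtk subseq is replacing, here is a minimal pure-Python sketch of the same per-record filtering (stream a gzipped FASTQ, keep records whose id is in an id list). seqtk does this in C with a hash lookup, which is why it should be much faster than a Biopython loop. Function names and the id-matching rule (header token before the first whitespace) are illustrative assumptions, not taken from the pipeline:

```python
import gzip

def load_ids(id_path):
    # One read id per line, matching the format seqtk subseq expects.
    with open(id_path) as f:
        return {line.strip() for line in f if line.strip()}

def extract_reads(fastq_gz_path, ids, out_path):
    # FASTQ records are 4 lines each; the header line starts with '@'.
    # Assumes the read id is the header token before the first whitespace.
    with gzip.open(fastq_gz_path, "rt") as inp, gzip.open(out_path, "wt") as out:
        while True:
            header = inp.readline()
            if not header:
                break
            seq, plus, qual = inp.readline(), inp.readline(), inp.readline()
            read_id = header[1:].split()[0]
            if read_id in ids:
                out.write(header + seq + plus + qual)
```

The set lookup keeps the per-record cost constant, so the whole pass is a single streaming scan of each FASTQ file, just like the seqtk pipeline above.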

mikemc commented 2 weeks ago

If the read-id input already has the needed format of one read id per line, then I think the following should work:

        # Inputs: read-id list (one id per line) and the paired unconcordant FASTQ files
        inp0="!{id_file}"
        inp1="!{reads_unconc[0]}"
        inp2="!{reads_unconc[1]}"
        outp1="!{sample}_bowtie2_mapped_unconc_1.fastq.gz"
        outp2="!{sample}_bowtie2_mapped_unconc_2.fastq.gz"
        # Extract the listed reads from each mate file and recompress
        seqtk subseq $inp1 $inp0 | gzip > $outp1
        seqtk subseq $inp2 $inp0 | gzip > $outp2

willbradshaw commented 1 week ago

Thanks @mikemc, I've tried implementing your suggestion in the dev branch and will close this if that turns out to fix the problem.