willbradshaw opened 1 month ago
Is this process simply taking a set of read IDs specified by `id_file` and extracting the corresponding reads from the FASTQ files in `reads_unconc`? If so, I suspect we can do this much faster (and probably about as fast as we'll get) with `seqtk subseq` (https://github.com/lh3/seqtk) instead of Biopython. (This is the approach I used for pulling the reads mapped to specific taxa from their IDs in the HV hits TSV.) It works on a local file or streaming from S3, e.g.:
```shell
aws s3 cp ${base_uri}/READS_1.fastq.gz - | seqtk subseq - READS.LIST > OUT_1.fastq
aws s3 cp ${base_uri}/READS_2.fastq.gz - | seqtk subseq - READS.LIST > OUT_2.fastq
```
If I'm understanding this correctly, I can try this out.
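For intuition, the subsetting `seqtk subseq` performs is just record-level filtering on the FASTQ header ID. A rough awk sketch of the same idea (toy file names, not the pipeline's; `seqtk subseq` remains the faster and more robust choice):

```shell
# Rough awk equivalent of ID-based FASTQ subsetting (illustration only;
# seqtk subseq is faster and handles more edge cases). Toy inputs:
printf 'r2\n' > ids.txt
printf '@r1 len=4\nACGT\n+\nIIII\n@r2 len=4\nTTTT\n+\nIIII\n' > reads.fastq

# Pass 1 loads the wanted IDs into a set; pass 2 keeps each 4-line FASTQ
# record whose header ID (first word, leading '@' stripped) is in the set.
awk 'NR==FNR { want[$1]; next }
     FNR % 4 == 1 { keep = (substr($1, 2) in want) }
     keep' ids.txt reads.fastq > subset.fastq
```

This only handles strict 4-line-per-record FASTQ; seqtk also copes with region subsetting and FASTA input.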
If the read-ID input is already in the required format (one read ID per line), then I think the following should work:
```shell
inp0="!{id_file}"
inp1="!{reads_unconc[0]}"
inp2="!{reads_unconc[1]}"
outp1="!{sample}_bowtie2_mapped_unconc_1.fastq.gz"
outp2="!{sample}_bowtie2_mapped_unconc_2.fastq.gz"
seqtk subseq "$inp1" "$inp0" | gzip > "$outp1"
seqtk subseq "$inp2" "$inp0" | gzip > "$outp2"
```
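As a quick sanity check on this approach: assuming every listed ID occurs exactly once in the input, the extracted read count should equal the number of IDs. A sketch with toy inputs (placeholder file names, not the pipeline's actual outputs):

```shell
# Hypothetical sanity check: seqtk subseq should emit one FASTQ record
# (4 lines) per requested ID. Toy files stand in for the real id_file
# and gzipped per-mate output.
printf 'r1\nr2\n' > read_ids.txt
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > subset_1.fastq.gz

n_ids=$(wc -l < read_ids.txt)
n_reads=$(( $(gzip -dc subset_1.fastq.gz | wc -l) / 4 ))
[ "$n_reads" -eq "$n_ids" ] && echo "counts match"
```

A mismatch would suggest IDs missing from the input FASTQ (or duplicate IDs), which is worth catching before closing this out.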
Thanks @mikemc! I've tried implementing your suggestion in the dev branch and will close this if it turns out to fix the problem.
The `EXTRACT_UNCONC_READS` process is often limiting for large runs. Need to rework it to be more efficient.