naobservatory / mgs-workflow

join_fastq is a bit of a bottleneck #44

Open evanfields opened 1 month ago

evanfields commented 1 month ago

I recently ran the cleaning and ribodepletion steps of the pipeline (this branch: https://github.com/naobservatory/mgs-workflow/tree/ef_only_cleaning) on four pairs of read files, where each read file was about 6 GB compressed. Total time was about 5.5 elapsed hours (145 CPU hours). The wall time was significantly affected by join_fastq.py:

[Screenshot: per-process run times, with join_fastq.py accounting for a large share of the wall time.]

At a glance, it looks like join_fastq contributes 2-3 hours of the wall time. Mike suggests that for my use case just using the mix flag in BBMerge should be fine; I'm not sure whether that works for the general use case of the pipeline refactor.

willbradshaw commented 1 month ago

Mm, yes, you're right. This is another case where a dumb serial Python script is eating up excess time.

I don't think the mix flag is a good fit for the broader use case, since I think it keeps the unmerged read pairs separate. But this should be pretty easy to parallelize to bring down the clock time.

mikemc commented 1 month ago

If we can modify the Python script to work on batches of reads (if not the entire read file) rather than looping over individual reads, I think that would give a big speedup. Even looping over individual reads to create the new records may be fine, as long as we write in batches rather than calling SeqIO.write() for every read. I could probably prototype something in R using the Biostrings package reasonably quickly, but I haven't used Biopython in a while.
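
For concreteness, here is a minimal sketch of that batched-write idea in Biopython. It is an illustration under assumptions, not the pipeline's actual code: it assumes the join convention the R prototype below reproduces (forward read, an N spacer at Phred 0, then the reverse-complemented reverse read, with "joined" inserted after the read ID), and the names join_pair, join_fastq, and BATCH_SIZE are made up for the example.

from itertools import islice

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

BATCH_SIZE = 100_000  # illustrative; tune to available memory

def join_pair(fwd, rev):
    # Join one pair: forward read + "N" spacer + reverse-complemented reverse
    # read. reverse_complement() also reverses the per-base qualities.
    rev_rc = rev.reverse_complement()
    joined = SeqRecord(
        fwd.seq + Seq("N") + rev_rc.seq,
        id=fwd.id,
        # Insert "joined" after the read ID, as in the R prototype below.
        description=fwd.description.replace(" ", " joined ", 1),
    )
    # Phred 0 ("!") for the N spacer.
    joined.letter_annotations["phred_quality"] = (
        fwd.letter_annotations["phred_quality"]
        + [0]
        + rev_rc.letter_annotations["phred_quality"]
    )
    return joined

def join_fastq(fwd_path, rev_path, out_path):
    # Streams both files in parallel; gzipped input would need gzip.open().
    pairs = zip(SeqIO.parse(fwd_path, "fastq"), SeqIO.parse(rev_path, "fastq"))
    with open(out_path, "w") as handle:
        while True:
            # Buffer BATCH_SIZE joined reads, then write them with a single
            # SeqIO.write() call instead of one call per read.
            batch = [join_pair(f, r) for f, r in islice(pairs, BATCH_SIZE)]
            if not batch:
                break
            SeqIO.write(batch, handle, "fastq")

Writing each batch with a single SeqIO.write() call keeps the per-record call overhead out of the inner loop, at the cost of holding one batch of records in memory.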

mikemc commented 1 month ago

@willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?

mikemc commented 1 month ago

If we do end up wanting to refactor this for speed, here is some R code using the Biostrings package. It uses vectorized operations for the reverse-complementing and concatenation, and works on chunks of reads rather than iterating through individual reads. My test files are only 5 read pairs, so I've set the chunk size to 2 reads for testing; in practice we'd want it as large as the instance's memory allows, probably at least 100K or 1M reads.

files <- XVector::open_input_files(c("test/test-r1.fastq", "test/test-r2.fastq"))
chunk_size <- 2

i <- 0
while (TRUE) {
  i <- i + 1
  ## Load `chunk_size` records at a time; open_input_files() tracks the file
  ## positions, so each call resumes where the previous chunk left off.
  fwd <- Biostrings::readQualityScaledDNAStringSet(files[1], nrec = chunk_size)
  rev_rc <- Biostrings::readQualityScaledDNAStringSet(files[2], nrec = chunk_size) |>
    Biostrings::reverseComplement()
  if (length(fwd) == 0L && length(rev_rc) == 0L) {
    break
  }
  ## Catches mismatched inputs, including one file ending before the other.
  stopifnot(identical(length(fwd), length(rev_rc)))
  cat("processing chunk", i, "...\n")
  ## Vectorized join: forward read, an "N" spacer, then the reverse-complemented
  ## reverse read; the spacer gets quality "!" (Phred 0).
  seq_new <- Biostrings::xscat(
    fwd,
    Biostrings::DNAString("N"),
    rev_rc
  )
  qual_new <- Biostrings::xscat(
    Biostrings::quality(fwd),
    Biostrings::PhredQuality("!"),
    Biostrings::quality(rev_rc)
  ) |>
    Biostrings::PhredQuality()
  ## Insert "joined" after the read ID (i.e. at the first space in the header).
  names(seq_new) <- names(fwd) |> stringr::str_replace(" ", " joined ")
  joined <- Biostrings::QualityScaledDNAStringSet(seq_new, qual_new)
  ## append = TRUE accumulates chunks, so delete any stale output file first.
  Biostrings::writeQualityScaledXStringSet(
    joined, "test-joined-chunk.fastq.gz",
    append = TRUE, compress = TRUE
  )
}

On a small test file of 5 read pairs, this gives identical results to the join_paired_reads() Python function.

Edit: I tried running a slightly cleaned-up version on a set of ~40M paired-end reads, processing 1M records at a time; it took ~40 minutes on my MacBook and used about 4.5 GB of memory. So it's still not speedy.

willbradshaw commented 1 month ago

> @willbradshaw is Kraken2 the only thing the pipeline needs the joined output for?

Kraken2, plus reverse-complement-sensitive single-end deduplication with Clumpify.