Closed: florencebejo closed this issue 2 years ago
Can you try to run this large file without BiocParallel, i.e. on a single core? This might work, as in R things are not shared between processes; they are copied over.
Thank you @JokingHero for the quick answer.
I used the `amplicanPipeline` function with the default `use_parallel = FALSE`. Do I need to do something else to deactivate BiocParallel?
Edit: I see in the code of `amplicanAlign` that `use_parallel = FALSE` does not completely skip the BiocParallel package, but rather sets up the evaluation of the function `makeAlignment` by `BiocParallel::bplapply` to be serial:
```r
p <- if (!use_parallel) {
  BiocParallel::SerialParam()
} else {
  BiocParallel::bpparam()
}
```
This is the "recommended" way to disable parallelism using BiocParallel. The source of the error, I suspect, comes from the alignment function, which is `Biostrings::pairwiseAlignment`. Can you confirm? Maybe you can pinpoint which line is crashing?
How big is your fastq file?
> This is the "recommended" way to disable parallelism using BiocParallel

Indeed, but I just wanted to be sure it was what you meant by "without BiocParallel".

> I suspect comes from the alignment function which is Biostrings::pairwiseAlignment, can you confirm?
From `traceback()`, I only get this:
```
5: stop(.error_bplist(res))
4: BiocParallel::bplapply(configSplit, FUN = makeAlignment, average_quality,
       min_quality, scoring_matrix, gap_opening, gap_extension,
       fastqfiles, primer_mismatch, donor_mismatch, BPPARAM = p)
3: BiocParallel::bplapply(configSplit, FUN = makeAlignment, average_quality,
       min_quality, scoring_matrix, gap_opening, gap_extension,
       fastqfiles, primer_mismatch, donor_mismatch, BPPARAM = p)
2: amplicanAlign(config = config, fastq_folder = fastq_folder, use_parallel = use_parallel,
       average_quality = average_quality, scoring_matrix = scoring_matrix,
       gap_opening = gap_opening, gap_extension = gap_extension,
       min_quality = min_quality, fastqfiles = fastqfiles, primer_mismatch = primer_mismatch,
       donor_mismatch = donor_mismatch)
```
I am not sure how to get more information about it...
> How big is your fastq file?
The fastq files are around 6 GB (~69 million lines) each. As I am running the pipeline for someone else, I do not have all the technical details about the files (how they were generated, etc.), but I could get more info if needed.
It will take me quite some time to find time to look at this. Maybe, if you find the time, you could debug down to the lowest function that errors, or even create a pull request with the fix?
I will do my best to get as much information as possible on the error. I am however not sure I will manage to fix it myself. If I do, I will of course create a pull request but I would appreciate any help you or others can provide with this issue.
I replaced `BiocParallel::bplapply` with the base R `lapply` to get the full traceback for the error (see below), and it confirms my initial suspicion: it tries to create an R matrix that is larger than what R can handle within this data type (i.e., more than 2^31 - 1 ≈ 2.147e+9 elements).
```
8: asMethod(object)
7: methods::as(quality(reads), "matrix")
6: matrixStats::rowMins(methods::as(quality(reads), "matrix"), na.rm = TRUE) at helpers_filters.R#153
5: goodBaseQuality(fwdT, min = min_quality) at helpers_alignment.R#167
4: FUN(X[[i]], ...)
3: lapply(configSplit, FUN = makeAlignment, average_quality, min_quality,
       scoring_matrix, gap_opening, gap_extension, fastqfiles, primer_mismatch,
       donor_mismatch)
2: lapply(configSplit, FUN = makeAlignment, average_quality, min_quality,
       scoring_matrix, gap_opening, gap_extension, fastqfiles, primer_mismatch,
       donor_mismatch) at #57
1: my_amplicanAlign(config_path, fastq_folder)
```
I have added batch processing for the fastq filtering, which is the line that breaks for you. Can you check whether it works?
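For context, batched fastq filtering can be sketched with `ShortRead::FastqStreamer` (illustrative only, not the actual amplican implementation; `fastq_file` and `min_quality` are placeholder names):

```r
library(ShortRead)
library(matrixStats)

# Stream the fastq in chunks so that each per-chunk quality matrix
# stays well below the 2^31 - 1 element limit
strm <- FastqStreamer(fastq_file, n = 1e6)
repeat {
  reads <- yield(strm)
  if (length(reads) == 0) break
  qmat <- as(quality(reads), "matrix")   # at most 1e6 x read-length
  keep <- rowMins(qmat, na.rm = TRUE) >= min_quality
  # ... process reads[keep] ...
}
close(strm)
```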
Sorry, I was writing the pull request when you sent your code... 😅 I tested it, and your fix seems to work as well. I took a similar approach to yours (processing batches), but your code is much nicer and more R-like than mine. However, running your code took much longer than mine. I think that making the batch size closer to its maximum allowed size makes it much faster than using a standard batch size of 10^6.
R can handle up to ~2.14e+9 items in a matrix, so for 151 bp reads the batch size could be as large as 1.4172185e+7.
It might be nice to make that the default, e.g. `if (is.null(batch_size)) { # calculate optimum batch_size }`.
If you agree, you can push your code, and I can adapt it in that way. (I'll do my best to write nice and R-like code.)
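A minimal sketch of that default (hypothetical, not amplican code; `read_length` is an assumed input):

```r
# Hypothetical default: pick the largest batch whose quality matrix
# still fits in a standard R matrix (2^31 - 1 elements)
if (is.null(batch_size)) {
  batch_size <- floor(.Machine$integer.max / read_length)
  # e.g. for 151 bp reads: floor(2147483647 / 151) = 14221746
}
```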
There is a `batch_size` argument that you can adjust in the `amplicanPipeline` function. I will reject your pull request and push my changes, as they are tested with devtools. However, this will be available only in the devel branch of amplican, and only in a couple of days, when the changes propagate through Bioconductor. For now you can use the version you installed from the .tar, or use GitHub to install the latest package with `devtools::install_github`.
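For example (the repository path is an assumption, not confirmed in this thread):

```r
# install.packages("devtools")  # if devtools is not yet installed
devtools::install_github("valenlab/amplican")
```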
Thanks for improving ampliCan!
Thank you so much for looking into my issue right away, for helping me debug it in such a friendly way and, of course, for making ampliCan available in the first place, improving and maintaining it!
Hello,
The alignment part of ampliCan's `amplicanPipeline` works fine for most files, but for larger fastq files I get the following error (on Ubuntu 20.04.3 LTS with R v4.1.1). I think this StackOverflow post could help solve it, but I might be missing something...