Long vector error for large fastq files

florencebejo commented 2 years ago

Hello,

AmpliCon's alignment part of amplicanPipeline works fine for most files but for larger fastq files, I get the following error (on Ubuntu 20.04.3 LTS with R v4.1.1):

Aligning reads for 5a
Error: BiocParallel errors
  element index: 1, 2, 3
  first error: long vectors not supported yet: memory.c:3846

I think this StackOverflow post could help solve it, but might be missing something...

JokingHero commented 2 years ago

Can you try to run this large file without BiocParallel, run it on single core? This might work as in R things are not shared between processes, they are copied over.

florencebejo commented 2 years ago

Thank you @JokingHero for the quick answer. I used the amplicanPipeline function with the default use_parallel = FALSE. Do I need to do something else to deactivate BiocParallel?

Edit: I see in the code of amplicanAlign that use_parallel = FALSE does not completely skip the BiocParallel package but sets up the evaluation of the function makeAlignment by BiocParallel::bplapply to be serial.

p <- if (!use_parallel) {
        BiocParallel::SerialParam()
    }
    else {
        BiocParallel::bpparam()
    }

JokingHero commented 2 years ago

This is the "recommended" way to disable parallelism using BiocParallel, but the source of the error I suspect comes from the alignment function which is Biostrings::pairwiseAlignment, can you confirm? Maybe you can pinpoint which line is crashing?

How big is your fastq file?

florencebejo commented 2 years ago

This is the "recommended" way to disable parallelism using BiocParallel

Indeed, but I just wanted to be sure it was what you meant with "without BiocParallel".

I suspect comes from the alignment function which is Biostrings::pairwiseAlignment, can you confirm?

From traceback(), I only get this:

5: stop(.error_bplist(res))
4: BiocParallel::bplapply(configSplit, FUN = makeAlignment, average_quality, 
       min_quality, scoring_matrix, gap_opening, gap_extension, 
       fastqfiles, primer_mismatch, donor_mismatch, BPPARAM = p)
3: BiocParallel::bplapply(configSplit, FUN = makeAlignment, average_quality, 
       min_quality, scoring_matrix, gap_opening, gap_extension, 
       fastqfiles, primer_mismatch, donor_mismatch, BPPARAM = p)
2: amplicanAlign(config = config, fastq_folder = fastq_folder, use_parallel = use_parallel, 
       average_quality = average_quality, scoring_matrix = scoring_matrix, 
       gap_opening = gap_opening, gap_extension = gap_extension, 
       min_quality = min_quality, fastqfiles = fastqfiles, primer_mismatch = primer_mismatch, 
       donor_mismatch = donor_mismatch)

I am not sure how to get more information about it...

How big is your fastq file?

The fastq files are around 6 Gb (~69 million rows). As I am running the pipeline for someone else, I do not have all the technical details about the files (how they were generated, ...) but I could get more info if needed.

JokingHero commented 2 years ago

It will take quite some time for me to find time to look at this. Maybe if you find the time you could try to debug to the lowest function that errors or even create pull request with the fix?

florencebejo commented 2 years ago

I will do my best to get as much information as possible on the error. I am however not sure I will manage to fix it myself. If I do, I will of course create a pull request but I would appreciate any help you or others can provide with this issue.

florencebejo commented 2 years ago

I replaced BiocParallel::bplapply by the base R lapply to get the full traceback for the error (see below) and it confirms my initial suspicion: it tries to create a R matrix that is larger than what R can handle within this data type (i.e., ~2.147e+9 elements).

8: asMethod(object)
7: methods::as(quality(reads), "matrix")
6: matrixStats::rowMins(methods::as(quality(reads), "matrix"), na.rm = TRUE) at helpers_filters.R#153
5: goodBaseQuality(fwdT, min = min_quality) at helpers_alignment.R#167
4: FUN(X[[i]], ...)
3: lapply(configSplit, FUN = makeAlignment, average_quality, min_quality, 
       scoring_matrix, gap_opening, gap_extension, fastqfiles, primer_mismatch, 
       donor_mismatch)
2: lapply(configSplit, FUN = makeAlignment, average_quality, min_quality, 
       scoring_matrix, gap_opening, gap_extension, fastqfiles, primer_mismatch, 
       donor_mismatch) at #57
1: my_amplicanAlign(config_path, fastq_folder)

JokingHero commented 2 years ago

amplican_1.15.1.tar.gz

I have added batch processing for fastq filtering, which is the line that breaks for you. Can you check whether it works?

florencebejo commented 2 years ago

Sorry, I was writing the pull request when you sent your code... 😅 Tested it and your fix seems to work as well. I took a similar approach to yours (by processing batches) but your code is much nicer and R-like than mine. However, running your code took much longer than mine. I think that making the batch size closer to its maximum allowed size makes it much faster than using a standard batch size of 10^6.

R can handle more then 2.14e+9 items in a matrix, so if for 151 bp reads, batch size could be 1.4172185e+7. It might be nice to make that the default, like if(is.null(batch_size){# calculate optimum batch_size}.

If you agree, you can push your code, and I can adapt it in that way. (I'll do my best to write nice and R-like code.)

JokingHero commented 2 years ago

there is batch_size argument that you can adjust in amplicanPipeline function, I will reject your pull request and push my changes as they are tested with devtools. However this will be available only in the devel branch of the amplican and in couple days, when changes propagate through Bioconductor. For now you can use the version you installed with the .tar or use the github to install latest package with devtools:::github_install

Thanks for improving ampliCan!

florencebejo commented 2 years ago

Thank you so much for looking into my issue right away, for helping me debug it in such a friendly way and, of course, for making ampliCan available in the first place, improving and maintaining it!

valenlab / amplican

Long vector error for large fastq files #11