valenlab / amplican

9 stars 4 forks source link

C stack limit #5

Closed rspreafico-vir closed 5 years ago

rspreafico-vir commented 5 years ago

Hi there, I am running amplican on a machine with more than 120 Gb of RAM on 11 samples, each being oversequenced with 500,000 PE reads/sample. A good fraction of reads get filtered out though. Still, I get this error message:

Error: C stack usage  13965658 is too close to the limit

If I am even stricter with quality filters, killing more reads, I don't get the error message. While I could downsample reads before feeding to amplican, I noticed CrispRVariants does not show similar memory issues, so this is something that could be possibly fixed?

Thank you

JokingHero commented 5 years ago

Why not try to set C stack limit to unlimited? ulimit unlimited

You can check current C stack limit with: ulimit -s

Also, CirispRVariants is not performing any alignments as far as I know, so this step is skipped, while we do NW alignments optimized for CRISPR expeirments, and this is the step that is usually requiring a lot of memory. 500k PE reads per sample seems like a big overkill, if this is for one target locus ^^

rspreafico-vir commented 5 years ago

It definitely is an overkill - not controlling the experiment unfortunately. Yes, CrispRVariants does not perform alignments, but was wondering whether there is an option to perform alignments in batches of sequences. Even without hitting the C stack limit, memory consumption is pretty high. I will try to set the ulimit as you suggested - thanks for the advice!

rspreafico-vir commented 5 years ago

Still getting an error even after setting ulimit to unlimited. In general, given that alignments are pairwise and therefore independent from each other, memory consumption seem really high. CRISPResso2 seems to use a similar strategy for alignment to ampliCan but the memory load is only a fraction.

Error: C stack usage  16377819 is too close to the limit
JokingHero commented 5 years ago

It's also possible to use bam files in amplican after alignments using bwa or any other aligner. This is not incorporated into main pipeline, as I only used this for some specific purposes and didn't had time to test whether everything will work fine downstream. Anyway, if you want to try you can do something like:

  1. runnning EMBOSS needle for NW alignments and then load events using ?amplican::pairToEvents but I dont have any example code for this.

  2. Alignment with bwa and then make sure you have extended CIGARS e.g. Align: bwa mem hg38 Sforward.fastq.gz reverse.fastq.gz > alignment.sam Make dictionary of the genome using picard: java -jar picard.jar CreateSequenceDictionary R=hg38.fa O=hg38.dict Make extended cigar using jvarkit: java -jar jvarkit/dist/samfixcigar.jar -r alignment.sam -o alignment_with_extcig.sam Make SAM to BAM and then, Transform SAM alignments into events:

    
    library(data.table)
    library(GenomicAlignments)
    library(GenomicRanges)
    library(Rsamtools)

bam_file <- "alignment_with_extcig.bam"

param <- ScanBamParam(flag = scanBamFlag(isDuplicate = NA, isSecondaryAlignment = FALSE, isProperPair = TRUE), tag = "MD", what = c("mapq", "seq", "pos")) aln <- readGAlignmentPairs(bam_file, param = param, strandMode = 1) aln <- unlist(aln, use.names = FALSE) # smart unlist, every second is reverse aln <- as.data.table(aln) aln$read_id <- rep(seq_len(length(aln)/2), each = 2) aln$Counts <- 1

check cigars

cot <- cigarOpTable(aln$cigar) message("CIGAR table: ") print(colSums(cot)/nrow(cot)) message("Top 5 freq CIGARS: ") print(sort(table(aln$cigar), decreasing=TRUE)[1:5])

config <- fread("amplican_config.csv") aln$ref <- config$amplicon[match(aln$seqnames, config$ID)] events <- amplican::cigarsToEvents(aln$cigar, aln$pos, aln$seq, aln$ref, aln$read_id, aln$mapq, aln$seqnames, aln$strand, aln$Counts)

example

amplican::cigarsToEvents( cigars = c("3M1I3M1D5M", "2M1X1M2I5M3D5M4I4M1X6M1D1M1X", "2M3X2M"), aln_pos_start = c(5, 1, 1), query_seq = c("ACTAGAATGGCT", "AGGGTAAAGTCCATGGCCCCAATTTGTGTGTAG", "AATCGAA"), ref = c("CCATACTGAACTGACTAAC", "AGTGAAGTCAAACATGGAATTAGTGTGTTAA", "AAAAAAA"), read_id = c(1, 2, 3), mapq = c(20, 60, 100), seqnames = c("t1", "t2", "t3"), strands = c("+", "+", "+"), counts = c(1, 1, 1))



And then you should be able to pipe it into the rest of amplicanPipeline function, but generally this I have not had time to properly test and might be prove to be a bit complicated for you, but there is nothing more to offer, I can't improve/change our main alignment just for your case at this time.

Thanks for bringing CRISPResso2 into my radar. It seems they can completely ignore my benchmarks as well as CRISPRvariant benchmarks and still have 0 tests whatsoever, but hey, they made new plot so its time to publish in that nat biotech. Quoting "Our modification to the algorithm incorporates the gap incentive vector GI, and also removes the possibility of following a gap in one sequence with a gap in the next sequence, which is the biological equivalent of observing an insertion and a deletion at the same basepair." - this can't be true, after deletion created by nuclease there is repair which can result in insertions, therefore there can be deletion covered by insertion, which from the aligner perspective would look exactly like gap in one sequence followed by gap in the other.
rspreafico-vir commented 5 years ago

I'd like to stay away from BWA as one point well taken from the ampliCan paper is that an alignment strategy optimized for expectations of CRISPR-driven mismatches between subject and query is better than a general-purpose aligner, so I am hoping to use ampliCan as-is even for slightly larger dataset - if there is a way that memory consumption can be kept at bay.

Re CRISPResso2, if you already have a benchmarking platform in place, I would be curious to see (preview?) whether accuracy improved and is now closer to ampliCan/CrispRVariants, or whether ampliCan has a clear accuracy edge over CRISPResso2. One specific point I guess of interest to many is the ability to handle larger indels (> 10nt) that seem to occur at a higher frequency than previously appreciated. If it helps, I showed results from ampliCan to my fellows and they were impressed.

JokingHero commented 5 years ago

I will check whether I can implement some batching, when there is so many reads, but its possibile it is not the alignement problem but some other step. I can't promise it will be solved anytime soon.

As for benchmarks, this will also probably happen in the future, but not very soon. I will message you when its in place.

rspreafico-vir commented 5 years ago

Appreciate the help, and understand that these changes cannot be fast. Thanks so much!

JokingHero commented 5 years ago

Have you tried running amplican with 1 core after "ulimit unlimited"? In R for every core you have to pay with huge memory usage as everything is copied over, maybe this is part of the problem, currently I have problem in reproducing this error.

rspreafico-vir commented 5 years ago

Good suggestion, thanks, will look into this!

JokingHero commented 5 years ago

I just have benchmarked CRISPResso2 on one of my benchmarks, see main github page https://github.com/valenlab/amplican, I have put the plot there.

Have you tested with 1 core?

rspreafico-vir commented 5 years ago

Thanks for updating the benchmarks, that is useful! I haven't had the chance to run with 1 core yet (waiting for a new dataset to come in as I am tight on time to just rerun previously analyzed datasets). I'd say we can close the Github issue though as that will be the way forward, and I will keep it in mind. However, if you are planning optimizations for memory consumption and/or speed, please feel free to post back here. I will be happy to get notified. Thank you!