Closed cnk113 closed 2 years ago
Please could you provide the command you're running and the Octopus version you're using?
octopus -C cell -R /media/chang/HDD-1/reference_cellranger/refdata-cellranger-arc-GRCh38-mt-2020-A-2.0.0/fasta/genome.fa -I subset_tag3.bam -o cells.vcf --sequence-error-model PCR.NOVASEQ
octopus v0.7.4 (develop dd4bdc7e)
Could you add --debug to your command and post the resulting log file?
Yeah, I'll let it run with debug; it gets killed automatically when it runs out of memory.
Can you confirm this file has just 141 lines and hasn't been truncated? If so, this will be difficult to diagnose without the raw data - it looks like the run is failing before there's even a chance to write to the log. Is it possible for you to share the BAM file?
Yeah, the BAM file is 92 MB; I could share it on Google Drive with you. I'm assuming the email on this page is fine: https://www.imm.ox.ac.uk/people/daniel-cooke ?
Great - yes that email is fine. Thanks.
Shared it. I have a feeling it may be due to the number of RGs I have per SM. If so I can just collapse them all per cell.
You have 6771 samples in the BAM (not 101):
$ samtools view -H subset_tag3.bam | grep '^@RG' | awk '{print substr($3,4);}' | sort | uniq | wc -l
6771
Initially there are that many cells but I've been subsetting the bam file. I'm assuming I need to subset the RG/SM as well?
Yes - Octopus assumes all samples in the input BAMs (from the SM tags) are to be used for calling unless told otherwise (even if there are no reads for a sample). You can specify a subset of samples to be called with the --sample or --sample-file options.
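For reference, a minimal sketch of building a sample list from the @RG SM tags and passing it via --sample-file (file names here are hypothetical; with a real BAM the header would come from samtools view -H):

```shell
# Fake a two-read-group BAM header with printf so this pipeline runs as-is;
# with a real BAM you'd use: samtools view -H subset_tag3.bam | grep '^@RG'
printf '@RG\tID:r1\tSM:cellA\n@RG\tID:r2\tSM:cellB\n' \
  | awk '{print substr($3,4);}' | sort -u > cells.txt
cat cells.txt   # one SM per line: cellA, cellB

# Then restrict calling to just these samples (not run here):
# octopus -C cell -R genome.fa -I subset_tag3.bam --sample-file cells.txt -o cells.vcf
```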
It turns out the memory issue you observed is due to the input read profiler, which by default tries to sample reads for every sample. I think this can probably be considered a bug, but it turns out it's easy enough to bypass with the --use-same-read-profile-for-all-samples option.
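So, as a workaround, your original command with that flag added would look like this (paths as in your first post):

```shell
octopus -C cell \
    -R /media/chang/HDD-1/reference_cellranger/refdata-cellranger-arc-GRCh38-mt-2020-A-2.0.0/fasta/genome.fa \
    -I subset_tag3.bam -o cells.vcf \
    --sequence-error-model PCR.NOVASEQ \
    --use-same-read-profile-for-all-samples
```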
I see - would you expect accuracy issues from using the same read profile for all samples, in the context of the cell caller? In the meantime I'll try running both. I'm seeing ~200 GB with my 101 cells after removing the extra SM tags but without the shared read profile, which is more doable. Initialization also took 5h without the shared read profile vs. 2 minutes with it.
I finished running on the 101 cells; it took ~3 days and ~70-80 GB of memory. However, I noticed it was always using 1 core, even though I didn't set any thread limit. Is this normal?
Also, I'm trying to assign the somatic variants to cells; since Octopus does realignment/custom duplicate marking, would it be possible to add an additional field linking each somatic variant to its SM? EDIT: Ah, never mind, I can probably use the realigned BAM.
You can set the number of threads to use with the --threads option.
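For example (a sketch; I believe --threads given without an argument lets Octopus choose a concurrency level itself rather than defaulting to one core):

```shell
# Use 32 worker threads explicitly:
octopus -C cell -R genome.fa -I subset_tag3.bam -o cells.vcf --threads 32
```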
Octopus reports somatic variants and sample genotypes in the output VCF - there's no need to look at any BAMs.
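For instance, a sketch of pulling per-cell genotypes for each called variant straight from the VCF, using bcftools (assuming it's installed; cells.vcf as in your command):

```shell
# List the sample (cell) columns in the VCF:
bcftools query -l cells.vcf

# One line per variant, with every cell's genotype alongside it:
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' cells.vcf
```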
Ah, I assumed not setting a thread limit allowed it to use the max number of threads on the system. Just to clarify, is the output VCF in the same sample order as the header in the BAM? Or, if supplying a --sample-file, does it follow the order in that file?
Also, on the question of realignment: if I pass in the --sample-file, will it realign reads from cells not listed in the file? I ask since I have many more cells than can feasibly be input into Octopus, so I'd rather just intersect with the somatic variants detected in the subset.
Actually, I get an error when running multi-threaded - debug log attached (octopus_debug.log).
[2021-05-31 13:09:18] <INFO> Invoked calling model: cell
[2021-05-31 13:09:18] <INFO> Processing 3,099,750,718bp with 32 threads (128 cores detected)
[2021-05-31 13:09:18] <INFO> Writing filtered calls to "/media/chang/HDD-6/chang/mg/atac/Adult_MG1_new_cr/outs/cells2.vcf"
[2021-05-31 13:09:18] <WARN> Running in parallel mode can make debug log difficult to interpret
[2021-05-31 13:15:05] <INFO> ------------------------------------------------------------------------
[2021-05-31 13:15:05] <INFO>  current  |           |  time  | estimated
[2021-05-31 13:15:05] <INFO>  position | completed |  taken |    ttc
[2021-05-31 13:15:05] <INFO> ------------------------------------------------------------------------
[2021-05-31 13:15:14] <EROR> Encountered a problem whilst calling chr1:959605-1399541(_Map_base::at)
[2021-05-31 13:15:18] <EROR> Encountered a problem whilst calling chr1:1399541-1881024(_Map_base::at)
[2021-05-31 13:15:22] <EROR> Encountered a problem whilst calling chr1:1881024-2227917(_Map_base::at)
[2021-05-31 13:15:26] <EROR> Encountered a problem whilst calling chr1:2227917-2586149(_Map_base::at)
[2021-05-31 13:15:29] <EROR> Encountered a problem whilst calling chr1:2586149-3972554(_Map_base::at)
[2021-05-31 13:15:33] <EROR> Encountered a problem whilst calling chr1:3972554-6591450(_Map_base::at)
[2021-05-31 13:15:37] <EROR> Encountered a problem whilst calling chr1:6591450-7953878(_Map_base::at)
[2021-05-31 13:15:40] <EROR> Encountered a problem whilst calling chr1:8864160-9473229(_Map_base::at)
[2021-05-31 13:15:44] <EROR> Encountered a problem whilst calling chr1:7953878-8864160(_Map_base::at)
[2021-05-31 13:15:48] <EROR> Encountered a problem whilst calling chr1:9473229-9950723(_Map_base::at)
[2021-05-31 13:15:51] <EROR> Encountered a problem whilst calling chr1:9950723-11157971(_Map_base::at)
[2021-05-31 13:15:55] <EROR> Encountered a problem whilst calling chr1:11157971-11980261(_Map_base::at)
[2021-05-31 13:15:59] <EROR> Encountered a problem whilst calling chr1:11980261-12406567(_Map_base::at)
[2021-05-31 13:16:03] <EROR> Encountered a problem whilst calling chr1:12406567-15128658(_Map_base::at)
[2021-05-31 13:16:06] <EROR> Encountered a problem whilst calling chr1:15128658-15524591(_Map_base::at)
[2021-05-31 13:16:10] <EROR> Encountered a problem whilst calling chr1:15524591-15986140(_Map_base::at)
[2021-05-31 13:16:14] <EROR> Encountered a problem whilst calling chr1:15986140-16594728(_Map_base::at)
[2021-05-31 13:16:18] <EROR> Encountered a problem whilst calling chr1:16594728-16893929(_Map_base::at)
[2021-05-31 13:16:22] <EROR> Encountered a problem whilst calling chr1:16893929-17415620(_Map_base::at)
[2021-05-31 13:16:25] <EROR> Encountered a problem whilst calling chr1:17415620-18214604(_Map_base::at)
[2021-05-31 13:16:29] <EROR> Encountered a problem whilst calling chr1:18214604-19436889(_Map_base::at)
[2021-05-31 13:16:33] <EROR> Encountered a problem whilst calling chr1:19436889-20060076(_Map_base::at)
[2021-05-31 13:16:37] <EROR> Encountered a problem whilst calling chr1:20060076-20786685(_Map_base::at)
[2021-05-31 13:16:40] <EROR> Encountered a problem whilst calling chr1:20786685-22052640(_Map_base::at)
[2021-05-31 13:16:44] <EROR> Encountered a problem whilst calling chr1:22052640-22980491(_Map_base::at)
[2021-05-31 13:16:48] <EROR> Encountered a problem whilst calling chr1:22980491-23728566(_Map_base::at)
[2021-05-31 13:16:52] <EROR> Encountered a problem whilst calling chr1:23728566-24415730(_Map_base::at)
[2021-05-31 13:16:55] <EROR> Encountered a problem whilst calling chr1:24415730-25022851(_Map_base::at)
[2021-05-31 13:16:59] <EROR> Encountered a problem whilst calling chr1:25022851-25699690(_Map_base::at)
[2021-05-31 13:17:03] <EROR> Encountered a problem whilst calling chr1:25699690-26275038(_Map_base::at)
[2021-05-31 13:17:06] <EROR> Encountered a problem whilst calling chr1:26275038-26690351(_Map_base::at)
[2021-05-31 13:17:10] <EROR> Encountered a problem whilst calling chr1:0-959605(_Map_base::at)
The new problem should be fixed in commit 815f4f0515bcb86d6f61fc21a24986fcdfadb9e4.
Thanks, it works now, although I had to set both the cache and buffer to half their defaults to parallelize on 183 cells with 32 threads; it's currently using ~500 GB of memory.
Just to reiterate, the genotype information is sorted by sample name? And separately, does the read realigner apply to all reads, regardless of sample/cell?
> Thanks it works now, although I had to set both cache and buffer to half their defaults to parallelize on 183 cells w/ 32 threads, it's currently using ~500 gb of memory.
This is certainly a performance bug - it shouldn't be using anywhere near that amount of memory. I'll try to look into that when I have a chance.
> Just to reiterate the genotype information is sorted by sample name?
Yes - you can see this in the VCF header.
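You can even check this programmatically: the genotype columns in every record follow the #CHROM header line, whose sample columns are everything after FORMAT (i.e. fields 10 onwards). A minimal sketch (the header line here is made up for illustration; the real one comes from e.g. grep -m1 '^#CHROM' cells.vcf):

```shell
# Fake the tab-separated #CHROM header line of a VCF with three cells:
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tcellA\tcellB\tcellC\n' > header.txt

# Fields 10 onwards are the sample columns, in genotype-column order:
cut -f10- header.txt
```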
> And separately, the read realigner applies to all reads, regardless of sample/cell?
The --bamout feature can currently only be used with single-sample BAM files - you should see a warning if you try to use it with multi-sample BAMs.
I've been trying to get the cell caller working on an scATAC-seq dataset, but I've been running into memory limitations on my end. I tried to run 101 cells with a minimum of 500k read pairs per cell, but I'm running out of memory on the server. My node has 1 TB of memory.