harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

sambamba error #165

Closed: jasonwjohns closed this issue 3 months ago

jasonwjohns commented 3 months ago

Hi guys,

Sorry to be back so soon. After successfully getting the workflow to run on one very low-coverage sample this morning, I ran it again on two samples. These samples are a more comparable size (~3 GB fastq.gz files) to what I will be running in my upcoming job. The workflow ran great all the way to sambamba markdup, where I got the error below for each file.

This is essentially a wild guess, but maybe it has something to do with the fact that I manually entered placeholder SRR values ('1' and '2') in the 'Run' column of the sample sheet? I may be grasping at straws there.

[Thu Mar  7 14:49:18 2024]
Error in rule dedup:
    jobid: 13
    input: results/dmAquExim1.NCBI.p_ctg.fasta/bams/preMerge/Aj.HMR.176/2.bam, results/dmAquExim1.NCBI.p_ctg.fasta/bams/preMerge/Aj.HMR.176/2.bam.bai
    output: results/dmAquExim1.NCBI.p_ctg.fasta/bams/Aj.HMR.176_final.bam, results/dmAquExim1.NCBI.p_ctg.fasta/bams/Aj.HMR.176_final.bam.bai
    log: logs/dmAquExim1.NCBI.p_ctg.fasta/sambamba_dedup/Aj.HMR.176.txt (check log file(s) for error details)
    conda-env: /Users/johns/test2/.snakemake/conda/58112a9020cd88eaf08ff65323aecdd4_
    shell:
        sambamba markdup -t 1 results/dmAquExim1.NCBI.p_ctg.fasta/bams/preMerge/Aj.HMR.176/2.bam results/dmAquExim1.NCBI.p_ctg.fasta/bams/Aj.HMR.176_final.bam 2> logs/dmAquExim1.NCBI.p_ctg.fasta/sambamba_dedup/Aj.HMR.176.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Thu Mar  7 14:50:07 2024]
Error in rule dedup:
    jobid: 9
    input: results/dmAquExim1.NCBI.p_ctg.fasta/bams/preMerge/Aj.CL.39/1.bam, results/dmAquExim1.NCBI.p_ctg.fasta/bams/preMerge/Aj.CL.39/1.bam.bai
    output: results/dmAquExim1.NCBI.p_ctg.fasta/bams/Aj.CL.39_final.bam, results/dmAquExim1.NCBI.p_ctg.fasta/bams/Aj.CL.39_final.bam.bai
    log: logs/dmAquExim1.NCBI.p_ctg.fasta/sambamba_dedup/Aj.CL.39.txt (check log file(s) for error details)
    conda-env: /Users/johns/test2/.snakemake/conda/58112a9020cd88eaf08ff65323aecdd4_
    shell:
        sambamba markdup -t 1 results/dmAquExim1.NCBI.p_ctg.fasta/bams/preMerge/Aj.CL.39/1.bam results/dmAquExim1.NCBI.p_ctg.fasta/bams/Aj.CL.39_final.bam 2> logs/dmAquExim1.NCBI.p_ctg.fasta/sambamba_dedup/Aj.CL.39.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-03-07T143624.867501.snakemake.log
sambamba 0.8.0
 by Artem Tarasov and Pjotr Prins (C) 2012-2020
    LDC 1.19.0 / DMD v2.089.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
sambamba-markdup: Cannot open file `/var/folders/xn/7zzgchrs3n749xgsx_qxz6vm0000gt/T/sambamba-pid17229-markdup-jmqr/PairedEndsInfocopa2' in mode `w+' (Too many open files)

Any ideas?

Thank you! Jason

cademirch commented 3 months ago

This is a new one for me. A quick search suggests that adding the CLI option --overflow-list-size 600000 could help.

Could you try this? You would need to add this here: https://github.com/harvardinformatics/snpArcher/blob/f0fe7966745700e78943dc59d50d35402cbdb531/workflow/rules/fastq2bam.smk#L58
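Concretely, the shell directive of the dedup rule would become something like this (a sketch; the rest of the rule stays the same):

shell:
        "sambamba markdup --overflow-list-size 600000 -t {threads} {input.bam} {output.dedupBam} 2> {log}"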

If this works, you can open a PR, or let me know and I can add it.

jasonwjohns commented 3 months ago

Hi Cade,

I tried setting --overflow-list-size to 600000 and hit the same issue, so I then tried changing it to 10,000,000, almost as a dare to the computer, and it worked! That may be overkill, but it allowed the markdup job to run. It's probably best that you make the change to the .smk file so I don't do anything inadvertent. I used:

shell:
        "sambamba markdup --overflow-list-size 10000000 -t {threads} {input.bam} {output.dedupBam} 2> {log}"

I thought I'd wait to give you the update until the job had finished, in case anything else failed, but HaplotypeCaller has been running for ~15 hours now and I'm not sure how much longer it will take. Not that I have any real basis to be surprised, since I don't know the ins and outs of the workflow, but I was surprised that the genome was split into only 9 GVCF intervals when it was split into 50+ intervals for the alignment step. Does that sound about right to you?

The computer I'm running it on has 192 GB of memory and 24 cores (16 performance + 8 efficiency); I allocated 23 cores to the job. My two input fastq.gz files were ~3 GB each, and the genome size is ~350 Mb. The reference genome I used is scaffold-level, so maybe that was part of it, or maybe 9 intervals would be expected in this situation. Really appreciate your time and expertise!

cademirch commented 3 months ago

Wow, glad it worked. I'm curious how big the BAMs going into dedup are. Is this public data? If so, could you share it? I'd be interested to see if I can replicate this. Your config.yaml and sample.csv might also be useful.

50+ intervals for the alignment step does not sound right to me, as we do not scatter the alignment step over intervals. I'm not sure how the alignment step would get split over intervals, so if you share the dry-run summary of your workflow, that might help me understand what's going on.

9 GVCF intervals does sound in the right range, though this depends heavily on the contiguity of your reference genome and the config setting min_nmer.
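For reference, that setting lives in your config file (config/config.yaml in a default snpArcher checkout); a minimal sketch, with the value shown purely as an illustration rather than a recommendation:

# config/config.yaml (excerpt; illustrative value only)
# Controls how the reference is split into GVCF intervals; see the snpArcher
# paper/docs for the exact behavior.
min_nmer: 500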

jasonwjohns commented 3 months ago

Unfortunately the resequencing data aren't yet public, but I've uploaded the fastq.gz files for the two samples I used plus the reference to a Drive folder. The reference is what was uploaded to NCBI from the CCGP. I also uploaded the config.yaml and sample.csv, which I called test.csv. I dropped the logs folder in there as well, in its current state.

I may have misunderstood the term 'interval' for the alignment step. What I was referring to was the number of pre-merged BAMs. I thought the pre-merged BAMs for each individual were the result of splitting the genome into intervals, but it sounds like that's not right. Are the intervals just for the HaplotypeCaller step?

Not sure how large the BAMs were going into dedup. My impression is that the pre-dedup BAMs are removed and replaced with _final.bam, so I couldn't go in and check? The _final.bam files are ~6.5 GB, if that's helpful.

I also added a .txt file called dry_run.txt to the Drive folder, where I just copied and pasted the output from a dry run. Let me know if you'd like to see it a different way.

Thanks Cade!

cademirch commented 3 months ago

Great, I'll see if I can replicate this. Thanks for uploading that.

As for the intervals - yes, intervals are only used for the BAM->GVCF->VCF steps. We have a detailed description of how we generate intervals in the paper.

jasonwjohns commented 3 months ago

Sounds great. Thank you for doing that.

I think I get it now. As you can probably tell, a lot of this stuff just skims the top of my head as it sails right over it. I appreciate your patience and time.

cademirch commented 3 months ago

Hi Jason,

I ran your data through snpArcher on a Linux server running Ubuntu 22.04.4 and got no errors at the dedup step. I believe the issue you encountered is likely due to the ulimit -n setting (which controls the limit on open file descriptors) on macOS.

On the Ubuntu machine:

$ ulimit -n
1024

Whereas on my M2 MacBook Pro:

$ ulimit -n
256

I believe you can adjust this value with the command ulimit -n 1024, which would raise the limit to 1024 open file descriptors. However, I'm not very familiar with adjusting this setting, so I would encourage you to look into it before proceeding.
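For example, in the same shell session you launch the workflow from (the change only lasts for that session, and the system hard limit may cap how high you can go):

$ ulimit -n          # check the current soft limit on open file descriptors
256
$ ulimit -n 1024     # raise it for this shell session only
$ ulimit -n
1024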

jasonwjohns commented 3 months ago

Hi Cade,

Thanks very much for clarifying that. I got the workflow to run on the two test samples using our lab computer and it seemed to go all the way through, although it didn't produce the QC files, even though both samples used the same reference genome. Maybe there have to be more than two samples? I imagine you had the same output when you ran it.

The workflow ran in ~52 hours on our lab computer with 192 GB of memory and 23 of its 24 cores in use. When I run the workflow for real, it will be on the UCSC Hummingbird cluster with files of similar size, although the reference genomes will be different and potentially more fragmented. I have a request in with Hummingbird IT for guidance on which partition to use and how many cores and how much memory to request. I'm wondering if there are any parameters on the config.yaml end that you would adjust to get it to run faster, or is it mostly just a matter of more computing power?

Thanks!

tsackton commented 3 months ago

@jasonwjohns generally snpArcher is optimized to run many single-core jobs in parallel. I think the default number of concurrent jobs is somewhere around 1000, which you probably shouldn't need to change. You will need to change the partition you submit to, ideally to one with at least a 3-day run time.

I think the Hummingbird cluster uses the SLURM job management system; we have a profile for SLURM built into snpArcher (profiles/slurm), but if you or your IT colleagues have a Snakemake profile optimized for the Hummingbird cluster, that would probably be ideal (although you'll need to make sure that use-conda is set to True). Note that we are still working on adapting our profile system to Snakemake 8, which makes some big changes.
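For reference, a minimal pre-Snakemake-8 SLURM profile config.yaml might look roughly like the sketch below; the partition name and resource defaults are placeholders for whatever Hummingbird actually provides, and the profiles/slurm shipped with snpArcher is the better starting point.

# profiles/<your-profile>/config.yaml (sketch, illustrative values only)
use-conda: True      # required so each rule runs inside its conda environment
jobs: 1000           # maximum number of jobs submitted to SLURM at once
latency-wait: 60     # tolerate filesystem lag on shared storage
restart-times: 2     # resubmit failed jobs a couple of times
cluster: "sbatch --partition={resources.partition} --cpus-per-task={threads} --mem={resources.mem_mb} --time={resources.runtime}"
default-resources:
  - partition="your_partition_here"   # placeholder; use a Hummingbird partition
  - mem_mb=4000
  - runtime=4320                      # minutes (3 days)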

cademirch commented 3 months ago

Hi Jason,

That's odd - I just checked and you're right - my run did not produce the QC HTML. I will look into this.

EDIT: The number of samples has to be greater than two; I misspoke on that earlier. In any case, if you want to run the QC module now that the workflow has completed, you can do so like this:

snakemake -s snpArcher/workflow/modules/qc/Snakefile
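If you want to preview it first, Snakemake's -n flag gives a dry run (add whatever conda/core flags you normally use), e.g.:

snakemake -s snpArcher/workflow/modules/qc/Snakefile -n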

Just check that the dry run makes sense; it should look something like this:

Building DAG of jobs...
Job stats:
job                     count
--------------------  -------
admixture                   1
all                         1
check_fai                   1
plink                       1
qc_plots                    1
subsample_snps              1
vcftools_individuals        1
total                       7

As for Hummingbird and settings there, let me get back to you. I'm currently working on refactoring some of the code to clean up how we specify computational resources. I'll also be updating the docs about this, as well as about using SLURM (which Hummingbird uses).

jasonwjohns commented 3 months ago

Hi Tim and Cade,

This is all great. I so appreciate all of your personalized help, not to mention all the work you did to write the workflow. Cheers to you!

Jason