My only suggestion without knowing more is to either reduce the chunksize or change the backend to cython... 200 GB of memory and running for so long is also way too much (how many reads are there?). Normally the symptoms are different, but perhaps you have a very high duplication rate?
But if you shared some more information (how do you generate the input file? what is the exact command you are running?) it might help, in case there is some other issue.
Hi Phlya,
Thanks for the quick response!
We run pairtools in a snakemake pipeline:
configfile: 'config.yaml'

localrules: all

SAMPLES = config['samples']

STAR = expand('{sample}.bam'.split(), sample = SAMPLES)
PAIR = expand('{sample}_parsed.pairsam.gz {sample}_sorted.pairsam.gz \
               {sample}_deduped.pairsam.gz {sample}_filtered.pairsam.gz \
               {sample}_output.pairs.gz'.split(), sample = SAMPLES)

rule all:
    input:
        STAR + PAIR

rule star:
    input:
        r1 = '{sample}-read-1.fastq',
        r2 = '{sample}-read-4.fastq'
    output:
        '{sample}.bam'
    shell:
        """
        module load samtools/1.8
        bwa mem -SP5M -t8 /home/wk9698/dm6_index/dm6 \
            {input.r1} {input.r2} | samtools view -bhS - > {output}
        """

rule parse:
    input:
        "{sample}.bam"
    output:
        "{sample}_parsed.pairsam.gz"
    conda:
        "microC_processing"
    shell:
        """
        module load samtools/1.8
        samtools view -h {input} | \
            pairtools parse -c /home/wk9698/dm6_index/dm6.chrom.sizes -o {output}
        """

rule sort:
    input:
        "{sample}_parsed.pairsam.gz"
    output:
        "{sample}_sorted.pairsam.gz"
    conda:
        "microC_processing"
    shell:
        """
        pairtools sort --nproc 8 --tmpdir ./tmp -o {output} {input}
        """

rule dedup:
    input:
        "{sample}_sorted.pairsam.gz"
    output:
        "{sample}_deduped.pairsam.gz"
    conda:
        "microC_processing"
    shell:
        """
        pairtools dedup --mark-dups -o {output} {input}
        """

rule select:
    input:
        "{sample}_deduped.pairsam.gz"
    output:
        "{sample}_filtered.pairsam.gz"
    conda:
        "microC_processing"
    shell:
        """
        pairtools select '(pair_type == "UU") or (pair_type == "UR") or (pair_type == "RU")' -o {output} {input}
        """

rule split:
    input:
        "{sample}_filtered.pairsam.gz"
    output:
        "{sample}_output.pairs.gz"
    conda:
        "microC_processing"
    shell:
        """
        pairtools split --output-pairs {output} {input}
        """

rule index:
    input:
        "{sample}_output.pairs.gz"
    shell:
        """
        pairix -f {input}
        """
The number of reads for the one we just tested was 647,614,798.
The pipeline worked totally fine with this same data before (last month), but now it runs forever with any input data. I tested multiple inputs from different experiments that had worked before as well; none of them work now.
Thanks again,
Wenfan
You are using a conda environment simply by name... I think that's only recently available in snakemake, interesting. I'd avoid doing it if possible; it's too easy to change something.
I can recommend trying our snakemake pipeline... It should do the same, but it's more flexible and comes with conda environments that work (at least for me). https://github.com/open2c/distiller-sm
Otherwise, at first glance the pipeline looks fine. To make dedup's job a bit easier and maybe speed it up / reduce memory consumption, you can try setting --max-mismatch to 0 or 1.
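For example, only the dedup call itself would change (just a sketch based on your dedup rule above; --max-mismatch is in base pairs):

        pairtools dedup --mark-dups --max-mismatch 1 -o {output} {input}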
Thanks, Phlya, I'll take a shot at the pipeline and the environment you provided, and see if this fixes the issue.
A bit more info about the issue when looking at the memory usage:
This is the memory usage when it worked before:
This is the memory usage now on the same job: it stays there forever and never stops.
Wondering if you have any more clues about what is going on.
Thanks,
Wenfan
Not to send you on a tangent, but could there be an issue with temporary files or writing permissions to output directories?
The writing permission is granted for the output directories.
I am trying to use the distiller-sm pipeline you provided and ran into an error message:
Using profile workflow/profiles/default and workflow specific profile workflow/profiles/default for setting default command line arguments.
Workflow defines that rule bwaindex is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
InputFunctionException in rule chunk_fastq in file /scratch/gpfs/SCHEDL/Wenfan/Working/test2/workflow/Snakefile, line 174:
Error:
IndexError: list index out of range
Wildcards:
library=2520__WK70__WK_microC_R1
run=lane1
Traceback:
File "/scratch/gpfs/SCHEDL/Wenfan/Working/test2/workflow/Snakefile", line 185, in <lambda>
File "/scratch/gpfs/SCHEDL/Wenfan/Working/test2/workflow/rules/common.smk", line 9, in needs_downloading
This is my config file:
#########################################
# THIS IS A TYPICAL config.yml TEMPLATE
# most of the settings present here
# are GO for mapping production data
# but nonetheless user must consider
# carefully every presented option
#########################################
#########################################
# When commmenting parameters out, make sure
# that each section still has at least one
# uncommented parameter, otherwise it
# will not get properly parsed.
#########################################
#######################################
# provide paths to your raw input data (fastq-s):
#######################################
# Fastqs can be provided as:
# -- a pairs of relative/absolute paths
# -- sra:<SRA_NUMBER>, optionally followed by the indices of the first and
# the last entry in the SRA in the form of "?start=<first>&end=<last>
# Alternatively, fastqs can be specified as either paths relative to the
# project folder or as absolute paths.
input:
    raw_reads_paths:
        2520__WK70__WK_microC_R1:
            lane1:
                - /scratch/gpfs/SCHEDL/Wenfan/Working/test2/2520__WK70__WK_microC_20230328-read-1.fastq.gz
        2520__WK70__WK_microC_R2:
            lane1:
                - /scratch/gpfs/SCHEDL/Wenfan/Working/test2/2520__WK70__WK_microC_20230328-read-3.fastq.gz

    library_groups:
        2520__WK70__WK_microC:
            - 2520__WK70__WK_microC_R1
            - 2520__WK70__WK_microC_R2

    truncate_fastq_reads: 0

    genome:
        assembly_name: 'dm6'
        bwa_index_wildcard_path: '/home/wk9698/dm6_index/dm6.fa*'
        chrom_sizes_path: '/home/wk9698/dm6_index/dm6.chrom.sizes'

do_fastqc: True

# Control how reads are mapped to the reference genomes.
map:
    mapper: 'bwa-mem' # available: 'bwa-mem', 'bwa-mem2', 'bwa-meme'

    # If 'chunksize' is non-zero, each input file gets split into multiple chunks,
    # each mapped separately. Useful for mapping on clusters with many
    # relatively weak nodes.
    # The optimal chunk size is defined by the balance between mapping and merging.
    # Smaller chunks (~30M) are better for clusters with many weak nodes,
    # however, having >~10 chunks per run slow down merging.
    chunksize: 30_000

    # Specify extra BWA mapping options.
    mapping_options: '-SP5M -t8'

    # Specify fastp trim options.
    # i.e. parameters
    # --detect_adapter_for_pe -q 15
    trim_options: ''

    # A more technical option, use a custom script to split fastq files from SRA
    # into two files, one per read side. By default it is true, which is
    # faster (because we can use multi-threaded compression), but less
    # stable. Set to false if you download files from SRA and bwa complains
    # about unpaired reads.
    use_custom_split: True

# Control how read alignments are converted ('parsed') into Hi-C pairs.
parse:
    # If 'make_pairsam' is True, parsed Hi-C pairs will store complete
    # alignment records in the SAM format (the resulting hybrid between the
    # .pairs and .sam formats is called '.pairsam'). Such files can be useful for
    # thorough investigation of Hi-C data. Downstream of parsing, pairsams
    # are split into .pairs and .bam, and .bam alignments are tagged with
    # Hi-C related information. 'make_pairsam' roughly doubles the storage
    # and I/O requirements and should be used only when absolutely needed.
    # NOTE: when 'make_pairsam' is False, the initial output of parsing is still
    # called '.pairsam' despite missing SAM alignments, for technical reasons.
    make_pairsam: True

    # When 'make_pairsam' is True, enabling 'drop_seq' erases sequences and
    # Phred scores from the SAM alignments in .pairsam and .bam output files.
    # Enable to make lightweight .pairsam/.bam output.
    # NOTE: when 'make_pairsam' is False, 'drop_seq' is ignored.
    drop_seq: False

    # Enable 'drop_readid' to drop readID from .pairs files to create
    # lightweight .pairs files. This would prevent one from detecting
    # optical/clustering duplicates during dedup.
    # NOTE: does not affect alignment records in the .pairsam files and
    # subsequently .bam files after .pairsam splitting.
    drop_readid: False

    # When 'keep_unparsed_bams' is True, distiller preserves the _immediate_
    # output of bwa in a .bam format. Could be used as a faster alternative
    # to 'make_pairsam' when alignments are needed, but tagging them with Hi-C
    # related information is not necessary.
    keep_unparsed_bams: True

    # Pass extra options to pairtools parse, on top of the ones specified by
    # flags 'make_pairsam', 'drop_readid', 'drop_seq'. The default value
    # enables storing MAPQ scores in the .pairsam/.pairs output, which are
    # used later for filtering/binning. The default walks-policy is 'mask'
    # which masks complex walks in long reads.
    parsing_options: '--add-columns mapq --walks-policy mask'

# Control how PCR/optical duplicates are detected in the data.
dedup:
    # PCR/optical duplicates are detected as Hi-C pairs with matching locations
    # on both sides. 'max_mismatch_bp' controls the maximal allowed mismatch in
    # mapped locations on either side for two pairs to be still considered as
    # duplicates.
    max_mismatch_bp: 1

# Control how Hi-C pairs are binned into contact maps, stored in .cool files.
bin:
    # Specify which resolutions should be included in the multi-resolution .cool file.
    # The lowest (base) resolution _must_ be the common denominator of all other
    # resolutions.
    resolutions:
        - 1000000
        - 500000
        - 250000
        - 100000
        - 50000
        - 25000
        - 10000
        - 5000
        - 2500
        - 1000
        - 500
        - 250
        - 100

    # Specify if the multi-resolution .cool output files should be balanced.
    balance: True

    # Pass additional parameters to cooler balance:
    # balance_options: ''

    # Specify additional filters applied to pairs during binning.
    # Multiple filters are allowed; for each filter, all pairs satisfying the
    # given filter expression will be binned into a separate cooler.
    # Filters are specified using the following syntax:
    # {filter_name}: '{a valid Python expression}'
    filters:
        no_filter: ''
        mapq_30: '(mapq1>=30) and (mapq2>=30)'

output:
    dirs:
        downloaded_fastqs: 'inputs/fastq/downloaded_fastqs'
        fastqc: 'results/fastqc'
        processed_fastqs: 'results/processed_fastqs'
        mapped_parsed_sorted_chunks: 'results/mapped_parsed_sorted_chunks'
        pairs_runs: 'results/pairs_runs'
        pairs_library: 'results/pairs_library'
        coolers_library: 'results/coolers_library'
        coolers_library_group: 'results/coolers_library_group'
        stats_library_group: 'results/stats_library_group'

# To use automatic upload to resgen, add your credentials to ~/.resgen/credentials
# (see https://docs-python.resgen.io/cli.html#logging-in)
resgen:
    upload: False
    user: test
    project: test
I am wondering what has gone wrong here.
Thanks,
Wenfan
For paired reads you need to provide two files in the lane, one for each side. I think that's the problem.
Hi Phlya,
I did have two files. Are you suggesting that instead of keeping them separate as R1 and R2, I put them in the same lane?
input:
    raw_reads_paths:
        2520__WK70__WK_microC_R1:
            lane1:
                - /scratch/gpfs/SCHEDL/Wenfan/Working/test2/2520__WK70__WK_microC_20230328-read-1.fastq.gz
        2520__WK70__WK_microC_R2:
            lane1:
                - /scratch/gpfs/SCHEDL/Wenfan/Working/test2/2520__WK70__WK_microC_20230328-read-3.fastq.gz
Do you suggest making it like this?
input:
    raw_reads_paths:
        2520__WK70__WK_microC:
            lane1:
                - /scratch/gpfs/SCHEDL/Wenfan/Working/test2/2520__WK70__WK_microC_20230328-read-1.fastq.gz
                - /scratch/gpfs/SCHEDL/Wenfan/Working/test2/2520__WK70__WK_microC_20230328-read-3.fastq.gz
I guess I am confused about the "lane", "R1", "R2", and "library groups" specifications in your pipeline. How do they relate to a case where we have, say, only two files from paired-end sequencing?
Exactly, 2 files from one lane.
Then you can have multiple lanes, which together form a library. They will be merged at the level of pairs and deduplicated together. One library is meant to represent a biological sample.
Libraries can form groups, for example to merge independent replicates at the level of cooler files.
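To illustrate the structure with made-up names (the library, lane, group, and file names below are just placeholders, not taken from your data):

input:
    raw_reads_paths:
        sample_A:
            lane1:
                - sample_A_lane1_R1.fastq.gz
                - sample_A_lane1_R2.fastq.gz
            lane2:
                - sample_A_lane2_R1.fastq.gz
                - sample_A_lane2_R2.fastq.gz
        sample_B:
            lane1:
                - sample_B_lane1_R1.fastq.gz
                - sample_B_lane1_R2.fastq.gz
    library_groups:
        both_replicates:
            - sample_A
            - sample_B

Here sample_A was sequenced on two lanes, which are merged and deduplicated together as one library, and sample_A and sample_B are then merged into one cooler as a group.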
So if I don't need to group my libraries at this stage, I can just ignore the "library_groups" section and leave it blank, right?
Yes, I think it should work if you just skip the whole library groups section
When trying to run your pipeline, there seems to be an issue with mamba not being available:
Using profile workflow/profiles/default and workflow specific profile workflow/profiles/default for setting default command line arguments.
Workflow defines that rule bwaindex is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
CreateCondaEnvironmentException:
The 'mamba' command is not available in the shell /usr/bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with `conda activate base`.The mamba package manager (https://github.com/mamba-org/mamba) is a fast and robust conda replacement. It is the recommended way of using Snakemake's conda integration. It can be installed with `conda install -n base -c conda-forge mamba`. If you still prefer to use conda, you can enforce that by setting `--conda-frontend conda`.
I added "--conda-frontend conda" into the script and got another error message:
Using profile workflow/profiles/default and workflow specific profile workflow/profiles/default for setting default command line arguments.
Workflow defines that rule bwaindex is eligible for caching between workflows (use the --cache argument to enable this).
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Creating conda environment workflow/envs/pairtools_cooler.yml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /scratch/gpfs/SCHEDL/Wenfan/Working/test2/workflow/envs/pairtools_cooler.yml:
Command:
conda env create --quiet --file "/scratch/gpfs/SCHEDL/Wenfan/Working/test2/.snakemake/conda/b4f8a2918bf5e2fc3395c68cb25fe7cc_.yaml" --prefix "/scratch/gpfs/SCHEDL/Wenfan/Working/test2/.snakemake/conda/b4f8a2918bf5e2fc3395c68cb25fe7cc_"
Output:
Collecting package metadata (repodata.json): ...working... failed
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/repodata.json>
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https//conda.anaconda.org/conda-forge/linux-64'
I recommend using mamba. It works much better. And do you have internet access on the computer where you are running the pipeline?
Wenfan, I can install mamba into the environment you're using. The compute nodes on the cluster don't have internet access. We can try running on the head node. If that doesn't have enough memory then we'll switch to a non-cluster machine.
Matthew
Matthew, I would hold off on that a bit.
Phlya:
Do you know where I could modify the scripts to make them work with conda instead of mamba? I am concerned that most of my other analyses are in conda environments, and if I put mamba on top of conda I don't know whether anything strange would happen to other programs. Since conda is much more widely used than mamba, I would also suggest documenting in the pipeline description how to switch it to conda, so that it is easier for the community to use as well.
On the other hand, is it necessary to have internet access for the pipeline? I assume the only thing that needs the internet is installing the environment; once the environment is installed, it should not need the internet. Do you know if I can change the script somewhere to remove the internet dependency? This may also be important to mention in your pipeline description, since most university HPC clusters do not have internet access due to data security policies; if you could make that change, it would also be easily adopted by people in the community working on university clusters.
Thanks,
Wenfan
Phlya,
I also recreated the environment using the one from your pipeline and used it to run my own pipeline with pairtools dedup, and I ran into the same issue again:
It gets stuck there and runs forever.
My guess is that this has something to do with pairtools itself, not the environment. So even if I run your pipeline, it may end up the same way, since your pipeline also uses pairtools dedup.
Any thoughts?
Thanks,
Wenfan
This is the environment in which I ran pairtools dedup:
name: microC_processing
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- aioeasywebdav=2.4.0=pyha770c72_0
- aiohttp=3.8.6=py310h2372a71_1
- aiosignal=1.3.1=pyhd8ed1ab_0
- amply=0.1.6=pyhd8ed1ab_0
- appdirs=1.4.4=pyh9f0ad1d_0
- asciitree=0.3.3=py_2
- async-timeout=4.0.3=pyhd8ed1ab_0
- attmap=0.13.2=pyhd8ed1ab_0
- attrs=23.1.0=pyh71513ae_1
- aws-c-auth=0.7.6=h37ad1db_0
- aws-c-cal=0.6.9=h3b91eb8_1
- aws-c-common=0.9.8=hd590300_0
- aws-c-compression=0.2.17=hfd9eb17_6
- aws-c-event-stream=0.3.2=hae413d4_6
- aws-c-http=0.7.14=h162056d_1
- aws-c-io=0.13.35=hc23c90e_8
- aws-c-mqtt=0.9.9=h1387108_0
- aws-c-s3=0.3.24=h7630044_0
- aws-c-sdkutils=0.1.12=hfd9eb17_5
- aws-checksums=0.1.17=hfd9eb17_5
- aws-crt-cpp=0.24.7=h4712614_1
- aws-sdk-cpp=1.11.182=h8beafcf_7
- backports=1.0=pyhd8ed1ab_3
- backports.functools_lru_cache=1.6.5=pyhd8ed1ab_0
- bcrypt=4.0.1=py310hcb5633a_1
- bioframe=0.5.1=pyhdfd78af_0
- biopython=1.81=py310h2372a71_1
- bokeh=3.3.1=pyhd8ed1ab_0
- brotli=1.1.0=hd590300_1
- brotli-bin=1.1.0=hd590300_1
- brotli-python=1.1.0=py310hc6cd4ac_1
- bzip2=1.0.8=hd590300_5
- c-ares=1.21.0=hd590300_0
- ca-certificates=2023.7.22=hbcca054_0
- cached-property=1.5.2=hd8ed1ab_1
- cached_property=1.5.2=pyha770c72_1
- cachetools=5.3.2=pyhd8ed1ab_0
- certifi=2023.7.22=pyhd8ed1ab_0
- cffi=1.16.0=py310h2fee648_0
- charset-normalizer=3.3.2=pyhd8ed1ab_0
- click=8.1.7=unix_pyh707e725_0
- cloudpickle=3.0.0=pyhd8ed1ab_0
- coin-or-cbc=2.10.10=h9002f0b_0
- coin-or-cgl=0.60.7=h516709c_0
- coin-or-clp=1.17.8=h1ee7a9c_0
- coin-or-osi=0.108.8=ha2443b9_0
- coin-or-utils=2.11.9=hee58242_0
- coincbc=2.10.10=0_metapackage
- colorama=0.4.6=pyhd8ed1ab_0
- configargparse=1.7=pyhd8ed1ab_0
- connection_pool=0.0.3=pyhd3deb0d_0
- contourpy=1.2.0=py310hd41b1e2_0
- cooler=0.9.3=pyhdfd78af_0
- coreutils=9.4=hd590300_0
- cryptography=41.0.5=py310h75e40e8_0
- curl=8.4.0=hca28451_0
- cycler=0.12.1=pyhd8ed1ab_0
- cytoolz=0.12.2=py310h2372a71_1
- dask=2023.11.0=pyhd8ed1ab_0
- dask-core=2023.11.0=pyhd8ed1ab_0
- datrie=0.8.2=py310h2372a71_7
- defusedxml=0.7.1=pyhd8ed1ab_0
- dill=0.3.7=pyhd8ed1ab_0
- distributed=2023.11.0=pyhd8ed1ab_0
- docutils=0.20.1=py310hff52083_2
- dpath=2.1.6=pyha770c72_0
- dropbox=11.36.2=pyhd8ed1ab_0
- eido=0.2.1=pyhd8ed1ab_0
- exceptiongroup=1.1.3=pyhd8ed1ab_0
- filechunkio=1.8=py_2
- fonttools=4.44.3=py310h2372a71_0
- freetype=2.12.1=h267a509_2
- frozenlist=1.4.0=py310h2372a71_1
- fsspec=2023.10.0=pyhca7485f_0
- ftputil=5.0.4=pyhd8ed1ab_0
- gflags=2.2.2=he1b5a44_1004
- gitdb=4.0.11=pyhd8ed1ab_0
- gitpython=3.1.40=pyhd8ed1ab_0
- glog=0.6.0=h6f12383_0
- google-api-core=2.14.0=pyhd8ed1ab_0
- google-api-python-client=2.108.0=pyhd8ed1ab_0
- google-auth=2.23.4=pyhca7485f_0
- google-auth-httplib2=0.1.1=pyhd8ed1ab_0
- google-cloud-core=2.3.3=pyhd8ed1ab_0
- google-cloud-storage=2.13.0=pyhca7485f_0
- google-crc32c=1.1.2=py310hc5c09a0_5
- google-resumable-media=2.6.0=pyhd8ed1ab_0
- googleapis-common-protos=1.61.0=pyhd8ed1ab_0
- grpcio=1.59.2=py310h1b8f574_0
- h5py=3.10.0=nompi_py310ha2ad45a_100
- hdf5=1.14.2=nompi_h4f84152_100
- htslib=1.18=h81da01d_0
- httplib2=0.22.0=pyhd8ed1ab_0
- humanfriendly=10.0=pyhd8ed1ab_6
- icu=70.1=h27087fc_0
- idna=3.4=pyhd8ed1ab_0
- importlib-metadata=6.8.0=pyha770c72_0
- importlib_metadata=6.8.0=hd8ed1ab_0
- importlib_resources=6.1.1=pyhd8ed1ab_0
- iniconfig=2.0.0=pyhd8ed1ab_0
- jinja2=3.1.2=pyhd8ed1ab_1
- jsonschema=4.20.0=pyhd8ed1ab_0
- jsonschema-specifications=2023.11.1=pyhd8ed1ab_0
- jupyter_core=5.5.0=py310hff52083_0
- keyutils=1.6.1=h166bdaf_0
- kiwisolver=1.4.5=py310hd41b1e2_1
- krb5=1.21.2=h659d440_0
- lcms2=2.15=h7f713cb_2
- ld_impl_linux-64=2.40=h41732ed_0
- lerc=4.0.0=h27087fc_0
- libabseil=20230802.1=cxx17_h59595ed_0
- libaec=1.1.2=h59595ed_1
- libarrow=12.0.1=hd1ba8c9_26_cpu
- libblas=3.9.0=19_linux64_openblas
- libbrotlicommon=1.1.0=hd590300_1
- libbrotlidec=1.1.0=hd590300_1
- libbrotlienc=1.1.0=hd590300_1
- libcblas=3.9.0=19_linux64_openblas
- libcrc32c=1.1.2=h9c3ff4c_0
- libcurl=8.4.0=hca28451_0
- libdeflate=1.18=h0b41bf4_0
- libedit=3.1.20191231=he28a2e2_2
- libev=4.33=h516909a_1
- libevent=2.1.12=hf998b51_1
- libffi=3.4.2=h7f98852_5
- libgcc-ng=13.2.0=h807b86a_3
- libgfortran-ng=13.2.0=h69a702a_3
- libgfortran5=13.2.0=ha4646dd_3
- libgomp=13.2.0=h807b86a_3
- libgoogle-cloud=2.12.0=h5206363_4
- libgrpc=1.59.2=hd6c4280_0
- libiconv=1.17=h166bdaf_0
- libjpeg-turbo=2.1.5.1=hd590300_1
- liblapack=3.9.0=19_linux64_openblas
- liblapacke=3.9.0=19_linux64_openblas
- libnghttp2=1.58.0=h47da74e_0
- libnsl=2.0.1=hd590300_0
- libnuma=2.0.16=h0b41bf4_1
- libopenblas=0.3.24=pthreads_h413a1c8_0
- libpng=1.6.39=h753d276_0
- libprotobuf=4.24.4=hf27288f_0
- libre2-11=2023.06.02=h7a70373_0
- libsodium=1.0.18=h36c2ea0_1
- libsqlite=3.44.0=h2797004_0
- libssh2=1.11.0=h0841786_0
- libstdcxx-ng=13.2.0=h7e041cc_3
- libthrift=0.19.0=hb90f79a_1
- libtiff=4.6.0=h8b53f26_0
- libutf8proc=2.8.0=h166bdaf_0
- libuuid=2.38.1=h0b41bf4_0
- libwebp-base=1.3.2=hd590300_0
- libxcb=1.15=h0b41bf4_0
- libxml2=2.9.14=h22db469_4
- libzlib=1.2.13=hd590300_5
- locket=1.0.0=pyhd8ed1ab_0
- logmuse=0.2.6=pyh8c360ce_0
- lz4=4.3.2=py310h350c4a5_1
- lz4-c=1.9.4=hcb278e6_0
- markdown-it-py=3.0.0=pyhd8ed1ab_0
- markupsafe=2.1.3=py310h2372a71_1
- matplotlib-base=3.8.1=py310h62c0568_0
- mdurl=0.1.0=pyhd8ed1ab_0
- msgpack-python=1.0.6=py310hd41b1e2_0
- multidict=6.0.4=py310h2372a71_1
- multiprocess=0.70.15=py310h2372a71_1
- munkres=1.1.4=pyh9f0ad1d_0
- nbformat=5.9.2=pyhd8ed1ab_0
- ncbi-vdb=3.0.8=hdbdd923_0
- ncurses=6.4=h59595ed_2
- numpy=1.23.0=py310h53a5b5f_0
- oauth2client=4.1.3=py_0
- openjpeg=2.5.0=h488ebb8_3
- openssl=3.1.4=hd590300_0
- orc=1.9.0=h4b38347_4
- ossuuid=1.6.2=hf484d3e_1000
- packaging=23.2=pyhd8ed1ab_0
- pairix=0.3.7=py310h83093d7_5
- pairtools=1.0.2=py310hb45ccb3_1
- pandas=2.1.3=py310hcc13569_0
- paramiko=3.3.1=pyhd8ed1ab_0
- partd=1.4.1=pyhd8ed1ab_0
- pbgzip=2016.08.04=h9d449c0_4
- peppy=0.35.7=pyhd8ed1ab_0
- perl=5.32.1=4_hd590300_perl5
- perl-alien-build=2.48=pl5321hec16e2b_0
- perl-alien-libxml2=0.17=pl5321hec16e2b_0
- perl-business-isbn=3.007=pl5321hd8ed1ab_0
- perl-business-isbn-data=20210112.006=pl5321hd8ed1ab_0
- perl-capture-tiny=0.48=pl5321ha770c72_1
- perl-carp=1.50=pl5321hd8ed1ab_0
- perl-constant=1.33=pl5321hd8ed1ab_0
- perl-exporter=5.74=pl5321hd8ed1ab_0
- perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0
- perl-ffi-checklib=0.28=pl5321hdfd78af_0
- perl-file-chdir=0.1011=pl5321hd8ed1ab_0
- perl-file-path=2.18=pl5321hd8ed1ab_0
- perl-file-temp=0.2304=pl5321hd8ed1ab_0
- perl-file-which=1.24=pl5321hd8ed1ab_0
- perl-importer=0.026=pl5321hd8ed1ab_0
- perl-parent=0.241=pl5321hd8ed1ab_0
- perl-path-tiny=0.124=pl5321hd8ed1ab_0
- perl-pathtools=3.75=pl5321h166bdaf_0
- perl-scope-guard=0.21=pl5321hd8ed1ab_0
- perl-sub-info=0.002=pl5321hd8ed1ab_0
- perl-term-table=0.016=pl5321hdfd78af_0
- perl-test-fatal=0.016=pl5321ha770c72_0
- perl-test-warnings=0.031=pl5321ha770c72_0
- perl-test2-suite=0.000145=pl5321hdfd78af_0
- perl-try-tiny=0.31=pl5321ha770c72_0
- perl-uri=5.17=pl5321ha770c72_0
- perl-xml-libxml=2.0207=pl5321h661654b_0
- perl-xml-namespacesupport=1.12=pl5321hd8ed1ab_0
- perl-xml-sax=1.02=pl5321hd8ed1ab_0
- perl-xml-sax-base=1.09=pl5321hd8ed1ab_0
- pillow=10.0.1=py310h29da1c1_1
- pip=23.3.1=pyhd8ed1ab_0
- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_1
- plac=1.4.1=pyhd8ed1ab_1
- platformdirs=4.0.0=pyhd8ed1ab_0
- pluggy=1.3.0=pyhd8ed1ab_0
- ply=3.11=py_1
- prettytable=3.9.0=pyhd8ed1ab_0
- protobuf=4.24.4=py310h620c231_0
- psutil=5.9.5=py310h2372a71_1
- pthread-stubs=0.4=h36c2ea0_1001
- pulp=2.7.0=py310hff52083_1
- pyarrow=12.0.1=py310hf9e7431_26_cpu
- pyarrow-hotfix=0.5=pyhd8ed1ab_0
- pyasn1=0.5.0=pyhd8ed1ab_0
- pyasn1-modules=0.3.0=pyhd8ed1ab_0
- pycparser=2.21=pyhd8ed1ab_0
- pyfaidx=0.7.2.2=pyhdfd78af_0
- pygments=2.16.1=pyhd8ed1ab_0
- pynacl=1.5.0=py310h2372a71_3
- pyopenssl=23.3.0=pyhd8ed1ab_0
- pyparsing=3.1.1=pyhd8ed1ab_0
- pysam=0.22.0=py310h41dec4a_0
- pysftp=0.2.9=py_1
- pysocks=1.7.1=pyha2e5f31_6
- pytest=7.4.3=pyhd8ed1ab_0
- python=3.10.0=h543edf9_3_cpython
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-fastjsonschema=2.19.0=pyhd8ed1ab_0
- python-irodsclient=1.1.9=pyhd8ed1ab_0
- python-tzdata=2023.3=pyhd8ed1ab_0
- python_abi=3.10=4_cp310
- pytz=2023.3.post1=pyhd8ed1ab_0
- pyu2f=0.1.5=pyhd8ed1ab_0
- pyvcf3=1.0.3=pyhdfd78af_0
- pyyaml=6.0.1=py310h2372a71_1
- rdma-core=28.9=h59595ed_1
- re2=2023.06.02=h2873b5e_0
- readline=8.2=h8228510_1
- referencing=0.31.0=pyhd8ed1ab_0
- requests=2.31.0=pyhd8ed1ab_0
- reretry=0.11.8=pyhd8ed1ab_0
- rich=13.7.0=pyhd8ed1ab_0
- rpds-py=0.13.0=py310hcb5633a_0
- rsa=4.9=pyhd8ed1ab_0
- s2n=1.3.56=h06160fa_0
- samtools=1.18=h50ea8bc_1
- scipy=1.11.3=py310hb13e2d6_1
- setuptools=68.2.2=pyhd8ed1ab_0
- setuptools-scm=8.0.4=pyhd8ed1ab_0
- simplejson=3.19.2=py310h2372a71_0
- six=1.16.0=pyh6c4a22f_0
- slacker=0.14.0=py_0
- smart_open=6.4.0=pyhd8ed1ab_0
- smmap=5.0.0=pyhd8ed1ab_0
- snakemake=7.32.4=hdfd78af_1
- snakemake-minimal=7.32.4=pyhdfd78af_1
- snappy=1.1.10=h9fff704_0
- sortedcontainers=2.4.0=pyhd8ed1ab_0
- sqlite=3.44.0=h2c6b66d_0
- sra-tools=3.0.8=h9f5acd7_0
- stone=3.3.1=pyhd8ed1ab_0
- stopit=1.1.2=py_0
- tabix=1.11=hdfd78af_0
- tabulate=0.9.0=pyhd8ed1ab_1
- tblib=2.0.0=pyhd8ed1ab_0
- throttler=1.2.2=pyhd8ed1ab_0
- tk=8.6.13=noxft_h4845f30_101
- tomli=2.0.1=pyhd8ed1ab_0
- toolz=0.12.0=pyhd8ed1ab_0
- toposort=1.10=pyhd8ed1ab_0
- tornado=6.3.3=py310h2372a71_1
- traitlets=5.13.0=pyhd8ed1ab_0
- typing-extensions=4.8.0=hd8ed1ab_0
- typing_extensions=4.8.0=pyha770c72_0
- tzdata=2023c=h71feb2d_0
- ubiquerg=0.6.3=pyhd8ed1ab_0
- ucx=1.15.0=h64cca9d_0
- unicodedata2=15.1.0=py310h2372a71_0
- uritemplate=4.1.1=pyhd8ed1ab_0
- urllib3=1.26.18=pyhd8ed1ab_0
- veracitools=0.1.3=py_0
- wcwidth=0.2.10=pyhd8ed1ab_0
- wheel=0.41.3=pyhd8ed1ab_0
- wrapt=1.16.0=py310h2372a71_0
- xorg-libxau=1.0.11=hd590300_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xyzservices=2023.10.1=pyhd8ed1ab_0
- xz=5.2.6=h166bdaf_0
- yaml=0.2.5=h7f98852_2
- yarl=1.9.2=py310h2372a71_1
- yte=1.5.1=pyha770c72_2
- zict=3.0.0=pyhd8ed1ab_0
- zipp=3.17.0=pyhd8ed1ab_0
- zlib=1.2.13=hd590300_5
- zstd=1.5.5=hfc55251_0
- pip:
- annotated-types==0.6.0
- asttokens==2.4.1
- blinker==1.7.0
- boto3==1.19.1
- botocore==1.22.12
- clodius==0.20.1
- comm==0.2.0
- decorator==5.1.1
- diskcache==5.6.3
- executing==2.0.1
- flask==3.0.0
- flask-cors==4.0.0
- fusepy==3.0.1
- higlass-python==0.4.8
- ipython==8.17.2
- ipywidgets==8.1.1
- itsdangerous==2.1.2
- jedi==0.19.1
- jmespath==0.10.0
- jupyterlab-widgets==3.0.9
- matplotlib-inline==0.1.6
- negspy==0.2.24
- parso==0.8.3
- pexpect==4.8.0
- prompt-toolkit==3.0.41
- ptyprocess==0.7.0
- pure-eval==0.2.2
- pybbi==0.3.5
- pydantic==2.5.1
- pydantic-core==2.14.3
- python-dotenv==0.12.0
- requests-unixsocket==0.3.0
- resgen-python==0.6.1
- s3transfer==0.5.2
- sh==2.0.6
- simple-httpfs==0.4.12
- slugid==2.0.0
- stack-data==0.6.3
- tenacity==8.2.3
- tqdm==4.66.1
- werkzeug==3.0.1
- widgetsnbextension==4.0.9
prefix: /home/wk9698/anaconda3/envs/microC_processing
I mean, mamba is the default recommended package manager for snakemake, and it just works so much better and faster than conda; I don't even know anyone still using conda... But they use exactly the same environment files, so theoretically it should also work with conda. There is nothing to change in the pipeline for that. I just haven't tried it with conda, and solving the environments with it might take forever, I don't know.
It is necessary to have internet access to create the environments. Perhaps you can first run the pipeline with --conda-create-envs-only, and then submit the actual jobs to the compute nodes... But that is something I haven't tried, and it's not something I can modify - it's managed by snakemake. A cluster with no internet access is very sad :( I'm very happy I haven't had to work on one like that so far.
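Roughly something like this (untested; this assumes Snakemake's --conda-create-envs-only flag and the default profile shipped with the pipeline, so adjust to your setup):

# on the login node, which has internet access: only build the conda environments
snakemake --use-conda --conda-frontend conda --conda-create-envs-only
# then run/submit the actual workflow; the environments cached under .snakemake/conda
# should be reused without needing internet access
snakemake --use-conda --conda-frontend conda --profile workflow/profiles/default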
Have you tried my original suggestions of using a smaller chunksize or changing the backend to cython? If the environment is fine, that is my best guess for what could help with the excessive memory consumption and failed deduplication.
Since you had a successful run with this data previously, what results did you get - what % duplication was there?
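(If you don't have that number handy, one option - just a suggestion, not something your current pipeline does - is to let dedup write a stats file, or to run pairtools stats on the marked output; the file names below are only examples:)

        pairtools dedup --mark-dups --output-stats {sample}_dedup.stats -o {output} {input}
        # or, on an already deduplicated file with duplicates marked:
        pairtools stats -o {sample}.stats {sample}_deduped.pairsam.gz
        # the stats file includes duplicate counts (total_dups / total_nodups),
        # from which the % duplication can be computed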
Just wondering how to specify the chunksize and change the backend to cython in the script I had:
rule dedup:
    input:
        "{sample}_sorted.pairsam.gz"
    output:
        "{sample}_deduped.pairsam.gz"
    shell:
        """
        pairtools dedup --mark-dups -o {output} {input}
        """
This is the duplication analysis with FastQC:
https://pairtools.readthedocs.io/en/latest/cli_tools.html#pairtools-dedup - here are the docs with the arguments.
Try either --chunksize 10000 or --backend cython.
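In your rule only the shell command changes, e.g. something like this (a sketch of your dedup rule with the chunksize option; swap --chunksize 10000 for --backend cython to try the other suggestion instead):

rule dedup:
    input:
        "{sample}_sorted.pairsam.gz"
    output:
        "{sample}_deduped.pairsam.gz"
    shell:
        """
        pairtools dedup --mark-dups --chunksize 10000 -o {output} {input}
        """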
Wondering what the default chunksize is if I don't specify it?
Never mind, I found it - the default is 10 times larger.
The dedup did finish the job with a 10K chunksize! And with much lower memory usage. So it might have something to do with an issue handling the larger chunksize on our system.
Just wondering how many duplicates it might miss with a lower chunksize, in your experience?
It shouldn't miss anything! So that's great news, thank you for reporting back. Perhaps we should lower the default then. In the data we used for testing there was no difference between these two chunk sizes (though at an even lower chunksize of 1000 it became significantly slower), but I guess with a higher duplication rate it can make a huge difference.
Thanks for helping and explaining the concepts!
Happy to help!
We are running dedup in pairtools on a Linux cluster running:
A previous job, run on October 15th (a month ago), ran for 12 hours 15 minutes and worked. However, without any changes that we're aware of, another job using the same data and the same script and config files seems to run forever (four days so far) without ending. Unlike the job that worked, for which the memory utilization went up and down roughly hourly, the failed job used about 200 GB of memory constantly, without making any progress.
On the attempt we are currently running, the pbgzip process is using 135% of a CPU core, and the pairtools process is using 100%, but there doesn't seem to have been any output in the last few hours.
Can anyone think of something we can look at to diagnose this?
Thanks for any help, Matthew Cahn and Wenfan Ke