Closed: GeoMicroSoares closed this issue 3 years ago.
Just an update: I had to add a line to cluster_ava_alignments.py that raises Python's recursion limit before the rest of the script runs, and that helped (currently running): sys.setrecursionlimit(10000). This may overload other machines, but it's working so far on the server I'm using.
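For reference, the change amounts to the following near the top of cluster_ava_alignments.py, before the clustering code runs (exact placement is approximate):

import sys

# Raise CPython's default recursion ceiling (1000) so deeply recursive
# clustering calls can complete on large bins. A higher limit allows
# deeper stacks and more memory use, which may crash smaller machines.
sys.setrecursionlimit(10000)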
That's a good solution. We should add a configuration option to let you do that without modifying the code.
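Sketched below is one way such an option could work, with the limit exposed as a command-line flag that a Snakemake rule could fill in from config.yml. The flag name --max-recursion is an assumption for illustration, not an existing pipeline option:

import argparse
import sys

# Hypothetical sketch: let the caller override the recursion limit
# instead of hard-coding it in the script.
parser = argparse.ArgumentParser()
parser.add_argument('--max-recursion', type=int, default=None,
                    help='Override the Python recursion limit')
args, _ = parser.parse_known_args()
if args.max_recursion is not None:
    sys.setrecursionlimit(args.max_recursion)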
In the meantime, for anyone who doesn't want to muck about in the code, or if you don't have the computing resources to handle really complicated clusters, the "solution" we came up with was to skip clusters that threw these errors. See the README for the SKIP_BINS configuration option:
SKIP_BINS:<sample>:<stype>:<version>
: List of k-mer bins to skip for analysis. HDBSCAN assigns all unbinned reads to bin_id = -1, which should always be skipped. Occasionally, downstream errors in the pipeline indicate that additional bad k-mer bins should also be skipped.
Hi again @jmeppley, I just ran into a segmentation fault while the pipeline tried to run cluster_ava_alignments.py on bin -1 (unbinned reads), even though my config.yml specifies that this bin should be skipped:
SKIP_BINS:
    sample:
        1D:
            v1: [-1]
            v2: [-1]
Is there a way to force this somehow? Error message below. Thanks!
python /mnt/XIO_4_via_NFS/test_dtr/DTR-phage-pipeline/scripts/cluster_ava_alignments.py -p /data6/test_dtr/sample/1D/v1/kmer_binning/refine_bins/alignments/-1/-1.clust -s 0.8 -n 5 /data6/test_dtr/sample/1D/v1/kmer_binning/refine_bins/alignments/-1/-1.ava.paf
Activating conda environment: /mnt/XIO_4_via_NFS/test_dtr/DTR-phage-pipeline/.snakemake/conda/60ec611e186987fffbc0b287faa5a054
/bin/bash: line 1: 16405 Segmentation fault (core dumped) python /mnt/XIO_4_via_NFS/test_dtr/DTR-phage-pipeline/scripts/cluster_ava_alignments.py -p /data6/test_dtr/sample/1D/v1/kmer_binning/refine_bins/alignments/-1/-1.clust -s 0.8 -n 5 /data6/test_dtr/sample/1D/v1/kmer_binning/refine_bins/alignments/-1/-1.ava.paf
Quick edit: I only have v1 running (v2 is commented out of the config.yml), but nothing changed: bin -1 is still being processed. --keep-going doesn't work after the segfault.
Yeah, that seems odd to me. Can you send your entire config file?
Sure thing, follows below:
$ cat ../test_dtr/DTR-phage-pipeline/config.yml
#####################
# Editing mandatory #
#####################
# Run details. These primarily determine where the output ends up (see output_root below)
sample: "sample"
stype: "1D"
version: "v1"
# Path of the read (*.fasta) & sequencing summary files you want to analyze
input_fasta: /data6/test_dtr/merged_SM_L_sup.fasta
input_summary: /data6/test_dtr/sequencing_summary.txt
# Path where you would like pipeline results to be written.
# Run outputs will end up in the directory: <output_root>/<sample>/<stype>/<version>/
output_root: /data6/test_dtr/
max_threads: 32 # Maximum number of threads the pipeline will attempt to use
pre_filter: DTR # Only process reads with DTR (None: all reads, Virsorter: virsorter)
# virsorter is purely experimental at this point
# Medaka model depends on whether MinION or PromethION used for sequencing and which Guppy version
# was used for basecalling. Please see Medaka documentation for list of available models.
MEDAKA:
    model: 'r941_min_high_g303'
    threads: 4 # Should be << max_threads so that multiple genomes can be polished in parallel
KAIJU:
    db: /data6/databases/kaijudb
    switch: '-a greedy -e 5' # greedy mode allowing up to 5 substitutions; slower but more sensitive than the default parameters
    run: True # Set to False to skip kaiju (it will also be skipped if the DB is missing)
# Specify which k-mer bin to skip for each sample/run/version. HDBSCAN assigns all unbinned reads to bin_id = -1,
# so this bin should always be skipped. But occasionally other 'bad' bins require skipping if they generate downstream
# errors in the pipeline.
SKIP_BINS:
    sample:
        1D:
            v1: [-1]
            # v2: [-1]
####################
# Editing optional #
####################
DTR_FIND:
    ovlp: 20 # Percent of read start/end to check for alignment
    chunksize: 5000
VIRSORTER:
    db: 1 # Either "1" (DEFAULT Refseqdb) or "2" (Viromedb)
    diamond: True # False: blastp, True: diamond
    virome: False # True: input is a virome, don't train on data
    prophage: False # False: only use phage, True: use prophage, too
    data_dir: '/mnt/arginine/galaxydata/seqdbs/virsorter-data' # location of virsorter-data
KMER_CALC:
    k: 5 # k-mer size to use for k-mer binning
    cli: '-c' # Use raw k-mer counts instead of normalized k-mer frequencies (best for genome-spanning reads)
UMAP:
    min_rl: 5000 # Minimum read length to consider in analysis
    max_cb_rl: 60000 # Maximum read length to include in colorbar for UMAP scatterplot annotated by read length
    min_d: 0.1 # min_dist parameter for UMAP
    neighbors: 15 # n_neighbors parameter for UMAP
    min_q: 9 # Filtering out low-qscore reads prior to running UMAP improves resolution of bins
    bin_min_reads: 10 # Don't create a bin if the bin contains < bin_min_reads
    scatter_size: 7 # Marker size for UMAP scatter plots
    scatter_alpha: 0.2 # Marker alpha for UMAP scatter plots
BIN_AVA_ALIGN:
    MINIMAP2:
        cli: '-x ava-ont --no-long-join -r100'
        threads: 1
    CLUST:
        min_score_frac: 0.8
        min_reads: 5
DTR_ALIGN:
    ovlp: 20 # Percent of read start/end to check for alignment
    threads: 1 # A single thread is optimal for aligning a single read start/end
NUCMER:
    mincov: 98.0 # Genome-genome percent coverage threshold for deduplicating polished genomes
    minpct: 98.0 # Genome-genome percent identity threshold for deduplicating polished genomes
MINIMAP2:
    STANDARD:
        switches: '-x map-ont'
        threads: 4
RACON:
    repeat: 3 # Number of iterations to run
    include-unpolished: True
    window-length: 500
    quality-threshold: 9
    error-threshold: 0.3
    match: 5
    mismatch: -4
    gap: -8
    threads: 4
    fragment-correction: True ## should not contain dual/self overlaps
PORECHOP:
    switches: '--no_split'
PRODIGAL:
    threads: 4
LIN_CONCAT:
    ovlp_len: 3000
    minlen: 15000
It looks like a past update broke the SKIP_BINS functionality by getting strings and ints mixed up. I'm working on a fix and also adding the option to override the python max recursion depth if your system has the resources.
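For anyone curious, the failure mode is the usual one where a Snakemake wildcard (always a string) is compared against the integer bin IDs read from the config. An illustrative sketch, not the actual pipeline code:

# Illustrative only: Snakemake wildcards arrive as strings, while
# SKIP_BINS in config.yml holds ints, so a naive membership test fails.
skip_bins = [-1]   # from SKIP_BINS:sample:1D:v1
bin_id = "-1"      # wildcard value as Snakemake provides it

print(bin_id in skip_bins)        # False: "-1" != -1, so bin -1 runs anyway
print(int(bin_id) in skip_bins)   # True: casting before comparing fixes it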
@GeoMicroSoares, let me know if the latest code works for you
Hey @jmeppley, all works perfectly, the run just finished! I just had to make a couple of quick edits to the .yml files:

- envs/medaka-0.11.0.yml: added conda-forge as a channel (bioconda alone causes conflicts for me).
- envs/racon.yml: changed - snakemake==5.1.5 to just - snakemake; conflicts were caused otherwise.

Hope this helps other people, and thanks so much for your help!
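For concreteness, those edits amount to something like this (only the touched sections shown; the rest of each env file is unchanged):

# envs/medaka-0.11.0.yml: add conda-forge alongside bioconda
channels:
    - conda-forge
    - bioconda

# envs/racon.yml: drop the version pin that caused solver conflicts
dependencies:
    - snakemake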
Hi there,
Running the pipeline, I ran into a recursion error (see part of it below) that is related to scipy. I've tried setting the recursion limit to 100000 (sys.setrecursionlimit(100000)) in the conda environment created during the snakemake run, but with no success. Can you provide any ideas to fix this @jmeppley? Let me know what other information I can provide. Thanks!
If it's of any use, some of the read IDs in the .paf file start with a number, e.g. 3a93a918-197a-42e1-bb8f-c58fdbdb2b16, and I've noticed recursion errors may happen with scipy because of this?
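One note on the recursion limit: sys.setrecursionlimit() only affects the process that calls it, so setting it in an interactive session inside the conda environment won't carry over to the separate process Snakemake launches for the script; the call has to live inside cluster_ava_alignments.py itself. A minimal check:

import sys

# The limit is per-process: raising it in one Python session does not
# affect the process that Snakemake spawns to run the script.
print(sys.getrecursionlimit())  # CPython's default is 1000
sys.setrecursionlimit(100000)   # effective only within this process
print(sys.getrecursionlimit())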