vinisalazar / metaphor

Metaphor: a general-purpose workflow for assembly and binning of metagenomes
https://metaphor-workflow.readthedocs.io/

checkpoint function or restart #49

Closed ganiatgithub closed 1 year ago

ganiatgithub commented 1 year ago

Hello Vini,

I've been running Metaphor on a large dataset. Due to the wall-time limit, my job was killed before it finished. Here is the code I used to run it:

mamba activate metaphor

cd /home/gnii0001/rk81_scratch/gaofeng/projects/12_nutrition/script
metaphor execute \
-f metaphor_settings.yaml \
-c 20

mamba deactivate

My settings are as follows:

cat metaphor_settings.yaml 
samples: samples.csv
data_dir: DEFAULT
cores_per_big_task: 1
cores_per_medium_task: 0.5
cores_per_small_task: 0.25
max_mb: 200000
scheduler: false
transparent_background: true
dpi: 600
output_format: png
fastp:
  activate: true
  length_required: 50
  cut_mean_quality: 30
  extra: --detect_adapter_for_pe
merge_reads:
  activate: true
host_removal:
  activate: true
  reference: /home/gnii0001/12_nutrition/data/00_reads/human_genome/chm13v2.0.fa.gz
fastqc:
  activate: true
multiqc:
  activate: true
coassembly: false
megahit:
  preset: meta-large
  min_contig_len: 1000
  remove_intermediate_contigs: true
rename_contigs:
  activate: true
  awk_command: awk '/^>/{{gsub(" |\\\\.|=", "_", $0); print $0; next}}{{print}}' {input}
    > {output}
metaquast:
  activate: false
  coassembly_reference: ''
prodigal:
  activate: true
  mode: meta
  quiet: true
  genes: false
  scores: false
prokka:
  activate: false
  args: --quiet --force
diamond:
  db: COG2020/cog-20.dmnd
  db_source: COG2020/cog-20.fa.gz
  output_type: 6
  output_format: qseqid sseqid stitle evalue bitscore staxids sscinames
cog_functional_parser:
  activate: true
  db: COG2020
lineage_parser:
  activate: true
  taxonmap: COG2020/cog-20.taxonmap.tsv
  rankedlineage: taxonomy/rankedlineage.dmp
  names: taxonomy/names.dmp
  nodes: taxonomy/nodes.dmp
  download_url: https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
plot_cog_functional:
  activate: true
  filter_categories: true
  categories_cutoff: 0.01
plot_taxonomies:
  activate: true
  tax_cutoff: 20
  colormap: tab20c
cobinning: true
vamb:
  activate: true
  minfasta: 100000
  batchsize: 256
metabat2:
  activate: true
  seed: 0
  preffix: bin
concoct:
  activate: true
das_tool:
  activate: true
  score_threshold: 0.5
  bins_report: true
postprocessing:
  activate: true
  runtime_unit: m
  runtime_cutoff: 5
  memory_unit: max_vms
  memory_cutoff: 1
  memory_gb: true

Currently, 26 out of 90 MEGAHIT assemblies have completed. Is there a good way to restart from a checkpoint?

Also, if I restart, can I increase the CPU count to speed things up?

Many thanks and sorry if there's an obvious answer I've missed.

camilogarciabotero commented 1 year ago

Hey @ganiatgithub

I have used -e " --rerun-incomplete" in the execution line to re-run the workflow. It will (in theory) resume from where it stopped. Sometimes it also helps to delete the .snakemake folder that Snakemake creates in your execution folder. So the command would be:

metaphor execute -f metaphor_settings.yaml -c 20 -e " --rerun-incomplete"

I guess that if you change the -c value in the command, it will rerun with that number of cores.
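
For reference, a minimal sketch of the restart sequence described above (the core count is just the original value and the rm step is only the optional clean-up mentioned above):

# optional clean-up: remove Snakemake's bookkeeping folder in the execution directory
rm -rf .snakemake
# re-run; outputs already on disk are kept, unfinished jobs are redone
metaphor execute -f metaphor_settings.yaml -c 20 -e " --rerun-incomplete"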

vinisalazar commented 1 year ago

Hi @ganiatgithub,

@camilogarciabotero that is correct. But unless you get an error message specifically saying you need it, there's no need to add the -e " --rerun-incomplete" flag. Just run the pipeline like you would normally and it will continue from where it left off, assuming the outputs haven't been deleted.

Also, if I restart, can I increase the CPU count to speed things up?

Yes, no problems with that.
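
For example, resuming with more cores is just a matter of re-running the same command with a larger -c value (40 here is only an illustration):

metaphor execute -f metaphor_settings.yaml -c 40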

By the way, if you are using an HPC/scheduler system, you may consider using a profile, such as this SLURM one, and setting the scheduler option to true. This launches a separate scheduler job for each of your tasks, which usually scales better. I will see about adding a template profile so it's easier to configure Metaphor this way.
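
To make that concrete, a cluster profile is just a directory containing a config.yaml of Snakemake command-line options. A rough, untested sketch of what such a file could look like for SLURM is below; the location, partition, memory, and time values are placeholders, and the exact keys depend on your Snakemake version and cluster:

# e.g. ~/.config/snakemake/slurm/config.yaml (hypothetical location and values)
jobs: 10
cluster: "sbatch --partition=normal --cpus-per-task={threads} --mem={resources.mem_mb} --time=24:00:00"
latency-wait: 60
restart-times: 1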

Please let us know what other questions you may have.

Thank you, Vini

camilogarciabotero commented 1 year ago

Hey V,

The information on creating a profile and running on HPC is just what I needed, thank you so much. Is there any chance of including a concrete example in the docs at some point? I'm trying to follow the information, but it looks like it comprises several steps, or at least several alternatives... What would be the most straightforward path for setting up a job?

Best, Camilo.

ganiatgithub commented 1 year ago

Many thanks both.

To proceed I would need to unlock the Snakefile by running the following:

snakemake --snakefile /home/gnii0001/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpb36nl_hj/file/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/metaphor/lib/python3.11/site-packages/metaphor/workflow/Snakefile --unlock

Before that I need to locate the Snakefile using: find ~ -name Snakefile

Can you also help me understand how the profile approach will work in reality?

If I have 100 samples in my dataset and I've set a maximum of 10 jobs to run at a time, will it handle 10 samples from beginning to end and then move on to the next batch? Or will it work in stages, i.e. finish QC for all samples and then move on to assembly?

It would be much appreciated if you could provide some documentation for incorporating profiles.

vinisalazar commented 1 year ago

To proceed I would need to unlock the Snakefile by running the following:

snakemake --snakefile /home/gnii0001/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpb36nl_hj/file/home/gnii0001/rp24/gaofeng/tools/Miniconda3/envs/metaphor/lib/python3.11/site-packages/metaphor/workflow/Snakefile --unlock

Before that I need to locate the Snakefile using: find ~ -name Snakefile

That is correct. If Metaphor is interrupted involuntarily, it may require the unlock command to be run. The --unlock flag is also available directly from the Metaphor command:

metaphor execute -c 2 --unlock

If you do need to run Snakemake-specific flags, however, you can find the path to the Metaphor Snakefile by running:

metaphor config show --snakefile
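
As a sketch of how that could be combined with the unlock call above (this assumes the command prints only the Snakefile path, which is worth verifying on your system):

SNAKEFILE=$(metaphor config show --snakefile)   # capture the printed path (assumed output format)
snakemake --snakefile "$SNAKEFILE" --unlock     # same unlock call as above, without hunting for the file with find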

Can you also help me understand how the profile approach will work in reality?

It is difficult to provide a single answer for that, as every cluster configuration is different. For me, what works is opening a UNIX screen on my login node and running the Metaphor command from there with the scheduler profile. Some systems, however, have very strict limitations on login nodes, and then you would need to submit the Snakemake process itself as a job to the scheduler. The profile I use is the one I linked in my comment above, but maybe you can try this one if you are working with SLURM: smk-simple-slurm.

You run Metaphor exactly as you would normally, but pass the -l or --profile flag pointing to the profile directory, and set scheduler: true in metaphor_settings.yaml:

metaphor execute -c 20 -l <your-profile-directory>
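
A minimal sketch of the login-node approach described above (the session name and profile path are placeholders; check your cluster's policy on long-running processes on login nodes):

screen -S metaphor                                        # open a persistent terminal session
metaphor execute -f metaphor_settings.yaml -c 20 -l ~/.config/snakemake/slurm
# detach with Ctrl-a d; re-attach later with: screen -r metaphor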

If I have 100 samples in my dataset and I've set a maximum of 10 jobs to run at a time, will it handle 10 samples from beginning to end and then move on to the next batch? Or will it work in stages, i.e. finish QC for all samples and then move on to assembly?

The order is effectively random. Snakemake calculates the execution DAG, and if one job is not required by another (that is, they are independent), their execution order is arbitrary. If you set the maximum number of jobs to 10, it will always try to keep 10 jobs running at a time: as soon as one job finishes, the next one begins, as long as the specified resources allow it.
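
If you want to preview which jobs are still pending before committing resources, one option is to forward Snakemake's standard dry-run flag through -e, the same way --rerun-incomplete was passed earlier in this thread (a sketch, assuming -e simply forwards extra arguments to Snakemake):

metaphor execute -f metaphor_settings.yaml -c 20 -e " --dry-run"   # lists remaining jobs without running anything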

It would be much appreciated if you could provide some documentation for incorporating profiles.

Thank you for the feedback, I will try to do that. In the meantime, the documentation for the Snakemake configuration profiles (such as the cookiecutter one from the comment above and smk-simple-slurm from this comment) should be able to help you.

Best, Vini

vinisalazar commented 1 year ago

I am going to go ahead and close this issue as there appear to be no outstanding action points, but please don't hesitate to reopen it if you continue to have problems.

Thank you. Vini