nf-core / deepmodeloptim

Stochastic Testing and Input Manipulation for Unbiased Learning Systems
https://nf-co.re/deepmodeloptim
MIT License

Add `ps` to all containers #141

Closed · evanfloden closed 6 months ago

evanfloden commented 6 months ago

Having the `ps` package in each container allows Nextflow to gather task metrics.
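
For context, those per-task metrics (CPU, memory, I/O) come from Nextflow's trace machinery, which shells out to `ps` inside the task container. A minimal sketch of enabling it in `nextflow.config` (the file name and field selection here are illustrative, not from this repo):

```groovy
// Hypothetical nextflow.config snippet: enable the trace report.
// Collecting per-task metrics such as %cpu and peak_rss is what
// requires `ps` to be present inside the container.
trace {
    enabled = true
    file    = 'pipeline_trace.txt'
    fields  = 'task_id,name,status,exit,%cpu,peak_rss'
}
```

Without `ps` in the image, the task fails like this: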


```
Error executing process > 'CHECK_MODEL:CHECK_TORCH_MODEL (titanic_stimulus.json-titanic_stimulus.csv)'

Caused by:
  Essential container in task exited

Command executed:

  launch_check_model.py -d titanic_stimulus.csv -m titanic_model.py -e titanic_stimulus.json -c titanic_model.yaml

Command exit status:
  1

Command output:
  (empty)

Command error:
  Command 'ps' required by nextflow to collect task metrics cannot be found
  12:15PM INF shutdown filesystem start
  12:15PM INF shutdown filesystem done

Work dir:
  s3://XXXXXscratch/5vgaUbWnwh9Mjf/6d/e06234d28e40286134252a090f65a4

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
```

evanfloden commented 6 months ago

See procps-ng

https://github.com/nextflow-io/rnaseq-nf/blob/master/docker/Dockerfile#L12
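
For a Debian-based image like that one, the fix is typically a one-line addition; a minimal sketch (on Debian/Ubuntu the package providing `ps` is `procps`):

```dockerfile
# Install ps (procps) so Nextflow can collect task metrics;
# clean the apt cache afterwards to keep the layer small.
RUN apt-get update \
    && apt-get install -y --no-install-recommends procps \
    && rm -rf /var/lib/apt/lists/*
```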

alessiovignoli commented 6 months ago

```dockerfile
# mambaorg/micromamba:1.5.8-bookworm-slim
FROM mambaorg/micromamba@sha256:abcb3ae7e3521d08e1fdeaff63131765b34e4f29b6a8a2c28660036b53841569
# python:3.11.8-slim-bullseye (intel64)
FROM python@sha256:a2d01031695ff170831430810ee30dd06d8413b08f72ad978b43fd10daa6b86e

LABEL maintainer="Alessio Vignoli" \
      name="alessiovignoli3/stimulus:latest" \
      description="Docker image containing python packages required for stimulus using modules"

RUN micromamba install -y -n base -c defaults -c bioconda -c conda-forge \
        python=3.11.8 \
        typing_extensions=4.11.0 \
        importlib_metadata=7.1.0 \
        numpy=1.26 \
        pytorch-lightning=2.0.1 \
        polars=0.20.19 \
        scikit-learn=1.3.0 \
        ray-tune=2.12.0 \
        ray-train=2.12.0 \
        procps-ng=4.0.4 \
        matplotlib=3.8.2 \
        && micromamba clean -a -y

ENV PATH="$MAMBA_ROOT_PREFIX/bin:$PATH"
USER root
```
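
Before pushing an image like this, a quick smoke test that `ps` actually resolves inside it may be worth doing; a hedged one-liner (the tag is the one from the LABEL above):

```bash
# Should print a process table rather than "ps: command not found"
docker run --rm alessiovignoli3/stimulus:latest ps aux
```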

Using this image I only managed to run it once on the CRG cluster; the other times it always throws this error:

```
N E X T F L O W  ~  version 23.10.1
Launching `main.nf` [deadly_ptolemy] DSL2 - revision: 32ea66cbec
executor >  crg (1)
[7a/be75a4] process > CHECK_MODEL:CHECK_TORCH_MODEL (titanic_stimulus.json-titanic_stimulus.csv) [100%] 1 of 1, failed: 1 ✘
[-        ] process > HANDLE_DATA:INTERPRET_JSON                                                 -
[-        ] process > HANDLE_DATA:SPLIT_CSV:STIMULUS_SPLIT_CSV                                   -
[-        ] process > HANDLE_DATA:TRANSFORM_CSV:STIMULUS_TRANSFORM_CSV                           -
[-        ] process > HANDLE_DATA:SHUFFLE_CSV:STIMULUS_SHUFFLE_CSV                               -
[-        ] process > HANDLE_TUNE:TORCH_TUNE                                                     -
Execution cancelled -- Finishing pending tasks before exit
Done
ERROR ~ Error executing process > 'CHECK_MODEL:CHECK_TORCH_MODEL (titanic_stimulus.json-titanic_stimulus.csv)'

Caused by:
  Process `CHECK_MODEL:CHECK_TORCH_MODEL (titanic_stimulus.json-titanic_stimulus.csv)` terminated with an error exit status (132)

Command executed:

  launch_check_model.py -d titanic_stimulus.csv -m titanic_model.py -e titanic_stimulus.json -c titanic_model.yaml

Command exit status:
  132

Command output:
  (empty)

Command error:
  .command.sh: line 2:    15 Illegal instruction     (core dumped) launch_check_model.py -d titanic_stimulus.csv -m titanic_model.py -e titanic_stimulus.json -c titanic_model.yaml

Work dir:
  /nfs/users/cn/avignoli/stimulus/work/7a/be75a4998174e0b2e5a9c57df4e8cb

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
```

I guess it has to do with how recent the CPUs on the cluster are: exit status 132 is 128 + 4, i.e. the process was killed by SIGILL (illegal instruction), which typically means a binary was compiled for CPU instructions (e.g. AVX2/AVX-512) that the node doesn't support. The same image (which seems fine in terms of packages and dependencies) sometimes runs and sometimes doesn't.
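
One way to test that theory, assuming shell access to a failing node (a diagnostic sketch, not part of the pipeline):

```bash
# List the SIMD extensions this node's CPU advertises; if failing nodes lack
# avx2/avx512f while working nodes have them, the CPU theory is confirmed.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sse4_2|avx|avx2|avx512f)$'
```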

The image present in #150 throws the following error on the Seqera Platform (here). It seems `ps` is no longer an issue, but the memory of the node is.

Solved in PR #150.