khanlab / hippunfold

BIDS App for Hippunfold (automated hippocampal unfolding and subfield segmentation)
https://hippunfold.readthedocs.io
MIT License

Hippunfold errors out at 46% #285

Closed · vnbcs closed this issue 3 months ago

vnbcs commented 3 months ago

I am having a strange problem. I was successfully running Hippunfold on 33 subsets of data, each with 50 subjects. Each subset has its own BIDS-formatted input folder, output folder, and Slurm job; all jobs share a single folder for Slurm logs.

All jobs errored out at 46% done, and each failed in less than 12 hours. Normally, in the event of a crash, I would just re-run from where I left off, but every job failing at exactly the same percentage seemed too consistent to be a coincidence.

I tried to attach all 33 logs, but many of the files were too big, so I put them on Google Drive: https://drive.google.com/drive/folders/1DfS5CQGerRbVmMhVXztlLyY1byeXme4x?usp=drive_link

More context:

I run Hippunfold at my university's supercomputing center, which offers several types of clusters. Of the 33 subsets, 4 were running on an SMP cluster, 19 on an HTC cluster, and 10 on an MPI cluster. This is the template used to create each job:

#!/bin/bash
#SBATCH --account=jhanson
#SBATCH --nodes=NODES
#SBATCH --ntasks-per-node=TASKS
#SBATCH --cluster=CLUS
#SBATCH --partition=PART
#SBATCH --time=7-00:00
#SBATCH --output=/path/to/project/logs/abcd_subsetNUM_%j.out 
#SBATCH --job-name=setNUMHippABCD

module load singularity

SIFILE=/path/to/project/code/khanlab_hippunfold_latest.sif
INFOLD=/path/to/project/data/abcd_subsetNUM
OUTFOLD=/path/to/project/output/hipp_subsetNUM

mkdir -p "$OUTFOLD"

singularity run -e "$SIFILE" "$INFOLD" "$OUTFOLD" \
participant -p --cores TASKS --modality T1w --keep-going --force-output

Depending on the cluster, I use 1-2 nodes and 16 or 48 tasks per node. I am unsure how tasks and nodes map to the --cores option, so I set --cores equal to the number of tasks per node. Modules are loaded using Lmod. Singularity is version 3.9.6. The time limit is 6 or 7 days, depending on what is allowed for that cluster. The SIF file was built from the most recent version of the Hippunfold Docker image. Let me know if I can provide additional information.
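(Not part of the original report, but a hedged sketch of one way to avoid hard-coding the core count: Hippunfold is a Snakemake workflow, and --cores is passed through to Snakemake, which parallelizes on a single node only, so a plain `singularity run` will leave a second allocated node idle. Deriving the count from Slurm's own environment keeps the script and the allocation in sync. The fallback value of 16 is illustrative.)

```shell
# Hedged sketch: derive --cores from what Slurm actually allocated.
# SLURM_CPUS_ON_NODE is set by Slurm inside a running job; outside a
# job we fall back to an illustrative default of 16.
CORES="${SLURM_CPUS_ON_NODE:-16}"
echo "Requesting $CORES cores from the workflow"

# Then the container invocation from the template above would become:
# singularity run -e "$SIFILE" "$INFOLD" "$OUTFOLD" \
#     participant -p --cores "$CORES" --modality T1w --keep-going --force-output
```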

jordandekraker commented 3 months ago

Very interesting, they all seem to be failing at the same step (called rule laplace_coords_hipp).

Could you please attach one example log from this particular step? It should be in (for example) YOUR_OUTPUT_DIR/logs/sub-NDARINV8ZBNEBU4/ses-baselineYear1Arm1/sub-NDARINV8ZBNEBU4_ses-baselineYear1Arm1_dir-PD_hemi-R_laplace-hipp.txt

akhanf commented 3 months ago

Had a look at one of the log files -- it looks like the local Python environment in your home directory is causing a conflict.

E.g., in the log excerpt below, you can see it accessing the version of numpy you have installed locally (/ihome/jhanson/evb32/.local/lib/python3.9/site-packages/numpy/__init__.py) instead of the one in the container (/opt/conda/...).

This happens because Singularity mounts the home folder by default. The easiest way to deal with this is to remove any Python libraries you have installed in ~/.local and re-run.
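(Editor's note, hedged: if you would rather not delete ~/.local, two standard Singularity mechanisms avoid the conflict without touching the host environment. Both flags/prefixes below are part of Singularity itself; the commented-out invocation reuses the variable names from the job template above.)

```shell
# Option 1: tell Python inside the container to ignore the per-user
# site-packages directory. The SINGULARITYENV_ prefix injects the
# variable into the container environment even with -e (--cleanenv).
export SINGULARITYENV_PYTHONNOUSERSITE=1
# singularity run -e "$SIFILE" "$INFOLD" "$OUTFOLD" participant ...

# Option 2: don't bind-mount the host home directory at all, so
# ~/.local is simply not visible inside the container.
# singularity run -e --no-home "$SIFILE" "$INFOLD" "$OUTFOLD" participant ...
```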

Traceback (most recent call last):
  File "/ix1/jhanson/abcd-mproc-release5/output/hipp_subset12/.snakemake/scripts/tmpi6k4ihmq.laplace_coords_withinit.py", line 7, in <module>
    from astropy.convolution import convolve as nan_convolve
  File "/opt/conda/lib/python3.9/site-packages/astropy/convolution/__init__.py", line 4, in <module>
    from .core import *  # noqa
  File "/opt/conda/lib/python3.9/site-packages/astropy/convolution/core.py", line 23, in <module>
    from .utils import (discretize_model, add_kernel_arrays_1D,
  File "/opt/conda/lib/python3.9/site-packages/astropy/convolution/utils.py", line 5, in <module>
    from astropy.modeling.core import FittableModel, custom_model
  File "/opt/conda/lib/python3.9/site-packages/astropy/modeling/__init__.py", line 10, in <module>
    from . import fitting
  File "/opt/conda/lib/python3.9/site-packages/astropy/modeling/fitting.py", line 39, in <module>
    from astropy.units import Quantity
  File "/opt/conda/lib/python3.9/site-packages/astropy/units/__init__.py", line 17, in <module>
    from .quantity import *
  File "/opt/conda/lib/python3.9/site-packages/astropy/units/quantity.py", line 28, in <module>
    from .quantity_helper import (converters_and_unit, can_have_arbitrary_unit,
  File "/opt/conda/lib/python3.9/site-packages/astropy/units/quantity_helper/__init__.py", line 10, in <module>
    from . import helpers, function_helpers
  File "/opt/conda/lib/python3.9/site-packages/astropy/units/quantity_helper/function_helpers.py", line 119, in <module>
    np.asscalar,
  File "/ihome/jhanson/evb32/.local/lib/python3.9/site-packages/numpy/__init__.py", line 320, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'asscalar'
[Wed Mar  6 03:32:54 2024]
Error in rule laplace_coords_hipp:
    jobid: 12166
    input: work/sub-NDARINVGZ0RFPMU/ses-baselineYear1Arm1/anat/sub-NDARINVGZ0RFPMU_ses-baselineYear1Arm1_hemi-L_space-corobl_desc-postproc_dseg.nii.gz, work/sub-NDARINVGZ0RFPMU/ses-baselineYear1Arm1/coords/sub-NDARINVGZ0RFPMU_ses-baselineYear1Arm1_dir-IO_hemi-L_space-corobl_label-hipp_desc-init_coords.nii.gz
    output: work/sub-NDARINVGZ0RFPMU/ses-baselineYear1Arm1/coords/sub-NDARINVGZ0RFPMU_ses-baselineYear1Arm1_dir-IO_hemi-L_space-corobl_label-hipp_desc-laplace_coords.nii.gz
    log: logs/sub-NDARINVGZ0RFPMU/ses-baselineYear1Arm1/sub-NDARINVGZ0RFPMU_ses-baselineYear1Arm1_dir-IO_hemi-L_laplace-hipp.txt (check log file(s) for error details)
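(For context, not from the thread: np.asscalar was deprecated in NumPy 1.16 and removed in 1.23, so the newer numpy in ~/.local no longer has the attribute astropy expects. The stdlib-only sketch below illustrates why the home-directory copy wins the import: Python places the per-user site-packages directory on sys.path ahead of system site-packages at startup.)

```python
# Stdlib-only sketch of the failure mode above (no numpy needed).
# Python's "user site" directory (~/.local/lib/pythonX.Y/site-packages)
# is added to sys.path during interpreter startup, ahead of the system
# site-packages, so `import numpy` inside the container can silently
# resolve to a different numpy on the bind-mounted host home directory.
import site

user_site = site.getusersitepackages()  # e.g. ~/.local/lib/python3.9/site-packages
print("user site dir:    ", user_site)
print("user site enabled:", site.ENABLE_USER_SITE)

# Running `python -s`, or setting PYTHONNOUSERSITE=1, disables the user
# site directory entirely -- which is why stripping ~/.local (or not
# mounting $HOME into the container) resolves the AttributeError.
```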

vnbcs commented 3 months ago

@akhanf This worked! Thank you! Most of my runs seem to have finished without crashing.