Open rohanbanerjee opened 7 months ago
It looks like you are using a new version of my scripts, which no longer needs SCOOP and instead uses ray (ray.io) to parallelize execution.
Thank you for your quick response @vfonov. I am indeed using the version which uses `ray`. What I am afraid of is whether it is compatible with the clusters in Alliance Canada. I did install `ray` on the cluster as mentioned in the above issue. But I also suspect that this error is caused not by `ray` but by `minc-toolkit-v2`. Quick question: have you come across this issue?
Message: b'/tmp/ebuser/avx512/MINCToolkit/1.9.18.1/GCC-9.3.0/minc-toolkit-v2/libminc/volume_io/Prog_utils/print.c:226 (from mivarput1): volume_io error: copy_volume(): copying cached volumes not implemented.\n\n'
I'm asking because this error line appears regardless of whether I use the `scoop` version or the `ray` version, and I think it might be the root of the issue I am facing.
I haven't seen this error message appearing before. When does this happen?
This happens when I use the CSV file below, which contains paths to the normalized, straightened .mnc files and the template mask. For example, one line of my CSV file looks like this:
/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/sub-HarshmanDobby_T2_straight_norm.mnc,/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc
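Before launching a long cluster job, it can help to confirm that every row of such a file has the expected two columns (scan, mask) and that both files exist. The following is only a minimal sketch under that assumption; `check_subjects_csv` is a hypothetical helper and is not part of `nist_mni_pipelines`.

```python
import csv
from pathlib import Path


def check_subjects_csv(csv_path):
    """Return a list of (row_number, problem) tuples for a scan,mask CSV.

    Assumes each non-empty row holds exactly two file paths:
    the subject scan and the template mask.
    """
    problems = []
    with open(csv_path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if not row:
                continue  # ignore blank lines
            if len(row) != 2:
                problems.append(
                    (row_number, f"expected 2 columns, found {len(row)}")
                )
                continue
            for cell in row:
                path = cell.strip()
                if not Path(path).is_file():
                    problems.append((row_number, f"missing file: {path}"))
    return problems
```

An empty list means every listed file was found; anything else points at the row to fix before submitting the job.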
Then I pass this `subjects.csv` to the script generate_template.py and launch it on the cluster. If I use the `scoop` version, the process crashes with the following error message:
OK, it looks like the environment variable `VOLUME_CACHE_THRESHOLD` is set to a value that's smaller than the size of the volumes you are using in template building.
Can you set it to -1, to disable caching completely?
I.e., `export VOLUME_CACHE_THRESHOLD=-1` when you set up your environment.
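If editing the shell setup is inconvenient, the same setting can be applied from a Python launcher. This is a rough sketch, assuming the MINC tools are run as child processes that inherit the environment passed to them; `minc_env` and `run_with_cache_disabled` are hypothetical helper names, not part of any existing package.

```python
import os
import subprocess
import sys


def minc_env(base_env=None):
    """Return a copy of the environment with MINC volume caching disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # -1 is the value suggested above to turn the cache off completely.
    env["VOLUME_CACHE_THRESHOLD"] = "-1"
    return env


def run_with_cache_disabled(command):
    """Run a command (list of arguments) with VOLUME_CACHE_THRESHOLD=-1."""
    return subprocess.run(
        command,
        env=minc_env(),
        check=True,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Harmless demonstration: a child Python process reports the value it sees.
    result = run_with_cache_disabled(
        [sys.executable, "-c",
         "import os; print(os.environ['VOLUME_CACHE_THRESHOLD'])"]
    )
    print(result.stdout.strip())
```

In practice the command would be something like `[sys.executable, "generate_template.py"]`; exporting the variable in the job script achieves the same thing.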
This works perfectly with the `scoop` version, thank you! I am now testing it with the latest version (which uses `Ray`) and will update on whether it works.
Moving this issue (https://github.com/spinalcordtoolbox/template-dog/issues/18) to this repository since it is more relevant here.
The `generate_template` script is dependent on `nist_mni_pipelines`, as described here:

We have been using the SHA `cadc7219e79d6edb90742e1e340f8eee76332006` version of the `nist_mni_pipelines`, which used the `scoop` package for parallelizing. The newer versions (I'm using the commit `608acff75601bf80f79334abc0434bbc0734af0d`) of the `nist_mni_pipelines` use the `ray` package. Now, when I install `ray` with `pip install ray`, the jobs crash and run into the following error:

Error stack:
```
[2024-04-04 07:13:48,381] launcher INFO SCOOP 0.7 2.0 on linux using Python 3.8.10 (default, Jun 16 2021, 14:19:02) [GCC 9.3.0], API: 1013
[2024-04-04 07:13:48,382] launcher INFO Detected SLURM environment.
[2024-04-04 07:13:48,382] launcher INFO Deploying 1 worker(s) over 1 host(s).
[2024-04-04 07:13:48,382] launcher DEBUG Using hostname/ip: "bc11259" as external broker reference.
[2024-04-04 07:13:48,382] launcher DEBUG The python executable to execute the program with is: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python.
[2024-04-04 07:13:48,382] launcher INFO Worker distribution:
[2024-04-04 07:13:48,382] launcher INFO bc11259: 0 + origin
[2024-04-04 07:13:48,816] brokerLaunch (127.0.0.1:36071) DEBUG Local broker launched on ports 36071, 33491.
[2024-04-04 07:13:48,816] launcher (127.0.0.1:36071) DEBUG Initialising local origin worker 1 [bc11259].
[2024-04-04 07:13:48,816] launcher (127.0.0.1:36071) DEBUG bc11259: Launching 'env PYTHONPATH=/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python -m scoop.launch.__main__ 1 3 --size 1 --workingDirectory /lustre04/scratch/rohanb1/dog_template/template --brokerHostname 127.0.0.1 --externalBrokerHostname bc11259 --taskPort 36071 --metaPort 33491 --origin --backend=ZMQ -vvv generate_template_pediatric.py'
Launching 1 worker(s) using /bin/bash.
Executing '['/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python', '-m', 'scoop.bootstrap.__main__', '--size', '1', '--workingDirectory', '/lustre04/scratch/rohanb1/dog_template/template', '--brokerHostname', '127.0.0.1', '--externalBrokerHostname', 'bc11259', '--taskPort', '36071', '--metaPort', '33491', '--origin', '--backend=ZMQ', '-vvv', 'generate_template_pediatric.py']'...
2024-04-04 07:14:35,671 INFO worker.py:1553 -- Started a local Ray instance.
[2024-04-04 07:15:06,066 E 449836 449836] core_worker.cc:191: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
[2024-04-04 07:15:06,132] launcher (127.0.0.1:36071) INFO Root process is done.
[2024-04-04 07:15:06,132] workerLaunch (127.0.0.1:36071) DEBUG Closing workers on bc11259 (1 workers).
[2024-04-04 07:15:06,132] brokerLaunch (127.0.0.1:36071) DEBUG Closing local broker.
[2024-04-04 07:15:06,132] launcher (127.0.0.1:36071) INFO Finished cleaning spawned subprocesses.
```
I did some searching and found a temporary fix for this issue here: https://stackoverflow.com/a/72492737. It did resolve the above error, but the job still crashes; the crash output is attached: slurm-46365536.out.zip
Steps to reproduce this issue:

1. `scratch` folder on Compute Canada and unzip the file
2. `bids_data_final/derivatives/template/subjects.csv` and update the paths

I'm trying to solve this issue on my side, but if anyone has any insights, please share! (Tagging @namgo in case you have any information on this.)
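For the "update the paths" step, anyone reproducing this under a different account could rewrite the path prefix in `subjects.csv` with a few lines of Python rather than by hand. This is only a sketch: `rewrite_prefix` is a hypothetical helper, and the prefixes in the usage note are placeholders.

```python
import csv


def rewrite_prefix(in_csv, out_csv, old_prefix, new_prefix):
    """Copy in_csv to out_csv, replacing old_prefix at the start of each cell.

    in_csv and out_csv must be different files, because out_csv is
    truncated as soon as it is opened. Returns the number of cells changed.
    """
    changed = 0
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            new_row = []
            for cell in row:
                if cell.startswith(old_prefix):
                    cell = new_prefix + cell[len(old_prefix):]
                    changed += 1
                new_row.append(cell)
            writer.writerow(new_row)
    return changed
```

For example, `rewrite_prefix("subjects.csv", "subjects_local.csv", "/home/rohanb1/scratch", "/home/<your-user>/scratch")` would write a copy whose paths point at your own scratch space.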