scilons / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Cannot set up environment on server to run code #1

Closed by ryabhmd 1 month ago

ryabhmd commented 3 months ago

To run scilons_pipeline.py, I've been trying to build an image on Slurm and install the datatrove[all] package (as per the instructions in the README). I've tried re-using several images from /netscratch/enroot (e.g. python+3.10.4-bullseye.sqsh, ubuntu20+conda.sqsh) and then installing the packages on top, but I always end up with incompatibility issues among the installed packages, which leaves the image unusable.

E.g. when I build on ubuntu20+conda.sqsh and install the datatrove library I get:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. conda-repo-cli 1.0.75 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.

However, the required version of requests is incompatible with the datasets package. Once I save the image and use it to run the code, it cannot find any of the modules.

Any ideas on how to build an image to run the code? Maybe I need to use another image to install the package in?

malteos commented 3 months ago

Do you need requests for anything in the pipeline? My best guess is that you can simply ignore this error message.

You can also use one of my images: /netscratch/mostendorff/enroot/malteos_eulm_podman.sqsh

It has datatrove==0.2.0 installed.
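
To double-check inside the container, a quick version check (standard library only) could look like this:

```python
# Print the installed datatrove version; inside the image above this
# should report 0.2.0.
from importlib.metadata import version

print(version("datatrove"))
```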

ryabhmd commented 3 months ago

Thanks! Your image works. :) However, when the pipeline reaches the Slurm execution part and tries to launch a job from within the script, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'
srun: error: serv-3317: task 0: Exited with exit code 1

I tried to look at similar issues (e.g. this one) but they didn't solve the issue. Any ideas?

malteos commented 3 months ago

Slurm commands are not available within a containerized compute job. See https://github.com/scilons/datatrove/blob/main/src/datatrove/executor/slurm.py#L35

You need to start the Slurm pipeline from a login node or rewrite it to use a local execution pipeline.
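
For the local option, something along these lines should work (a rough sketch against the datatrove 0.2.0 executor API; the reader/writer paths and pipeline steps are placeholders, not the actual scilons_pipeline.py contents):

```python
# Rough sketch: run the pipeline with the local executor instead of Slurm.
# Paths and pipeline steps below are placeholders.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("input_data/"),   # placeholder input folder
        # ... the processing blocks from scilons_pipeline.py go here ...
        JsonlWriter("output_data/"),  # placeholder output folder
    ],
    tasks=4,                          # number of data shards
    workers=4,                        # parallel local processes
    logging_dir="logs/local_run",
)
executor.run()
```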

lfoppiano commented 1 month ago

The local pipeline works fine and can be run with an interactive job. I'm wondering: if we want to use the Slurm executor, should I create the environment directly on the login node and run it there?
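
For reference, launching the Slurm executor from a login node would look roughly like this (a sketch based on the datatrove 0.2.0 executor API; partition, time limit and paths are placeholders):

```python
# Rough sketch: submitting the same pipeline through Slurm from a login node,
# where sbatch is on PATH. Partition, time limit and paths are placeholders.
from datatrove.executor import SlurmPipelineExecutor

executor = SlurmPipelineExecutor(
    pipeline=[
        # ... same pipeline blocks as in scilons_pipeline.py ...
    ],
    tasks=16,                      # number of Slurm array tasks
    time="04:00:00",               # per-task time limit (placeholder)
    partition="batch",             # cluster partition name (placeholder)
    logging_dir="logs/slurm_run",
    job_name="scilons_pipeline",
)
executor.run()                     # submits the jobs via sbatch
```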

lfoppiano commented 1 month ago

I've installed mamba, set up a virtual environment, and ran the pipeline from there. I'm closing this; feel free to let me know if you have further questions.