key4hep / key4hep-spack

A Spack recipe repository for Key4hep software.

Possible clash with machine learning tool #568

Closed: Silence2107 closed this issue 5 months ago

Silence2107 commented 7 months ago

From lxplus (checked on a few nodes, including node 978) I run source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh and attempt to run a simple MNIST training with PyTorch. Once I run the training script, I receive what looks like a C++ error (see output).

I can say with some confidence that this occurs at the choice of the optimizer (i.e. the training data and the PyTorch import are seemingly in order), and since the error complains about onnx already existing, I suspect you can help me sort this out.

PS: to reproduce on lxplus with AFS/CVMFS access:

source /cvmfs/sw-nightlies.hsf.org/key4hep/setup.sh
cd /tmp/`whoami`
mkdir testThis
cd testThis
cp /afs/cern.ch/user/p/ppanasiu/public/temp/temp.py .
cp /afs/cern.ch/user/p/ppanasiu/public/temp/mnist_train.csv .
python temp.py
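
For context, the failing script is essentially a plain PyTorch MNIST training loop over the CSV dump. The sketch below is a hypothetical approximation of that kind of script, not the actual temp.py (which lives in the AFS path above); the file layout, model, and hyperparameters are assumptions.

```python
# Hypothetical approximation of an MNIST training script like temp.py.
# Assumes a Kaggle-style mnist_train.csv: first column is the label,
# the remaining 784 columns are pixel values.
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# skiprows=1 assumes a header row; drop it if the CSV has none
data = np.loadtxt("mnist_train.csv", delimiter=",", skiprows=1, dtype=np.float32)
labels = torch.tensor(data[:, 0], dtype=torch.long)
images = torch.tensor(data[:, 1:] / 255.0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # consistent with the "cuda" line in the output further down

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
criterion = nn.CrossEntropyLoss()
# the reported failure happens around the optimizer choice in the real script
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)
for epoch in range(2):
    running_loss = 0.0
    for i, (x, y) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if (i + 1) % 100 == 0:  # "[epoch, batch] loss: ..." as in the output below
            print(f"[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 100:.3f}")
            running_loss = 0.0
```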
tmadlener commented 7 months ago

Thanks for reporting this. So far we have not really tested running training with the torch and ONNX installations that come with a Key4hep stack; we have used them to run inference from C++ programs linking against the corresponding libraries.

We also currently don't build these libraries with any form of GPU support, only CPU. From a quick look at your temp.py it doesn't look like you explicitly request either CPU or GPU, so pytorch might default to something unsupported (although I really doubt that).
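
For reference, pinning the device explicitly (rather than relying on auto-detection) and printing what the build reports would look roughly like this; an illustrative snippet, not taken from temp.py:

```python
import torch

# Quick diagnostic: the Key4hep torch build is CPU-only, so the CUDA
# version should be None and availability False in that environment.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

# Pin the device explicitly instead of relying on auto-detection.
device = torch.device("cpu")
```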

Are you able to run this example with another installation of pytorch? E.g. with /cvmfs/sft.cern.ch/lcg/views/LCG_105/105/x86_64-el9-gcc12-opt/setup.sh?

Silence2107 commented 7 months ago

Hi, thanks for the explanation.

Indeed, this example runs under the LCG setup you provided (haven't tried the GPU yet, but to be confirmed). We are wondering, however, whether we can do anything at the moment (a local/global hotfix or otherwise) to perform the training under Key4hep? If you have any ideas, I would like to hear them.

Thank you for your time.

tmadlener commented 7 months ago

Thanks for testing / confirming. Unfortunately, I don't really have a quick solution for this. It is also not entirely clear to me yet what is going wrong. We will have to investigate.

andresailer commented 7 months ago

> (haven't tried the GPU yet, but to be confirmed)

I gather from @BrieucF's message on Mattermost that you found that the LCG stack advertised above doesn't work on the GPU.

Use the CUDA stack from LCG:

source /cvmfs/sft.cern.ch/lcg/views/LCG_105a_cuda/x86_64-el9-gcc11-opt/setup.sh                                                                                                                          
cd /tmp/`whoami`
mkdir testTorch
cd testTorch
cp /afs/cern.ch/user/p/ppanasiu/public/temp/temp.py .
cp /afs/cern.ch/user/p/ppanasiu/public/temp/mnist_train.csv .
python temp.py

Gives

cuda
[1,   100] loss: 0.780
[1,   200] loss: 0.220
[1,   300] loss: 0.146
[1,   400] loss: 0.106
[snip]
jmcarcell commented 5 months ago

Hi @Silence2107. This has been fixed in the current nightlies (or at least I can run the training on Alma 9 now). I have no idea what was going wrong, but I was able to reproduce it. PyTorch was updated from 2.2.1 to 2.3.0, which may have helped somehow :shrug:. I'll add a test to make sure these very simple trainings keep working with pytorch in the future.
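
A minimal smoke test along those lines could look like the following; this is only a sketch of the idea, not the actual test added to the recipes:

```python
# Minimal PyTorch training smoke test: a few optimizer steps on random data,
# checking that training runs without crashing and the loss decreases.
import torch
import torch.nn as nn


def test_simple_training():
    torch.manual_seed(0)
    x = torch.randn(256, 784)
    y = torch.randint(0, 10, (256,))
    model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    first_loss = None
    for _ in range(20):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        if first_loss is None:
            first_loss = loss.item()
    # training should run to completion and the loss should go down
    assert loss.item() < first_loss


if __name__ == "__main__":
    test_simple_training()
    print("ok")
```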