Open juripapay opened 1 year ago
Yes I agree this is an issue. As discussed I think there are two options:
I spotted an issue with this on stemdl as well. I think the latest version of PyTorch was causing the problem, and dropping back to <2.0 fixed it.
I did some investigation with micromamba yesterday evening, and I think we could install our own copy in the sciml-bench folder, then create a new env on install for each benchmark. The tricky part is not letting conda take over the user's bash env, as they might have their own conda install and environments.
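One way to keep a private micromamba from touching the user's shell setup is to skip `micromamba shell init` entirely and pass the root prefix per command, running benchmarks through `micromamba run` rather than activating anything. The following is only a sketch under assumptions: the `.micromamba` folder name, the `SCIML_BENCH_DIR` and `MICROMAMBA` variables, and both helper names are hypothetical, and the micromamba binary is assumed to be already available.

```shell
#!/bin/bash
# Hypothetical layout: a private micromamba root inside the sciml-bench folder.
SCIML_BENCH_DIR="${SCIML_BENCH_DIR:-$PWD}"
SCIML_BENCH_MAMBA_ROOT="${SCIML_BENCH_DIR}/.micromamba"

# Overridable so the commands can be previewed, e.g. MICROMAMBA=echo.
MICROMAMBA="${MICROMAMBA:-micromamba}"

# Create one environment per benchmark under the private root prefix.
# MAMBA_ROOT_PREFIX is set only for this command, not exported to the shell.
create_bench_env() {
    local benchmark="$1" python_version="${2:-3.9}"
    MAMBA_ROOT_PREFIX="$SCIML_BENCH_MAMBA_ROOT" "$MICROMAMBA" create -y \
        -n "sciml-bench-${benchmark}" -c conda-forge "python=${python_version}"
}

# Run a command inside a benchmark environment without activating it,
# so ~/.bashrc and the user's own conda environments are left alone.
run_in_bench_env() {
    local benchmark="$1"
    shift
    MAMBA_ROOT_PREFIX="$SCIML_BENCH_MAMBA_ROOT" "$MICROMAMBA" run \
        -n "sciml-bench-${benchmark}" "$@"
}
```

With this in place, `create_bench_env mnist_tf_keras` followed by `run_in_bench_env mnist_tf_keras python -m pip install .` would mirror the conda-based scripts while keeping everything under the sciml-bench folder.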
Keeping track of benchmark requirements: here is a script for correctly installing tensorflow & requirements for the mnist benchmark. Other tensorflow benchmarks (optics, cloud, etc.) will be similar and will mostly differ in the last line.
#!/bin/bash
set -x
# Create new environment
ENV_NAME=sciml-bench-mnist_tf_keras
conda remove -n $ENV_NAME --all -y --quiet
conda create -n $ENV_NAME python=3.9 -y --quiet
ENV_PATH=$(conda info --base)/envs/$ENV_NAME
# Install conda requirements
conda install -n $ENV_NAME -c conda-forge cudatoolkit=11.2.2 cudnn=8.1.0 -y --quiet
conda install -n $ENV_NAME -c nvidia cuda-nvcc=11.3.58 -y --quiet
# Configure environment variables
mkdir -p $ENV_PATH/etc/conda/activate.d
echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:${ENV_PATH}/lib/" >> $ENV_PATH/etc/conda/activate.d/env_vars.sh
echo "export XLA_FLAGS='--xla_gpu_cuda_data_dir=${ENV_PATH}/lib'" >> $ENV_PATH/etc/conda/activate.d/env_vars.sh
# Work around for Ubuntu 22.04. See: https://www.tensorflow.org/install/pip
mkdir -p $ENV_PATH/lib/nvvm/libdevice
cp $ENV_PATH/lib/libdevice.10.bc $ENV_PATH/lib/nvvm/libdevice/
# Install pip requirements
conda run -n $ENV_NAME LD_LIBRARY_PATH=$ENV_PATH/lib/ python -m pip install --upgrade pip -q
conda run -n $ENV_NAME LD_LIBRARY_PATH=$ENV_PATH/lib/ python -m pip install . -q
conda run -n $ENV_NAME LD_LIBRARY_PATH=$ENV_PATH/lib/ python -m pip install "tensorflow==2.11.*" scikit-image -q
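A quick sanity check after running the script is to confirm that the activation hook was written and exports both variables. This is a minimal sketch, assuming the hook lives at etc/conda/activate.d/env_vars.sh as in the script above; `check_activation_hook` is a hypothetical helper name, not part of sciml-bench.

```shell
#!/bin/bash
# Check that an environment's activation hook exists and exports the two
# variables the install script writes. Returns 0 only if both are present.
check_activation_hook() {
    local env_path="$1"
    local hook="${env_path}/etc/conda/activate.d/env_vars.sh"
    [ -f "$hook" ] || { echo "missing: $hook" >&2; return 1; }
    grep -q '^export LD_LIBRARY_PATH=' "$hook" \
        || { echo "LD_LIBRARY_PATH not exported in $hook" >&2; return 1; }
    grep -q '^export XLA_FLAGS=' "$hook" \
        || { echo "XLA_FLAGS not exported in $hook" >&2; return 1; }
    echo "ok: $hook"
}
```

For example, `check_activation_hook "$ENV_PATH"` prints the hook path when both exports are present and returns non-zero otherwise.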
And here's the script for stemdl (and PyTorch). It also pins pytorch_lightning to the correct version.
#!/bin/bash
set -e
set -x
# Create new environment
ENV_NAME=sciml-bench-stemdl_classification
conda remove -n $ENV_NAME --all -y --quiet || true  # do not abort under set -e if the env does not exist yet
conda create -n $ENV_NAME python=3.9 -y --quiet
ENV_PATH=$(conda info --base)/envs/$ENV_NAME
# Install conda requirements
conda install -n $ENV_NAME -y --quiet pytorch==1.13.1 torchvision==0.14.1 pytorch-cuda=11.6 -c pytorch -c nvidia
# Install pip requirements
conda run -n $ENV_NAME python -m pip install -q --upgrade pip
conda run -n $ENV_NAME python -m pip install -q "pytorch_lightning==1.9.*" scikit-learn tensorboard
conda run -n $ENV_NAME python -m pip install -q .
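Since these benchmarks break when a dependency drifts (for example PyTorch moving to 2.0), it may be worth checking the installed versions against the pins after the install. A minimal sketch using a shell glob match; `version_matches` is a hypothetical helper name, and the patterns follow the same `1.9.*` style used in the pip requirement above.

```shell
#!/bin/bash
# Return 0 if an installed version string matches a glob-style pin
# such as "1.9.*", and 1 otherwise.
version_matches() {
    local installed="$1" pattern="$2"
    case "$installed" in
        $pattern) return 0 ;;   # unquoted on purpose: treated as a glob
        *) return 1 ;;
    esac
}
```

It could be used as `version_matches "$(conda run -n $ENV_NAME python -c 'import pytorch_lightning as pl; print(pl.__version__)')" "1.9.*"`, failing the install script if the pin was not honoured.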
I started capturing install scripts for each environment in dev/install_scripts/*.sh. They are quite useful for testing and will serve as documentation of the dependencies for whatever refactoring solution we design in future.
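For testing, the captured scripts could be run in one pass with a small driver. This is only a sketch: `run_install_scripts` is a hypothetical helper, and writing each script's output to a `.log` file next to it is an assumption made for illustration.

```shell
#!/bin/bash
# Run every install script in a directory and report PASS/FAIL per script.
# Returns non-zero if any script failed.
run_install_scripts() {
    local script_dir="${1:-dev/install_scripts}"
    local failed=0 script
    for script in "$script_dir"/*.sh; do
        [ -e "$script" ] || continue   # directory contains no .sh files
        # Keep each script's output in a log file beside the script.
        if bash "$script" > "${script%.sh}.log" 2>&1; then
            echo "PASS $(basename "$script")"
        else
            echo "FAIL $(basename "$script")"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `run_install_scripts` from the repository root uses dev/install_scripts by default; a different directory can be passed as the first argument.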
We need to think about how the framework can install isolated, application-specific conda environments. This issue came up with the Hydronet benchmark, which is very sensitive to library versions: if we don't install specific versions of the libraries, it will not work. The problem is that these dependencies might conflict with previously installed libraries, so it would be better to create a dedicated environment just for running Hydronet.
The Hydronet dependencies can be installed with the following commands:
1) Create the conda environment: conda create --name hydronet2 python=3.8
2) Activate the conda environment: conda activate hydronet2
3) Install PyTorch: conda install pytorch==1.12.0 cudatoolkit=11.3 -c pytorch -c conda-forge
4) Install PyG: conda install pyg -c pyg
5) Install the remaining conda packages: conda install -c conda-forge tensorboard ase fair-research-login h5py tqdm
6) Install gdown: conda install -c conda-forge gdown
7) Install the PyG extension packages with pip: pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.12.0+cu113.html
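The steps above can be collected into a non-interactive script in the same style as the other install scripts. This is a sketch rather than a tested installer: it uses `-n` and `conda run` instead of activating the environment, and the overridable `CONDA` variable and the `install_hydronet_env` function name are assumptions added for illustration.

```shell
#!/bin/bash
# Overridable so the commands can be previewed without conda, e.g. CONDA=echo.
CONDA="${CONDA:-conda}"
ENV_NAME="${ENV_NAME:-hydronet2}"

# Install the Hydronet dependencies into a dedicated environment.
# Each step runs only if the previous one succeeded.
install_hydronet_env() {
    "$CONDA" create -n "$ENV_NAME" python=3.8 -y --quiet &&
    "$CONDA" install -n "$ENV_NAME" -y --quiet pytorch==1.12.0 cudatoolkit=11.3 -c pytorch -c conda-forge &&
    "$CONDA" install -n "$ENV_NAME" -y --quiet pyg -c pyg &&
    "$CONDA" install -n "$ENV_NAME" -y --quiet -c conda-forge tensorboard ase fair-research-login h5py tqdm &&
    "$CONDA" install -n "$ENV_NAME" -y --quiet -c conda-forge gdown &&
    # pip runs inside the environment without activating it.
    "$CONDA" run -n "$ENV_NAME" python -m pip install -q \
        torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric \
        -f https://data.pyg.org/whl/torch-1.12.0+cu113.html
}
```

Running `CONDA=echo` before calling `install_hydronet_env` prints the six commands instead of executing them, which is a cheap way to review the pins before touching any environment.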