Open Palisand opened 1 year ago
Here is what I ran in the VM, up to and including the unit test run:
sudo apt update
sudo apt install -y wget git
# miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
bash Miniconda3-py39_4.12.0-Linux-x86_64.sh
source .bashrc
# COLMAP - https://colmap.github.io/install.html
sudo apt install -y \
cmake \
build-essential \
libboost-program-options-dev \
libboost-filesystem-dev \
libboost-graph-dev \
libboost-system-dev \
libboost-test-dev \
libeigen3-dev \
libsuitesparse-dev \
libfreeimage-dev \
libmetis-dev \
libgoogle-glog-dev \
libgflags-dev \
libglew-dev \
qtbase5-dev \
libqt5opengl5-dev \
libcgal-dev
sudo apt install -y libatlas-base-dev libsuitesparse-dev
git clone https://ceres-solver.googlesource.com/ceres-solver
cd ceres-solver
git checkout $(git describe --tags) # Checkout the latest release
mkdir build
cd build
cmake .. -DBUILD_TESTING=OFF -DBUILD_EXAMPLES=OFF
make -j
sudo make install
cd ~
git clone https://github.com/colmap/colmap.git
cd colmap
git checkout dev
mkdir build
cd build
cmake ..
make -j
sudo make install
colmap -h
## Install & configure MultiNERF
cd ~
git clone https://github.com/google-research/multinerf.git
cd multinerf
conda create --name multinerf python=3.9
conda activate multinerf
conda install pip
pip install --upgrade pip
pip install -r requirements.txt
pip install tensorflow==2.9.1 # match TPU software version
git clone https://github.com/rmbrualla/pycolmap.git ./internal/pycolmap
./scripts/run_all_unit_tests.sh
How are you running this on a Google TPU? We train our models on Google TPUs, but through the internal interface, which differs from the publicly available one. I don't think this code has been run through the external interface yet. Have you verified that you can run other models on the TPUs you're using? The issue seems to be at a lower level than this codebase, perhaps a jax/cuda/driver issue.
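One quick way to narrow this down is to ask JAX which backend it initialized before running any MultiNeRF code. The sketch below is an assumption about how one might check this, not something taken from the thread: `jax_backend` is a hypothetical helper name, and it relies only on `jax.default_backend()`. On a working TPU VM it would be expected to print `tpu`; `cpu` or an error marker would suggest the problem sits below this codebase.

```shell
# Hypothetical helper: print the backend JAX initialized, or a marker if it cannot.
# Set PYTHON to choose an interpreter (defaults to python3).
jax_backend() {
  py="${PYTHON:-python3}"
  if ! command -v "$py" >/dev/null 2>&1; then
    echo "python-missing"
    return 0
  fi
  "$py" - <<'EOF'
try:
    import jax
    print(jax.default_backend())
except Exception as exc:
    # Covers a missing jax install as well as backend initialization failures.
    print("jax-error: " + type(exc).__name__)
EOF
}

jax_backend
```

Running it inside the activated `multinerf` conda environment checks the same interpreter and packages that the unit tests use.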
Ah, I see. I am using the publicly available interface, following Google's Cloud TPU documentation. I haven't verified other models.
To create the TPU VM, I ran:
gcloud config set project multinerf
gcloud services enable tpu.googleapis.com
gcloud beta services identity create --service tpu.googleapis.com
gcloud alpha compute tpus tpu-vm create tpu-multinerf --zone us-central1-b --accelerator-type v3-8 --version tpu-vm-tf-2.9.1
I then SSHed into the VM:
gcloud alpha compute tpus tpu-vm ssh tpu-multinerf --zone us-central1-b
And ran the aforementioned commands.
Before using the TPU VM, I tested these commands locally, in a Docker container running Ubuntu 20.04 (just like the VM). The tests succeeded in the container.
I tried again from scratch. This time, I removed `jax`, `jaxlib`, and `tensorflow` from `requirements.txt` and then ran:
pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install tensorflow==2.9.1
pip install -r requirements.txt
For `jax`: https://github.com/google/jax/#pip-installation-google-cloud-tpu
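The edit to `requirements.txt` described above can also be scripted. This is a minimal sketch, assuming each requirement sits on its own line; `strip_requirements` is a hypothetical helper, not part of MultiNeRF. It drops only the exact package names listed, so a package such as `tensorflow-io-gcs-filesystem` would be kept.

```shell
# Hypothetical helper: print a requirements file without jax, jaxlib and tensorflow.
# A line is dropped when the package name is followed by end of line or by a
# character that cannot be part of a package name (e.g. "=", ">", "[", space).
strip_requirements() {
  grep -Ev '^(jax|jaxlib|tensorflow)([^A-Za-z0-9_.-].*)?$' "$1"
}

# Example usage (writes a filtered copy next to the original):
# strip_requirements requirements.txt > requirements.tpu.txt
# pip install -r requirements.tpu.txt
```

Writing to a separate file (the name `requirements.tpu.txt` is illustrative) leaves the repository's own `requirements.txt` untouched.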
TPU type: v3-8
TPU software version: tpu-vm-tf-2.9.1
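The TPU software version string embeds the TensorFlow release it was built for, so the installed `tensorflow` pin can be compared against it. A minimal sketch, assuming the image name follows the `tpu-vm-tf-<version>` pattern seen here and that packages are pinned with `==` in `pip freeze` output; both helper names are hypothetical.

```shell
# Hypothetical helpers for comparing a TPU image name with a pip pin.

# tpu-vm-tf-2.9.1 -> 2.9.1 (assumes the "tpu-vm-tf-" prefix)
tf_version_from_image() {
  printf '%s\n' "${1#tpu-vm-tf-}"
}

# Read `pip freeze` output on stdin and print the pinned version of package $1.
# Exact name match, so "tensorflow-estimator" is not mistaken for "tensorflow".
pinned_version() {
  awk -F'==' -v pkg="$1" '$1 == pkg { print $2 }'
}

# Example usage:
# want="$(tf_version_from_image tpu-vm-tf-2.9.1)"
# have="$(pip freeze | pinned_version tensorflow)"
# [ "$want" = "$have" ] && echo "match" || echo "mismatch: $want vs $have"
```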
Installed pip packages (`pip freeze` output):
```
absl-py==1.2.0
asttokens==2.0.8
astunparse==1.6.3
backcall==0.2.0
cachetools==5.2.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.1
chex==0.1.4
colorama==0.4.5
commonmark==0.9.1
cycler==0.11.0
decorator==5.1.1
dm-pix==0.3.3
dm-tree==0.1.7
etils==0.7.1
executing==1.0.0
flatbuffers==1.12
flax==0.6.0
fonttools==4.37.1
gast==0.4.0
gin-config==0.5.0
google-auth==2.11.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.48.1
h5py==3.7.0
idna==3.3
importlib-metadata==4.12.0
importlib-resources==5.9.0
ipython==8.5.0
jax==0.3.17
jaxlib==0.3.15
jedi==0.18.1
keras==2.9.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
libclang==14.0.6
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.3
matplotlib-inline==0.1.6
mediapy==1.1.0
msgpack==1.0.4
numpy==1.23.3
oauthlib==3.2.1
opencv-python==4.6.0.66
opt-einsum==3.3.0
optax==0.1.3
packaging==21.3
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
prompt-toolkit==3.0.31
protobuf==3.19.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
Pygments==2.13.0
pyparsing==3.0.9
python-dateutil==2.8.2
PyYAML==6.0
rawpy==0.17.2
requests==2.28.1
requests-oauthlib==1.3.1
rich==11.2.0
rsa==4.9
scipy==1.9.1
six==1.16.0
stack-data==0.5.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.27.0
termcolor==2.0.0
toolz==0.12.0
traitlets==5.3.0
typing_extensions==4.3.0
urllib3==1.26.12
wcwidth==0.2.5
Werkzeug==2.2.2
wrapt==1.14.1
zipp==3.8.1
```
Note the `tensorflow` version matches the TPU software version (2.9.1).
The test failures:
Training errors: