Hi Linh, not as of now, but feel free to open a PR to add support; it shouldn't take too long. There was no need for multi-GPU support in my setup: training on pubchem23 runs in under 2 days on a single A100.
The trainable models aren't too large, since only an "adapter" is trained; the large models are used as frozen encoders and are only run once to embed the inputs.
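To illustrate the pattern, here is a minimal sketch of training a small adapter on top of a frozen encoder, with placeholder dimensions and a dummy loss; it is not the actual repo code:

```python
import torch
import torch.nn as nn

class AdapterHead(nn.Module):
    """Small trainable projection on top of frozen embeddings (illustrative only)."""
    def __init__(self, enc_dim: int, out_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(enc_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Placeholder for the real pretrained encoder, used only as a frozen feature extractor.
frozen_encoder = nn.Linear(1024, 768)
for p in frozen_encoder.parameters():
    p.requires_grad = False

adapter = AdapterHead(enc_dim=768, out_dim=256)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)  # only adapter params are optimized

x = torch.randn(8, 1024)            # dummy input batch
with torch.no_grad():               # embeddings can be computed once (or cached) without gradients
    emb = frozen_encoder(x)

loss = adapter(emb).pow(2).mean()   # dummy loss, stands in for the real training objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Since the encoder never receives gradients, its embeddings can also be precomputed and cached, which keeps the per-step cost small.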
Best, Philipp
Just checked; torch '2.0.1+cu117' works fine for me for training
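If you want to double-check your own environment, plain torch introspection is enough:

```python
import torch

print(torch.__version__)          # e.g. '2.0.1+cu117'
print(torch.version.cuda)         # CUDA version torch was built against
print(torch.cuda.is_available())  # should be True on the training machine
```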
Hi @phseidl Philipp, Thank you for answering my question. It's good to know you used an adapter for downstream finetuning on the PubChem23 dataset.
Since I am struggling to preprocess the PubChem23 dataset, would you be able to share your preprocessed PubChem23 data with me?
I have a follow-up question: do you plan to push the models to the Hugging Face Hub? It would be great if you could publish your results and source code there. Sometimes I encounter errors with the mlflow package; in my opinion, using wandb to monitor everything would work fine.
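Switching the logging over would only take a few lines; a rough sketch with made-up project and metric names:

```python
import wandb

# Hypothetical project/metric names, just to show the logging pattern.
wandb.init(project="pubchem23-training", config={"lr": 1e-4, "batch_size": 256})
for step in range(100):
    loss = 1.0 / (step + 1)            # placeholder for the real training loss
    wandb.log({"train/loss": loss}, step=step)
wandb.finish()
```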
Best, Linh
Hi @linhduongtuan, sorry for the late response; I have added a reproducible way to download the pubchem23 dataset:
```bash
wget -N -r https://cloud.ml.jku.at/s/fi83oGMN2KTbsNQ/download -O pubchem23.zip
unzip pubchem23.zip
rm pubchem23.zip
```
(added it to ./data/pubchem.md)
Completely agree. I'm working on a new project at the moment, so it's not a priority, but I would like to add the models to the HF Hub.
Best, Philipp
Hi Philipp,
Do you have a PyTorch v2 training script with multi-GPU support (e.g. DistributedDataParallel, roughly as sketched below)? If so, would you be able to share it with me?
As far as I know, your arXiv paper states that the total compute runtime was around 170 days over roughly 800 runs (without linear probing). I am wondering why you used only one GPU to train such a large model.
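For reference, the kind of multi-GPU setup I have in mind is roughly the following: a hypothetical torchrun/DistributedDataParallel sketch with a placeholder model, not your actual script:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Launched with: torchrun --nproc_per_node=<NUM_GPUS> train_ddp.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(768, 256).cuda(local_rank)   # placeholder for the real (adapter) model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                            # placeholder training loop
        x = torch.randn(32, 768, device=local_rank)
        loss = model(x).pow(2).mean()              # dummy loss
        optimizer.zero_grad()
        loss.backward()                            # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```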
Have a nice weekend. Linh