texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0
435 stars 87 forks

Train in multi-node multi-card environment #110

Open Atlantic8 opened 3 months ago

Atlantic8 commented 3 months ago

Can I use Tevatron to train models in a multi-node, multi-card environment? If so, could you please share example scripts showing how to launch the job? Thank you.

MXueguang commented 3 months ago

Hi @Atlantic8, for the PyTorch implementation, unfortunately we haven't had a chance to run and test it in a multi-node environment yet.

luyug commented 3 months ago

To add to that: for JAX, it really depends on your cluster. On Cloud TPU, for example:

gcloud compute tpus tpu-vm ssh YOUR_TPU_NAME \
    --zone=us-central2-b \
    --worker=all \
    --command="python -m tevatron.tevax.experimental.mp.train ..."

Adapt this to whatever launch script fits your cluster configuration.
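As a rough sketch of what a small wrapper around that command might look like: the helper below only prints the gcloud line so it can be reviewed before anything is launched. The function name `build_tpu_launch_cmd` and the dry-run approach are illustrative assumptions and are not part of tevatron. The TPU name, zone, and training arguments are placeholders to fill in for your own setup.

```shell
#!/bin/sh
# Hypothetical helper (not part of tevatron): prints the gcloud launch
# command instead of running it, so it can be inspected first.
build_tpu_launch_cmd() {
    tpu_name="$1"   # name of the TPU VM (placeholder)
    zone="$2"       # zone the TPU VM lives in
    train_cmd="$3"  # command to run on every worker
    printf 'gcloud compute tpus tpu-vm ssh %s --zone=%s --worker=all --command="%s"\n' \
        "$tpu_name" "$zone" "$train_cmd"
}

# Dry run: print the command for review; the training arguments are
# left elided here, as in the comment above.
build_tpu_launch_cmd YOUR_TPU_NAME us-central2-b \
    "python -m tevatron.tevax.experimental.mp.train ..."
```

Once the printed line looks right, it can be pasted into a terminal or piped to `sh`. Note that this simple version does not escape double quotes inside the training command, so arguments containing them would need extra care.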