stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.
Apache License 2.0

Fork preprocessing when doing multiple GPUs #138

Closed dlwh closed 2 years ago

dlwh commented 2 years ago

Fixes #117

So, as hypothesized in #126 by @jthickstun, parallelized preprocessing was leading to major problems, particularly as one added more GPUs. To mitigate this, I've now made it so that preprocessing is forked into a single separate process per machine (I'm assuming we're using local file systems for the caches).
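
Roughly, the intent is the same as the common "only local rank 0 preprocesses, everyone else waits at a barrier" pattern. A minimal sketch of that pattern (illustrative names only, not the actual mistral code; it assumes `torch.distributed` is already initialized and that the 🤗 Datasets cache sits on each machine's local filesystem):

```python
import os

import torch.distributed as dist
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")


def tokenize_fn(batch):
    # Module-level function so 🤗 Datasets can fingerprint it and reuse the on-disk cache.
    return tokenizer(batch["text"])


def build_dataset(cache_dir):
    """Preprocess once per machine; the other GPU processes wait and then read the cache."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    if local_rank == 0:
        # The first process on this machine tokenizes and writes the Arrow cache to local disk.
        dataset = load_dataset("wikitext", "wikitext-103-raw-v1", cache_dir=cache_dir)
        dataset = dataset.map(tokenize_fn, batched=True)

    # Everyone blocks here until local rank 0 has finished writing the cache.
    dist.barrier()

    if local_rank != 0:
        # The remaining ranks now hit the warm cache instead of re-running the expensive map.
        dataset = load_dataset("wikitext", "wikitext-103-raw-v1", cache_dir=cache_dir)
        dataset = dataset.map(tokenize_fn, batched=True)

    return dataset
```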

dlwh commented 2 years ago

@J38 ping

J38 commented 2 years ago

In terms of the code, this seems fine to me if we want to go with this approach!

J38 commented 2 years ago

I will say I have one quibble with this approach. I'm concerned (but not outright opposed) about having the data preprocessing live inside the deepspeed launcher call, and about the general complexity of the inter-process communication, when all we really want is "don't run a bunch of data preprocessing calls at once." We may wish we had opted for a simpler flow: 1. preprocess the data, 2. run deepspeed for the training. It's possible this will be fine and we'll never worry about it again, but it's also possible we'll end up with a bunch of weird hangs because of some bug/issue in the process communication and wish we had the simpler setup. There is something to be said for a setup that is basically: 1. run the data preprocessing without deepspeed, 2. use the deepspeed launcher just for the GPU training stuff, roughly like the sketch below.
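
Purely as a sketch of that two-phase shape (the `preprocess.py` name, dataset, and path are hypothetical, not anything in this repo): phase 1 is a plain Python script that tokenizes and writes the dataset to local disk with no launcher involved, and phase 2 is the usual deepspeed launch of the training entry point, which just reads what phase 1 wrote.

```python
# preprocess.py -- hypothetical phase-1 script: no deepspeed launcher, no torch.distributed.
from datasets import load_dataset
from transformers import AutoTokenizer


def main(out_dir="/scratch/mistral-tokenized"):
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

    def tokenize_fn(batch):
        return tokenizer(batch["text"])

    # Tokenize once, up front, and write the result to local disk; the training job
    # launched in phase 2 only has to call `datasets.load_from_disk(out_dir)`.
    dataset.map(tokenize_fn, batched=True).save_to_disk(out_dir)


if __name__ == "__main__":
    main()
```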

dlwh commented 2 years ago

I think that's reasonable. I'm a bit wary of all that just because so much of the way people are told/trained to use deepspeed (and torch distributed launch) is to run the launcher script at the top level, and I'm not sure it will behave if we do an internal fork. If we want to explicitly have two phases you have to run, that's fine too (and it's pretty trivial to support with the new code).