Closed · dlwh closed this 2 years ago
@J38 ping
In terms of the code this seems fine to me if we want to go with this approach!
I will say I have one quibble with this approach ... I am concerned (but not outright opposed) about having the data pre-processing happen inside the deepspeed launcher call, and about the general complexity of communication between the various processes, when all we really want is "don't run a bunch of data pre-processing calls" ... we may wish we had opted for a simpler process of 1. pre-process the data 2. deepspeed the training stuff ... it is possible this will be fine and we'll never worry about it again ... but it is also possible we'll end up with a bunch of weird hangs because of some bug/issue in the inter-process communication and wish we had opted for a simpler setup ... there is something to be said for a setup that is basically just 1. run the data pre-processing without deepspeed 2. use the deepspeed launcher just for the GPU training stuff ...
I think that's reasonable. I'm a bit wary of all that just because so much of the way people are told/trained to use deepspeed (and torch distributed launch) is to use the launcher script at top level, and I'm not sure it will behave if we do an internal fork. If we want to explicitly have two phases that you have to run, that's fine too (and pretty trivial to support with the new code).
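For concreteness, the explicit two-phase flow being discussed could look roughly like the sketch below. This is just an illustration, not the repo's actual CLI: the `--preprocess_only` flag and the `build_cache` / `train_from_cache` functions are hypothetical names standing in for whatever the real entry points are.

```python
import argparse


def build_cache(cache_dir: str) -> None:
    # Placeholder for the repo's data pre-processing / tokenization step.
    ...


def train_from_cache(cache_dir: str) -> None:
    # Placeholder for the GPU training loop run under the deepspeed launcher.
    ...


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--preprocess_only", action="store_true",
                        help="build the data cache and exit (run without deepspeed)")
    parser.add_argument("--cache_dir", default="cache")
    args = parser.parse_args()

    # Phase 1: always make sure the cache exists.
    build_cache(args.cache_dir)
    if args.preprocess_only:
        return  # stop before any distributed setup

    # Phase 2: GPU training, launched via deepspeed as usual.
    train_from_cache(args.cache_dir)


if __name__ == "__main__":
    main()
```

Phase 1 would then be a plain `python` invocation on each machine, and phase 2 would go through the deepspeed launcher exactly as today.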
Fixes #117
So, as hypothesized in #126 by @jthickstun, parallelized preprocessing was leading to major problems, particularly as one added more GPUs. To mitigate this, I've now made it so that preprocessing is forked into a single separate process per machine (I'm assuming we're using local file systems for caches).
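A minimal sketch of the coordination this describes, not the PR's actual code: `build_cache` is an assumed callable for the preprocessing step, and `LOCAL_RANK` is the per-machine rank that launchers like deepspeed/torchrun export for each process.

```python
import multiprocessing as mp
import os
from typing import Callable

import torch.distributed as dist


def preprocess_once_per_machine(build_cache: Callable[[str], None], cache_dir: str) -> None:
    """Fork preprocessing into a single separate process on each machine."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    if local_rank == 0:
        # Only the first process on each machine forks a preprocessing child;
        # the child writes the cache to the (assumed local) filesystem.
        child = mp.get_context("spawn").Process(target=build_cache, args=(cache_dir,))
        child.start()
        child.join()
        if child.exitcode != 0:
            raise RuntimeError(f"preprocessing failed with exit code {child.exitcode}")

    # All ranks block here so nobody starts training against a half-written cache.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

Running the heavy preprocessing in a spawned child keeps it out of the CUDA/distributed state of the training process, while the barrier gives the "everyone waits for one preprocessor per machine" behavior.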