Open bclavie opened 11 months ago
Congratulations on this amazing work @bclavie 🤩,
Thank you also for the documentation with the DataLoader.
I'll run your branch in the following days to make sure everything runs smoothly and then merge and release a new version.
Thank you! Please do let me know if you run into any issues -- things are training fine right now but I'm using a pretty weird setup so there might still be some issues.
> Thank you also for the documentation with the DataLoader.
To be fair there's no code there at the moment, but I'm happy to update with mock data in a bit if you think it'd be useful!
I don't have multiple GPUs (not even one) at home, so I cannot mimic your environment.
I propose adding an `accelerate` attribute to all the models. If set to false, it will call the `tokenizer.encode_batch` method; otherwise, it will call your encoding procedure. I did this because using `position_ids` raises an error with DistilBERT but works fine with sentence-transformers: neural_cherche/models/base.py
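To make the proposal concrete, here is a minimal sketch of how such a flag could dispatch between the two encoding paths. The class and method names are illustrative assumptions, not neural-cherche's actual API:

```python
# Hypothetical sketch of the proposed `accelerate` attribute: when False,
# fall back to the tokenizer's plain batch encoding; when True, use a custom
# path that also builds explicit position_ids (which some backbones, like
# DistilBERT, do not accept). Names here are assumptions for illustration.
class Model:
    def __init__(self, tokenizer, accelerate=False):
        self.tokenizer = tokenizer
        self.accelerate = accelerate

    def encode(self, texts):
        if not self.accelerate:
            # Default single-device path: let the tokenizer handle batching.
            return self.tokenizer.encode_batch(texts)
        # Accelerated path: add explicit position_ids per sequence.
        encoded = self.tokenizer.encode_batch(texts)
        encoded["position_ids"] = [
            list(range(len(ids))) for ids in encoded["input_ids"]
        ]
        return encoded
```

The point of keeping it as a plain boolean attribute is that existing single-GPU users see no behaviour change unless they opt in.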
I also updated the documentation a bit in order to show how to create a dataset.
All tests pass locally with the code from your branch and my updates, feel free to copy paste the code I commented.
Also, what versions of transformers and accelerate are you using?
> All tests pass locally with the code from your branch and my updates, feel free to copy paste the code I commented.
Hey, did you submit the comments? I can't see the suggested code anywhere, though it might be me being holiday-tired...
Thank you for taking the time to look at this and improving it! I'm running `transformers==4.36.2` and `accelerate==0.25.0`.
I've run some more experiments, and for full disclosure so far:
My feeling is that it might actually be unsafe to merge as a "mature" feature at this stage, but merging it labelled as experimental support could be useful?
(As for neural-cherche itself, I really like the lightweightness of the lib, but currently I'm running into some issues where my models end up stuck in some kind of "compressed similarity" land and hard negatives are always extremely close to positives in similarity, which doesn't happen with the main ColBERT codebase -- I'm training a ColBERT from scratch and will try to diagnose once I have more time!)
> Hey, did you submit the comments? I can't see the suggested code anywhere, though it might be me being holiday-tired...
Ahah missed this, sorry.
> (As for neural-cherche itself, I really like the lightweightness of the lib, but currently I'm running into some issues where my models end up stuck in some kind of "compressed similarity" land and hard negatives are always extremely close to positives in similarity, which doesn't happen with the main ColBERT codebase -- I'm training a ColBERT from scratch and will try to diagnose once I have more time!)
It could come from the loss function, which is quite simple. Would love to get your feedback on this if you find anything.
Overall, I think it's fine to push your work to master if we use the `self.accelerate` flag. It will be a first step toward accelerating the lib across multiple GPUs! :)
> Ahah missed this, sorry.
No worries, I've applied the changes 1:1, except for the tutorial page (added that support is partial/in-progress, so people don't get the impression it's fully supported yet!)
> It could come from the loss function, which is quite simple. Would love to get your feedback on this if you find anything.
I think that's probably it... I'll definitely try to figure out exactly which component has the biggest impact once I've got some more time.
Hey! Great work on the library. I've been playing with it and ran into a few issues with in-place operations when trying to train on multiple GPUs:
Setting the device this way also really doesn't play nicely with the default tokenizer export, so there's a workaround to export the files individually rather than relying on risky JSON decoding.
I've also added a doc page to show how simple it is to parallelise training with just those few changes and some very slight code modifications in a training script.