jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTa, T5, and T0).
https://ecco.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Support for Distributing Model on Multiple GPUs #81

Closed chrispan68 closed 1 year ago

chrispan68 commented 1 year ago

I'm working with a fairly large model (GPT-Neo 1.3B), and the gradient attribution features cause CUDA out-of-memory errors on the GPUs I have access to.

I know that the Hugging Face from_pretrained method takes an optional device_map parameter which, when set to "auto", splits the model parameters across all available GPUs. The syntax is:

```python
from transformers import AutoModelForSeq2SeqLM

t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto")
```

I was wondering whether a similar feature exists in Ecco. If not, could I submit a PR sometime in the near future?

jalammar commented 1 year ago

No similar feature exists in Ecco currently. A PR is welcome!

BiEchi commented 1 year ago

@chrispan68 To fit your model into limited GPU memory, there are two main ways to go:

  1. Modify the Ecco source code to save only the attribution features you actually need.
  2. Add multi-GPU support, which would require gradient backpropagation across several GPUs on a single node.

I'm also looking into these two approaches. We can have a discussion if you're still interested.

BiEchi commented 1 year ago

Just worked it out, @chrispan68:

  1. Set device_map="auto" in the Hugging Face from_pretrained() call (in ecco/src/ecco/__init__.py); a sketch of this change follows the list.
  2. Set gpu=False in Ecco's from_pretrained() (in your own script that uses Ecco), so that Ecco does not move the model and data to a single device (see https://github.com/huggingface/transformers/issues/25145).
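
For step 1, here's a minimal sketch of what the change might look like. The surrounding Ecco code and exact keyword arguments vary by version, and the model id is simply the one from the original question:

```python
from transformers import AutoModelForCausalLM

# Inside Ecco's loader (ecco/src/ecco/__init__.py), add device_map="auto"
# to the Hugging Face from_pretrained() call. With `accelerate` installed,
# the weights are sharded across all visible GPUs instead of being loaded
# onto a single device.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B",
                                             output_hidden_states=True,
                                             device_map="auto")
```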

Then you can run Ecco across multiple GPUs just as you would with Hugging Face Transformers.
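
From the calling side, step 2 then looks roughly like this (gpu is the existing flag in Ecco's from_pretrained; the model id is illustrative):

```python
import ecco

# gpu=False keeps Ecco from moving the model and inputs to a single CUDA
# device, which would conflict with the device map that accelerate set up
# in the patched loader above.
lm = ecco.from_pretrained("EleutherAI/gpt-neo-1.3B", gpu=False)
```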