opentensor / validators

Repository for bittensor validators
https://www.bittensor.com/
MIT License

Optimize GPU usage in reward models #82

Open · p-ferreira opened this issue 1 year ago

p-ferreira commented 1 year ago

Some of the validators are hitting CUDA OOM errors every now and then (including the test validator).

https://wandb.ai/opentensor-dev/openvalidators/runs/7p6prmo1/logs?workspace=user-opentensor-pedro

My initial hypothesis is that tensors are accumulating on the GPU until they reach the memory limit. Considering that we have a validator that should run for days, it would be nice to identify some potential points of improvement in GPU memory management so we avoid reaching the OOM point.
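
As a first diagnostic for the accumulation hypothesis, something along these lines (not part of the validator code; `run_step` is a hypothetical placeholder for one validator iteration) could log the allocator counters every step so a slow upward drift becomes visible:

```python
import torch

def log_gpu_memory(step: int) -> None:
    # torch.cuda counters: memory currently held by tensors, memory reserved by
    # the caching allocator, and the high-water mark since the last reset.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"step={step} allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

for step in range(1000):
    # run_step()  # hypothetical: one validator forward/reward iteration
    log_gpu_memory(step)
```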

p-ferreira commented 1 year ago

Issue #96 provides a palliative solution for an edge case in which validators experience out-of-memory (OOM) errors.

This bug is not a showstopper, as it does not happen very often and the validators are restarted gracefully by the autorun script. However, it still needs to be investigated, since it can cause inconvenience.

p-ferreira commented 1 year ago

Issue update (initial EDA):

The issue is still present as we can see from the following wandb runs:

One thing that can be observed is that this exception does not follow a temporal pattern, as run durations vary from 59m to 1d 21h 53m.

Plotting the GPU memory allocation of some preliminary runs from netuid 11, we can see a peak that occurs suddenly at some point during the runs.

[figure: GPU memory allocation of preliminary netuid 11 runs]

Looking at the GPU memory allocation of the runs mentioned above, we can verify that GPU usage does not grow linearly or consistently over time.

The pattern that stands out across the logs of those runs is that the following error occurs (example from a real run):

OutOfMemoryError: CUDA out of memory. Tried to allocate 4.40 GiB (GPU 0; 39.56 GiB total capacity; 28.93 GiB already allocated; 566.56 MiB free; 32.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
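
The error message itself points at allocator fragmentation. If that turns out to be part of the problem, one possible mitigation is to cap the allocator's split size via `PYTORCH_CUDA_ALLOC_CONF` before CUDA is initialized; a minimal sketch is below (the 512 MiB value is only an assumption and would need tuning for the validator's workload):

```python
import os

# Must be set before the first CUDA allocation; 512 MiB is an assumption,
# not a tested setting for the validators.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # imported after setting the env var so the caching allocator picks it up
```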

The error happens when a model tries to allocate more GPU memory than is available. Looking at the ~250 rows in the stack trace before the EOF exception of the wandb log, it can be seen that **the error happens consistently with the openassistant model.**
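
Since the traceback points at the openassistant model, one way to narrow things down is to measure the per-call memory high-water mark around just that scoring pass. The sketch below assumes a hypothetical `reward_model`/`tokenizer` pair with a HuggingFace-style interface and a single scalar reward logit, which may not match the actual openvalidators objects:

```python
import torch

def score_with_memory_report(reward_model, tokenizer, prompt: str, completion: str) -> float:
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():  # make sure no activations are kept for backward
        inputs = tokenizer(prompt + completion, return_tensors="pt", truncation=True).to("cuda")
        score = reward_model(**inputs).logits.squeeze().item()  # assumes a single scalar reward logit
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"openassistant scoring peak: {peak_gib:.2f} GiB")
    return score
```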

Some simulations were done in an attempt to replicate the peak, using the isolated openassistant model and the complete default reward stack of the validator.

Both tests iterated over the data of run w60lsiy9, passing all prompts + completions to the reward flow.
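
For reference, the replay can be structured roughly like this (the CSV export name and the `reward_flow` helper are placeholders, not the actual test code); tracking the peak per item would also surface whether a single prompt/completion pair drives the spike:

```python
import pandas as pd
import torch

# Assumed export of the wandb run history for w60lsiy9, with prompt/completion columns.
history = pd.read_csv("w60lsiy9_history.csv")

peaks = []
for _, row in history.iterrows():
    torch.cuda.reset_peak_memory_stats()
    # reward_flow(row["prompt"], row["completion"])  # placeholder for the reward + mask stack
    peaks.append(torch.cuda.max_memory_allocated() / 1024**3)

# An outlier here would point at a specific prompt/completion pair.
print(f"max per-item peak: {max(peaks):.2f} GiB")
```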

Test 1: Isolated openassistant model

[figure: GPU memory usage, isolated openassistant model]

Test 2: Default reward + mask stack of openvalidators

[figure: GPU memory usage, default reward + mask stack]

In both cases, after the initial peak from loading the model(s), GPU usage remained stable without variation. Neither attempt resulted in an OOM exception.

Possible directions for future investigation