s1dlx / meh

Merging Execution Helper
MIT License

feat: work device #28

Closed ljleb closed 1 year ago

ljleb commented 1 year ago

Add the possibility of using different devices for storing vs. merging keys. I get a ~2x speedup for ties_add_difference:

Using --work-device cpu:

stage 1: 100%|██████████| 1131/1131 [01:32<00:00, 12.19it/s]

Using --work-device cuda:

stage 1: 100%|██████████| 1131/1131 [00:46<00:00, 24.58it/s]
s1dlx commented 1 year ago

Is the idea to keep the model in RAM and move only one block at a time to VRAM?

ljleb commented 1 year ago

Yes. Move the keys onto the work device only when we are about to do work (i.e. merge them together).
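The idea above can be sketched as a per-key transfer loop. This is a minimal illustration, not the actual meh implementation: `merge_keys` and `merge_fn` are hypothetical names, and the assumption is that models are plain dicts of PyTorch tensors kept on a storage device (e.g. CPU RAM) while each key visits the work device only for the merge itself.

```python
import torch

def merge_keys(models, merge_fn, device="cpu", work_device=None):
    # Hypothetical sketch: tensors live on `device` (storage, e.g. CPU RAM);
    # each key is moved to `work_device` only while it is being merged.
    if work_device is None:
        work_device = device  # work device defaults to the storage device
    merged = {}
    for key in models[0]:
        # Transfer just this key's tensors to the work device (e.g. "cuda")
        tensors = [m[key].to(work_device) for m in models]
        result = merge_fn(tensors)
        # Store the result back on the storage device to keep VRAM usage low
        merged[key] = result.to(device)
    return merged
```

With `work_device="cuda"`, only one key's worth of tensors occupies VRAM at a time, which is why more models fit in memory while compute-heavy merge methods still get a GPU speedup.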

ljleb commented 1 year ago

On my system this allows loading more models into memory while still getting a speedup with some merge methods.

s1dlx commented 1 year ago

> On my system this allows to load more models into memory while still getting a speedup with some merge methods.

I see

Is that as fast as loading the entire model into VRAM? Possibly the difference is negligible.

ljleb commented 1 year ago

Not as fast, but comparable in some cases. For example, I do get a good speedup for methods that use sorting. I'll run another test to compare ties on full GPU vs. work-only GPU.

s1dlx commented 1 year ago

And I guess that setting both to GPU is like the current behavior: load everything at once and merge.

ljleb commented 1 year ago

Wait, I made a mistake. I intended the default value of --work-device to be the device specified by --device.
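The intended defaulting can be sketched with argparse. This is a hypothetical illustration of the fallback logic described above (flag names match the thread; the parser and `resolve_devices` helper are assumptions, not meh's actual CLI code):

```python
import argparse

# Hypothetical sketch: --work-device falls back to --device when not given.
parser = argparse.ArgumentParser()
parser.add_argument("--device", default="cpu")
parser.add_argument("--work-device", default=None)

def resolve_devices(argv):
    args = parser.parse_args(argv)
    # If --work-device was not passed explicitly, inherit --device
    work_device = args.work_device if args.work_device is not None else args.device
    return args.device, work_device
```

So `--device cuda` alone would merge fully on GPU, while `--device cpu --work-device cuda` keeps storage in RAM and does only the compute on GPU.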

ljleb commented 1 year ago

With ties add difference, it is ~2x faster to use --device cuda:

stage 1: 100%|██████████| 1131/1131 [00:14<00:00, 77.05it/s]

With --work-device cuda:

stage 1: 100%|██████████| 1131/1131 [00:30<00:00, 37.56it/s]

With --device cpu (or no cli flags):

stage 1: 100%|██████████| 1131/1131 [01:18<00:00, 14.44it/s]