modular-ml / wrapyfi-examples_llama

Inference code for facebook LLaMA models with Wrapyfi support
GNU General Public License v3.0

CPU/memory requirements for worker nodes #10

Open ACiDGRiM opened 9 months ago

ACiDGRiM commented 9 months ago

This is exactly what I'm looking for to extend my existing cluster, which has high CPU/RAM and zero GPUs. Can you give some insight into whether the workers can run on low-CPU/RAM systems, such as a set of Raspberry Pi 5s each with an RTX 4090 over a 1x PCIe link, while the master handles checkpoint reallocation using its high CPU/RAM capacity? Also, is a gigabit cluster network sufficient to relay MQ messages between workers?

fabawi commented 9 months ago

This implementation of LLaMA does not support loading the model on the CPU, so I can't give insights specific to your setup. However, what you are proposing could be implemented with the underlying framework (Wrapyfi). Wrapyfi lets you transmit tensors over any of four middleware (ROS, ROS 2, YARP, ZeroMQ) without having to convert them to native objects. You can split the layers into separate methods, load the transformer-block weights on individual machines, publish or listen to the tensors produced by the preceding chunk, and eventually form a chain of small methods, each triggering on a specified device.
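
As a rough illustration of that chaining pattern, the sketch below registers one chunk of transformer blocks as a Wrapyfi method that publishes its output tensor over ZeroMQ; the next machine activates the same method in listen mode and receives the tensor instead of computing it locally. This is a minimal sketch based on Wrapyfi's general publish/listen pattern, not code from this repository; the class name, topic, and simplified block interface (`LlamaChunk`, `/llama/chunk_0`, `block(hidden_states)`) are illustrative assumptions.

```python
# Hypothetical sketch: forwarding one chunk of transformer blocks between
# machines with Wrapyfi. Names and the block call signature are illustrative.
import torch
from wrapyfi.connect.wrapper import MiddlewareCommunicator


class LlamaChunk(MiddlewareCommunicator):
    def __init__(self, blocks):
        super().__init__()
        # A slice of the model's transformer blocks assigned to this machine.
        self.blocks = blocks

    @MiddlewareCommunicator.register(
        "NativeObject", "zeromq", "LlamaChunk", "/llama/chunk_0",
        carrier="tcp", should_wait=True)
    def forward_chunk(self, hidden_states=None):
        # In publish mode this runs the local blocks and transmits the
        # resulting tensor; in listen mode the decorated call instead returns
        # the tensor received over the middleware.
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return hidden_states,


# Machine A (holds the first chunk of weights) publishes its output:
#   chunk = LlamaChunk(blocks=first_blocks)
#   chunk.activate_communication(chunk.forward_chunk, mode="publish")
#   hidden, = chunk.forward_chunk(embedded_tokens)
#
# Machine B (holds the next chunk) listens on the same topic and feeds the
# received tensor into its own chunk, continuing the chain:
#   chunk = LlamaChunk(blocks=[])
#   chunk.activate_communication(chunk.forward_chunk, mode="listen")
#   hidden, = chunk.forward_chunk()
```

Each machine only keeps its own slice of the weights in memory, and the tensors hop between chunks over the configured middleware, which is the chaining idea described above.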