jafermarq / FlowerMonthly

Short, easy-to-run code examples presented in Flower's FlowerMonthly online meetings.
https://flower.ai
MIT License

Impact of Ray 1.x vs Ray 2.x on Flower simulations #1

Open lbhm opened 1 year ago

lbhm commented 1 year ago

Hi @jafermarq,

I saw your talk from the most recent Flower Monthly today and noticed that you recommended using an older (1.11) version of Ray for Flower simulations. Since I only recently started looking into Ray and defaulted to the newest release, I would like to ask why you recommend an older release. Are there known limitations or bugs in Ray 2.x in the way Flower uses it?

In the meeting, someone from the audience also mentioned out-of-memory errors. Since I have been experiencing the same, I wanted to ask you if this could be related to the Ray version.

In my first test runs on a dedicated Ray cluster, I took the Flower Strategies Example (i.e., a very small model) and trained 10 clients on a cluster of 10 nodes with one 80GB A100 each. For some reason, some clients reported CUDA out-of-memory errors after a few epochs of training, even though the small model should very comfortably fit in 80GB of VRAM. Are you aware of some kind of memory leak in Ray itself, or when Ray and Flower are used in combination?

jafermarq commented 1 year ago

Hi @lbhm, you ask a really good question!

The VirtualClientEngine in Flower makes use of Ray Tasks to spawn virtual clients and execute one of their methods (primarily fit() and evaluate()). The nice feature of Tasks is that they behave pretty much like a function and can be executed in a resource-aware manner (e.g. you can assign certain CPU or GPU resources to a Task -- or even create your own custom resources). So Flower has been using Ray Tasks for a while, and it has been great! However, with Ray 1.12.0 (and subsequent releases) something about the handling/releasing of resources between Tasks changed, which led to OOM errors fairly rapidly when Tasks (i.e. our Flower clients) use GPU resources. I haven't had the chance to look into it deeply to find the exact reason -- but most likely this change was made for a good reason by the maintainers of the Ray project.
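
To illustrate the pattern, here is a minimal sketch (not Flower's actual internals): a resource-aware Ray Task that builds a short-lived client and runs one of its methods. DummyClient is a placeholder, and num_gpus=0.25 assumes a machine with at least one GPU:

```python
import ray

ray.init()

class DummyClient:
    """Placeholder for a Flower client; illustration only."""
    def __init__(self, cid: str):
        self.cid = cid

    def fit(self):
        return f"client {self.cid} trained"

# Resource-aware Task: Ray only schedules it where 1 CPU and a quarter
# of a GPU are free, so up to 4 such clients can share one GPU.
@ray.remote(num_cpus=1, num_gpus=0.25)
def launch_and_fit(cid: str):
    client = DummyClient(cid)  # the client lives only for this Task
    return client.fit()

print(ray.get([launch_and_fit.remote(str(i)) for i in range(4)]))
```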

An easy solution to this OOM issue was proposed: pass max_calls=1 to the @ray.remote decorator on the Tasks. And this solution works. Unfortunately, in FL the workloads clients run are often fairly short, making the overhead that max_calls=1 introduces unacceptable in most FL simulation settings. You can see more about this topic, along with some numerical results, in a recent PR I opened for Flower. In short, we are moving away from Tasks and instead using Actors. This could come in the next release of Flower, but that is still TBD...
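
As a hedged sketch of that workaround (reusing DummyClient from the snippet above): max_calls=1 makes Ray tear down the worker process after every single invocation, which releases GPU memory but adds process start-up cost to every client execution:

```python
# Same sketch as above, but with the workaround: after each call Ray
# kills the worker process, forcing CUDA memory to be released at the
# cost of re-spawning a process for every client invocation.
@ray.remote(num_cpus=1, num_gpus=0.25, max_calls=1)
def launch_and_fit_once(cid: str):
    client = DummyClient(cid)
    return client.fit()
```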

Coming back to your question... Yes. My advice is to downgrade to ray 1.11.1 for now. If you install Flower first and later run pip install ray==1.11.1, you might see some warning messages about incompatible grpcio versions. You can ignore these. With this version, I guarantee you will have no OOM due to Tasks not releasing their resources.

lbhm commented 1 year ago

Thank you for the quick response and the thorough explanation! Moving to Ray Actors makes sense based on what you explained.

I will have to see if downgrading to Python 3.9 to meet Ray 1.11's requirements causes any issues with the rest of our setup. If you agree, I would keep this issue open for others to see until Flower moves to Ray Actors.

jafermarq commented 1 year ago

From my experience ray==1.11.1 and Python 3.9 work fine.

Happy to keep the issue open.

lbhm commented 1 year ago

As we are talking about Ray and the enhancements you plan to introduce to the way Flower uses it, I wanted to use the opportunity to ask another Ray-Flower question:

Is there any way to control where Flower clients physically load their local data from when run as part of a Ray simulation? The background of my question is that we are investigating many (possibly large) datasets in FL settings. The way I understand it, the data are loaded from wherever you start a simulation and then put into Ray's distributed object store. From there, they are sent to whichever client needs them.

Since, in our project, we have a large but consistent collection of datasets and worker nodes with sufficient storage, I would rather copy all datasets to each worker once and then have Ray/Flower load the data from each worker's local storage, instead of loading the data into the object store and sending them over the wire again in each experiment.

So far, I haven't tried very large datasets but our plan is to go to ImageNet-scale and beyond. My concern is that such large datasets would add a considerable overhead if my understanding of how Ray makes data available to workers is correct.

jafermarq commented 1 year ago

So if you plan to do multi-node simulations (which have been possible with Flower since the first version of the VirtualClientEngine), it depends on how you set up your clients. I normally make sure all nodes have all the data and make all clients point to the same data path (e.g. /home/<username>/projects/fl/data). You just need to ensure you set this up before starting the simulation. Then, if that directory exists on all nodes, the clients will transparently retrieve the data from that path no matter which node the @ray.remote Task is executed on. I usually work like that (and you can see I follow that logic when initialising the clients in this repo).
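
As a rough sketch of that setup (the partition layout and FlowerClient below are made up for illustration; I'm assuming the standard client_fn/start_simulation entry points):

```python
from pathlib import Path

import flwr as fl

# The same path must exist on every node of the Ray cluster.
DATA_ROOT = Path("/home/<username>/projects/fl/data")

class FlowerClient(fl.client.NumPyClient):
    def __init__(self, data_dir: Path):
        self.data_dir = data_dir  # training data is read lazily from here

    def fit(self, parameters, config):
        # ... train on the node-local partition under self.data_dir ...
        return parameters, 0, {}

def client_fn(cid: str) -> fl.client.NumPyClient:
    # Runs inside the Ray task, on whichever node it lands on; the path
    # resolves locally because every node holds a full copy of the data.
    return FlowerClient(DATA_ROOT / f"client_{cid}")

fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=10,
    client_resources={"num_cpus": 2, "num_gpus": 0.25},
)
```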

If duplicating the data is not feasible, you can tell clients to read from a network file system (NFS). Just make sure the client class is parameterised during construction with everything it needs. You can do this for larger datasets like ImageNet, Google Landmarks, etc. For those, you can use typical approaches such as WebDataset that optimise data fetching over the network.
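
For instance, a rough sketch of the WebDataset approach (the shard URL pattern is made up):

```python
import webdataset as wds

# Stream sharded tar files over the network instead of copying the whole
# dataset; shards are fetched sequentially, which is friendly to
# NFS/object storage.
url = "https://my-storage.example.com/imagenet/shard-{000000..000146}.tar"
dataset = (
    wds.WebDataset(url)
    .shuffle(1000)           # small in-memory shuffle buffer
    .decode("pil")           # decode images with PIL
    .to_tuple("jpg", "cls")  # yield (image, label) pairs
)
```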

lbhm commented 1 year ago

I see. Storage is not an issue for us so I will just make sure to have a local copy of all the data on each worker node and to parametrize the client class accordingly. Thank you!

lbhm commented 1 year ago

Hey @jafermarq,

As you are currently working on the Ray Actors PR, I wanted to ask you about another Flower-Ray related issue:

By default, Ray co-locates tasks and actors on a single worker as long as that worker has available resources [1]. In our case, we would like to spread the Flower clients/Ray Actors across workers. Ideally, as long as num_clients < num_ray_workers, we would like a fixed assignment of Flower client to Ray worker.

Do you think it would be possible to expose the Ray scheduling strategy as a parameter in Flower simulations? For us, it would help to be able to choose the "SPREAD" strategy or, maybe even better, the NodeAffinitySchedulingStrategy.

[1] https://docs.ray.io/en/latest/ray-core/scheduling/index.html

jafermarq commented 1 year ago

Hi @lbhm, sure! Exposing the scheduling_strategy option when instantiating the actors can be done easily. I'll add it. NodeAffinitySchedulingStrategy looks interesting as well. What is your particular use case? Assigning certain FL clients to particular hardware?
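
For context, this is roughly what plain Ray 2.x exposes (a minimal sketch; VirtualClientActor is just an illustrative name):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
class VirtualClientActor:
    def node(self):
        return ray.get_runtime_context().get_node_id()

# "SPREAD" asks Ray to place the actors across distinct nodes.
spread = [
    VirtualClientActor.options(scheduling_strategy="SPREAD").remote()
    for _ in range(4)
]

# NodeAffinitySchedulingStrategy pins an actor to one specific node
# (soft=False makes the placement a hard requirement).
node_id = ray.nodes()[0]["NodeID"]
pinned = VirtualClientActor.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote()
print(ray.get([a.node.remote() for a in spread + [pinned]]))
```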

lbhm commented 1 year ago

Hardware-specific assignment sounds interesting. So far, I was mainly thinking about the NodeAffinitySchedulingStrategy as a means to improve performance if num_clients <= num_ray_workers and the training datasets are large (i.e., cross-silo-inspired FL setups). My assumption was that with a fixed worker-client assignment you run into fewer thrashing problems when the dataset is larger than your RAM, because the same data files (training samples) remain in memory on each worker. Without a NodeAffinitySchedulingStrategy, each client would first need to load its specific data into memory at the start of each round.

jafermarq commented 1 year ago

I see your point. The setting you describe makes sense, so I'll find a way to include it. Thanks for the suggestions!

jafermarq commented 1 year ago

Thinking twice about your last sentence: this would still be true in simulation, even if clients are always allocated to the same node. The current implementation (still an open PR, but now running some final tests: https://github.com/adap/flower/pull/1969) uses Actors, which persist. They run jobs from a queue, and these jobs are pretty much what we've been doing with Ray Tasks: (1) instantiate a client, (2) run the specified method (e.g. fit()). To avoid having to materialise the clients for each "task" they are sampled for, I'm thinking the current best solution, especially if you have cross-silo setups with very few clients, would be to rely instead on non-simulated clients. These persist by default.
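
Roughly, the pattern is (a paraphrased sketch, not the PR's actual code):

```python
import ray

@ray.remote
class ClientProxyActor:
    """The actor process persists across rounds, but a fresh Flower
    client is still constructed for every job it pulls off the queue."""

    def run(self, client_fn, cid: str, method: str):
        client = client_fn(cid)            # client re-created per job
        return getattr(client, method)()   # e.g. "fit" or "evaluate"
```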

jafermarq commented 1 year ago

One option to achieve what you want would be to add some notion of "node state" or "client state" to the Ray Actor itself. That way, some components needed by the clients an actor spawns are readily available in memory. We are working on adding stateful clients to Flower, so the changes we are discussing could come at a later phase (once the VCE with Actors is released, and the first implementation of client states too: https://github.com/adap/flower/pull/2148).
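
Sketching the idea (everything below is hypothetical, including the load_partition helper):

```python
import ray

def load_partition(cid: str):
    """Hypothetical loader; stands in for reading a client's data shard."""
    return f"dataset-for-{cid}"

@ray.remote
class StatefulClientActor:
    """Hypothetical 'node state': heavy objects are cached inside the
    actor and survive across rounds."""

    def __init__(self):
        self._cache = {}  # cid -> dataset, loaded once per actor

    def run_fit(self, client_fn, cid: str):
        if cid not in self._cache:
            self._cache[cid] = load_partition(cid)
        client = client_fn(cid, self._cache[cid])
        return client.fit()
```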

Very interesting discussion! Happy to chat more; you can find me on the Flower Slack.