ml6team / fondant

Production-ready data processing made easy and shareable
https://fondant.ai/en/stable/
Apache License 2.0

Distributed processing and scaling up #500

Closed rom1504 closed 1 year ago

rom1504 commented 1 year ago

Hello!

I like the goals you have with fondant. I also believe data processing at scale is quite important for ML. I have similar goals with img2dataset, clip-retrieval, video2dataset and cc2dataset and these tools worked pretty well at scale.

As you've seen, a lot of different filtering and transformation steps are possible, and making those modular and reusable is nice. That's true for text and images, and even more so for bigger modalities such as video, 3D, and bio data.

You made the choice here to package everything with Docker, use Kubeflow and Dask, and have a well-defined component structure. I find the architecture to be quite similar to https://github.com/jina-ai/jina

I am wondering what you found in terms of:

  • speed to spin up a pipeline: if you use N components, each using its own docker file, is that still fast?
  • overhead of docker: does each component use some minimum amount of RAM?
  • scaling up: in my tools I've tried to keep things as minimal as possible to be able to scale to billions (or even trillions in the case of cc2dataset) of samples. I've found that optimizing for network, CPU, and RAM constraints usually forces some specific designs (e.g. for img2dataset, shuffling the dataset beforehand to avoid killing external hosts, having workers which each have a thread pool, giving different levels of parallelism, ...). How are things working for you there? Is Dask reliable enough? Is Kubeflow able to handle large scale? I'm particularly curious whether you think your architecture will keep working at any scale, or whether it will need to be adapted and where.

Thank you for any insights!

PS: maybe this should go to the discussion tab; feel free to move it there if it makes more sense

RobbeSneyders commented 1 year ago

Hi @rom1504, thanks for starting this discussion!

I like the goals you have with fondant. I also believe data processing at scale is quite important for ML. I have similar goals with img2dataset, clip-retrieval, video2dataset and cc2dataset and these tools worked pretty well at scale.

Thanks for your work on those! We've used some of your tools and included them in some of our components. We opted to copy in part of the tools instead of using them as a dependency because of some dependency conflict issues. I'll open an issue on one of your repos with more info.

You made the choice here to package everything with Docker, use Kubeflow and Dask, and have a well-defined component structure. I find the architecture to be quite similar to https://github.com/jina-ai/jina

There are indeed some similarities with Jina, but I think the angle is a bit different. Jina focuses on hosting your models and allows you to chain them into pipelines. Fondant focuses on reusable data processing, with the possibility to host models inside components. It will be interesting to see how much those angles converge.

Our focus on reusable data processing is also the reason for Docker and our well-defined component structure. Both help with the interoperability and reusability of the components.

I am wondering what you found in terms of:

  • speed to spin up a pipeline: if you use N components, each using its own docker file, is that still fast?

There are a couple of aspects to this:

  • overhead of docker: does each component use some minimum amount of RAM?

When running on Linux, there is virtually no overhead. On Mac or Windows there might be, but that's not really relevant for scaling.

The only place where there might be overhead is on the networking side. We've noticed that we don't achieve the same performance downloading images as img2dataset, but we haven't had the time to properly investigate where the issue lies. We think it might be due to docker, but if it is, it might be resolved by proper network configuration (although that might not be possible on every orchestrator).

If you want to get a feel for the points mentioned above yourself, I recommend giving our local runner a spin (see our Getting started docs).

  • scaling up: in my tools I've tried to keep things as minimal as possible to be able to scale to billions (or even trillions in the case of cc2dataset) of samples. I've found that optimizing for network, CPU, and RAM constraints usually forces some specific designs (e.g. for img2dataset, shuffling the dataset beforehand to avoid killing external hosts, having workers which each have a thread pool, giving different levels of parallelism, ...). How are things working for you there? Is Dask reliable enough? Is Kubeflow able to handle large scale? I'm particularly curious whether you think your architecture will keep working at any scale, or whether it will need to be adapted and where.

Design

Proper design will still be important to be able to scale pipelines and components. We've seen this, for instance, when extracting URLs from Common Crawl, where we had to combine a lot of steps into a single component to prevent large data movement, even though smaller parts of the flow could have been reusable for other use cases. Or when doing global deduplication, where we had to cluster the data in a first component and then do local deduplication per cluster in a second component, splitting a single logical step into two components.
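A rough sketch of that two-component deduplication split (the exact-hash bucketing below is a stand-in for illustration; a real pipeline would more likely cluster on embeddings or MinHash signatures, and the column names are invented):

```python
import hashlib

import dask.dataframe as dd


def bucket_of(text: str, n_buckets: int = 1024) -> int:
    # Deterministic hash so identical texts land in the same bucket on every worker.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % n_buckets


# Component 1 (illustrative): assign each row to a bucket and shuffle so that
# potential duplicates end up in the same partition.
def cluster(dataframe: dd.DataFrame) -> dd.DataFrame:
    dataframe["bucket"] = dataframe["text"].map(bucket_of, meta=("bucket", "int64"))
    return dataframe.shuffle(on="bucket")


# Component 2 (illustrative): deduplicate within each partition, which after the
# shuffle amounts to a global deduplication.
def deduplicate(dataframe: dd.DataFrame) -> dd.DataFrame:
    return dataframe.map_partitions(lambda pdf: pdf.drop_duplicates(subset="text"))
```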

Supporting nested pipelines (including a pipeline as a component in a larger pipeline) might offer some benefits here in the future. But mainly the reusability of components will. Since a component can be implemented once and reused many times, it only needs to be designed properly once.

Dask

We currently get two things from Dask:

  • loading the data in chunks
  • executing the work on those chunks in parallel

You can see both of these combined in our PandasTransformComponent, where you only need to implement your transform function on a single Pandas dataframe. We use Dask to load the data in chunks as Pandas dataframes and execute them in parallel. See this simple example.
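As a rough sketch of what that looks like for a component author (the import path, constructor handling, and column names are assumptions for illustration; the docs have the authoritative version), the user-facing code only ever sees plain pandas:

```python
import pandas as pd

# Assumed import path for the base class mentioned above.
from fondant.component import PandasTransformComponent


class FilterImageResolution(PandasTransformComponent):
    """Hypothetical component that keeps only sufficiently large images."""

    def __init__(self, *, min_width: int, min_height: int) -> None:
        self.min_width = min_width
        self.min_height = min_height

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        # `dataframe` is one chunk of the dataset, handed over as a plain pandas
        # DataFrame; the framework runs this method on chunks in parallel.
        # Column names are made up for the example.
        mask = (dataframe["images_width"] >= self.min_width) & (
            dataframe["images_height"] >= self.min_height
        )
        return dataframe[mask]
```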

Concurrency on a single core is not handled by Dask, and currently needs to be implemented manually, but we might offer an abstraction for this in Fondant in the future.
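Until such an abstraction exists, one way to get that intra-chunk concurrency for I/O-bound work is a plain thread pool inside the transform; the download helper and column names below are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

import pandas as pd
import requests


def fetch(url: str) -> Optional[bytes]:
    """Illustrative download helper; real code would add retries and backoff."""
    try:
        return requests.get(url, timeout=10).content
    except requests.RequestException:
        return None


def transform(dataframe: pd.DataFrame) -> pd.DataFrame:
    # Dask already parallelises across chunks; the thread pool adds concurrency
    # within a single chunk, which helps for network-bound work like downloads.
    with ThreadPoolExecutor(max_workers=16) as pool:
        dataframe["images_data"] = list(pool.map(fetch, dataframe["images_url"]))
    return dataframe
```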

Dask might not always be the best choice for every data type or transformation, so we made sure to encapsulate the Dask-related code in specific dataIO classes. This allows us to support additional frameworks in the future.
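Purely to illustrate the idea (the class names below are invented, not Fondant's actual dataIO classes), the shape of such an encapsulation could look like this, with components talking to a small load/write interface and only one implementation knowing about Dask:

```python
from abc import ABC, abstractmethod
from typing import Any

import dask.dataframe as dd


class DataIO(ABC):
    """Hypothetical load/write interface the component executor talks to."""

    @abstractmethod
    def load(self) -> Any:
        """Return the dataset in whatever representation the backend uses."""

    @abstractmethod
    def write(self, dataframe: Any) -> None:
        """Persist the (possibly transformed) dataset."""


class DaskDataIO(DataIO):
    """Dask-backed implementation; another engine could sit behind the same interface."""

    def __init__(self, input_path: str, output_path: str) -> None:
        self.input_path = input_path
        self.output_path = output_path

    def load(self) -> dd.DataFrame:
        return dd.read_parquet(self.input_path)

    def write(self, dataframe: dd.DataFrame) -> None:
        dataframe.to_parquet(self.output_path)
```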

Kubeflow pipelines

Kubeflow pipelines is only one of the orchestrators that we support, but we can use it as an example, as the other orchestrators we (will) support are very similar. The defining features are:

This is also currently the limit to our scaling.

I can see paths to move beyond this single-machine limit though:


This became quite long :sweat_smile:, but it was helpful for me as well to write this down in a (hopefully) structured way. Looking forward to your feedback.

RobbeSneyders commented 1 year ago

Turning this into a discussion so we can track the progress towards distributed execution in #549