iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org

dvc and apache hudi integration #4937

Closed. LuisMoralesAlonso closed this issue 3 years ago.

LuisMoralesAlonso commented 3 years ago

Does this kind of integration make sense? We could rely on Hudi to manage versions (incremental ones this time, so with lower storage needs).
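To make the idea concrete, here is a minimal sketch of the incremental pattern I have in mind, assuming Hudi's Spark datasource (the table path, key fields, and commit instant are placeholders, not a worked integration):

```python
# Minimal sketch of Hudi's incremental-versioning pattern (PySpark).
# Requires the Hudi Spark bundle on the classpath; the path, record key,
# and field names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()

hudi_path = "s3://lake/datascience-hub/features"

# A batch of new/updated feature rows (toy data for illustration).
features_df = spark.createDataFrame(
    [(1, "20201001000000", 0.5)], ["feature_id", "ts", "value"])

# Upsert the batch: Hudi stores it as an incremental commit rather
# than a full new copy of the dataset.
(features_df.write.format("hudi")
    .option("hoodie.table.name", "features")
    .option("hoodie.datasource.write.recordkey.field", "feature_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save(hudi_path))

# Later, pull only the rows committed after a given instant.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20201001000000")
    .load(hudi_path))
```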

Hoping for your comments,

luis

efiop commented 3 years ago

@LuisMoralesAlonso Could you please elaborate?

LuisMoralesAlonso commented 3 years ago
karajan1001 commented 3 years ago

Incremental data; might be related to #331.

dmpetrov commented 3 years ago

@LuisMoralesAlonso there are a few more questions...

  • We want to version data here, so we've thought about using Apache Hudi to manage this in the most efficient way.
  1. Is Parquet the primary format for ML and the "datascience" datahub, or do you use "less structured" formats?
  2. Is Hudi needed for being close to real time? Is being close to real time important for ML use cases?
LuisMoralesAlonso commented 3 years ago

Answers:

1.- We are currently using Parquet for our entire data lake, so we want to use it as much as possible. At the same time, we want to version the features we are using for our ML projects; that's the reason to think about Apache Hudi as our primary format for this datascience-hub. We want this data governed.

2.- Once we need to train a model in a particular project, we would materialize the needed features from the datascience-hub. At that point you can use whatever format is needed (this will depend mainly on the formats supported by the particular framework you are using). This data will be more ephemeral.

3.- We could use Petastorm to consume Parquet from the main DL frameworks, but it's not compatible with Hudi; this is something we are asking the Petastorm team about too (see the sketch below).

4.- For real time, in the case of online model serving, we have an in-memory grid where the features will be replicated (or calculated) to coordinate both the training and the different serving options (inference).
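To illustrate point 3, a minimal sketch of the Petastorm path we have in mind, assuming the features have already been materialized from the datascience-hub into plain Parquet (the path below is a placeholder):

```python
# Sketch of point 3: Petastorm reading plain Parquet that was
# materialized from the datascience-hub. The path is a placeholder;
# Petastorm reads Parquet directly but not Hudi-managed tables,
# which is why we materialize the features first.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

with make_batch_reader("file:///data/materialized/features_v1") as reader:
    loader = DataLoader(reader, batch_size=256)
    for batch in loader:
        # Each batch is a dict of column name -> torch tensor.
        print(batch.keys())
        break
```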

LuisMoralesAlonso commented 3 years ago

Any comments here?

dmpetrov commented 3 years ago

@LuisMoralesAlonso sorry for the delay.

I'm trying to understand where you already have data versioning and where it needs to be introduced. So far, it seems like DVC and Hudi have somewhat different purposes, and I'm trying to understand your scenario (and Hudi) better.

Does Hudi have proper versioning? I'm not a Hudi expert, but it seems like it can efficiently support the latest version but not the whole history.
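For context, the kind of point-in-time read I'm asking about would look roughly like this with Hudi's Spark datasource (the path and commit instant are placeholders, and support for this option depends on the Hudi version in use):

```python
# Illustrative only: a point-in-time ("time travel") read with Hudi's
# Spark datasource. The path and commit instant are placeholders, and
# availability of this option depends on the Hudi version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-time-travel").getOrCreate()

snapshot_df = (spark.read.format("hudi")
    .option("as.of.instant", "20201001000000")
    .load("s3://lake/datascience-hub/features"))
```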

  • We have a data lake based on Hive + the Parquet format. All of our use cases will be built on this "external data". We've organized this data lake into several datahubs, based on functional requirements.

1.- We are currently using Parquet for our entire data lake, so we want to use it as much as possible. At the same time, we want to version the features we are using for our ML projects; that's the reason to think about Apache Hudi as our primary format for this datascience-hub. We want this data governed.

Are you building/deriving the features for the datascience-hub from the regular tables/datahubs or from some other sources/streaming? Do you have any versioning for the regular datahubs/tables?

2.- Once we need to train a model in a particular project, we would materialize the needed features from the datascience-hub.... This will be more ephemeral.

Would you like to create a version of a Hudi "table" on request?

4.- For real time, ... the features will be replicated (or calculated) to coordinate both the training and the different serving options (inference).

This is usually done with true streaming. I thought that Hudi couldn't handle this level of latency, but I'm not an expert in Hudi.

PS: It can be way more efficient to schedule a chat - please feel free to shoot me an email to my-first-name at iterative.ai or DM at https://twitter.com/fullstackml

efiop commented 3 years ago

Closing as stale. Please feel free to reopen.