google / ml-metadata

For recording and retrieving metadata associated with ML developer and data scientist workflows.
https://www.tensorflow.org/tfx/guide/mlmd
Apache License 2.0

Support for NoSQL databases? #17

Open jinnovation opened 4 years ago

jinnovation commented 4 years ago

Currently, ML Metadata requires transactional databases, which seems to preclude using NoSQL databases such as Cassandra or older versions of MongoDB as storage layers for ML Metadata. This precludes my team at Twitter from using preexisting infrastructure as a storage layer for TFX metadata.

Are there any plans to support NoSQL databases in the future? If so, when?

zhitaoli commented 4 years ago

Hi @jinnovation, sorry to hear this. At this point, we really want to make sure that MLMD uses a backend with transaction support, so we can ensure data consistency across failures. We see a few options:

  1. Is there a possibility you can find a storage system that supports transactions? If the system uses SQL, the work to support it would be much smaller; otherwise we can investigate further to understand how much work is necessary.
  2. MLMD already supports SQLite (on a single machine). Depending on your operational procedures, I wonder whether we can find a workable approach that uses SQLite to back MLMD.
  3. We have an engineer who is integrating MLMD into Kubeflow to make this possible on Kubernetes. If you can use Kubernetes, this may be another option.
  4. If you can use Google's managed SQL (or any other public cloud's), we can look into expanding the connection config to support connecting to a hosted MySQL/Spanner/Postgres.

We are happy to further discuss this, and welcome contributions if we can agree on a path listed above which can unblock your team.

jinnovation commented 4 years ago

Thanks for the suggestions @zhitaoli. I realized that my previous comment was slightly misleading, so I wanted to clarify my and my team's motivations.

Specifically, we'd like to use Manhattan, our internal NoSQL storage system, as a backing layer for ML Metadata. As such, what we're looking for is not so much support for any specific 3rd-party NoSQL system, but rather for generic, custom backends. To add to that, we currently have a metadata-store component that's backed by Manhattan; our higher goal is to unify metadata storage.

Hope that helps.

zhitaoli commented 4 years ago

@jinnovation, thanks for the pointer.

Given that Manhattan seems to be a closed-source project, I imagine this cannot be done on our side and would have to remain a closed-source extension to MLMD.

The idea of allowing custom backend handlers has come up in my syncs with @hughmiao and others, and he can assess how much work it would take to enable a plugin-style design for injecting support for a different storage backend. Otherwise, you would have to fork MLMD and maintain the storage integration yourself, which is certainly unpleasant and carries the risk of further drift.

One thing I would suggest checking is whether the storage system you choose has transaction support: much of MLMD's functionality, as well as the workflows TFX::OSS builds on top of it, relies on atomically creating or updating multiple entities in MLMD in one transaction. If that is not possible, the system could be left in a corrupt state that is very difficult to self-heal or recover from. You don't need to disclose this information to us, but you should definitely understand the risk.
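To illustrate what "atomically creating/updating multiple entities in one transaction" buys you, here is a minimal sketch using Python's standard `sqlite3` module. The three-table schema and the `record_run` helper are hypothetical simplifications for this example and are not MLMD's actual schema or API.

```python
import sqlite3

# Hypothetical, simplified schema; MLMD's real tables differ.
SCHEMA = """
CREATE TABLE artifact(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE execution(id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE event(artifact_id INTEGER, execution_id INTEGER);
"""

def open_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def record_run(conn, artifact_name, execution_name, fail_midway=False):
    """Write an artifact, an execution, and the event linking them as one unit."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute("INSERT INTO artifact(name) VALUES (?)", (artifact_name,))
        artifact_id = cur.lastrowid
        if fail_midway:
            raise RuntimeError("simulated crash between writes")
        cur = conn.execute("INSERT INTO execution(name) VALUES (?)", (execution_name,))
        execution_id = cur.lastrowid
        conn.execute(
            "INSERT INTO event(artifact_id, execution_id) VALUES (?, ?)",
            (artifact_id, execution_id),
        )

def row_counts(conn):
    tables = ("artifact", "execution", "event")
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}

conn = open_store()
record_run(conn, "model", "trainer")
try:
    record_run(conn, "model-v2", "trainer-v2", fail_midway=True)
except RuntimeError:
    pass  # the half-written artifact was rolled back
print(row_counts(conn))  # {'artifact': 1, 'execution': 1, 'event': 1}
```

The failed second call leaves no trace in the store. On a backend without transactions, the `model-v2` artifact would remain without its execution or event.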

hughmiao commented 4 years ago

Thanks for your interest, @jinnovation. It's great to hear about the effort to unify the metadata model and storage. We are working towards the same goal here and are excited to get to know your work.

On extending MLMD, here are some general comments about the overall framework and its extensible layers, which may be useful for the community. I also left some thoughts on your specific case at the end.

At the user-facing layer, MLMD provides a unified data model and a set of APIs, which are defined here, implemented in C++ (server and library) and wrapped with SWIG for different client languages (Python, Go). As long as the API and data model are unified, orchestration and analytics tooling can be shared.

The set of APIs and the data model are implemented via two additional layers:

Each layer is extensible. For example, at the backend persistence layer, supporting a new relational and transactional backend only requires extending the metadata_source and fixing some query dialects in the access layer. An extension like that is several hundred lines of C++; see, e.g., sqlite and mysql.
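As a rough Python analogy for what extending the metadata_source involves, the sketch below defines an abstract source with connect, query, and transaction hooks, plus a SQLite-backed subclass. The class and method names are assumptions made for illustration; the real extension point is a C++ class whose exact interface should be checked in the MLMD source.

```python
import sqlite3
from abc import ABC, abstractmethod

class MetadataSource(ABC):
    """Hypothetical stand-in for a relational, transactional backend connection."""

    @abstractmethod
    def connect(self): ...

    @abstractmethod
    def execute_query(self, query, params=()): ...

    @abstractmethod
    def begin(self): ...

    @abstractmethod
    def commit(self): ...

    @abstractmethod
    def rollback(self): ...

class SqliteMetadataSource(MetadataSource):
    def __init__(self, path=":memory:"):
        self._path = path
        self._conn = None

    def connect(self):
        # isolation_level=None puts sqlite3 in autocommit mode, so the
        # transaction boundaries are exactly the BEGIN/COMMIT we issue.
        self._conn = sqlite3.connect(self._path, isolation_level=None)

    def execute_query(self, query, params=()):
        return self._conn.execute(query, params).fetchall()

    def begin(self):
        self._conn.execute("BEGIN")

    def commit(self):
        self._conn.execute("COMMIT")

    def rollback(self):
        self._conn.execute("ROLLBACK")

source = SqliteMetadataSource()
source.connect()
source.execute_query("CREATE TABLE artifact(id INTEGER PRIMARY KEY, uri TEXT)")
source.begin()
source.execute_query("INSERT INTO artifact(uri) VALUES (?)", ("/tmp/model",))
source.rollback()
print(source.execute_query("SELECT COUNT(*) FROM artifact"))  # [(0,)]
```

A new SQL backend would supply another subclass of this kind, while the layers above stay unchanged apart from dialect differences in the queries they issue.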

If a new backend is not relational and has no declarative language layer (e.g., no SQL support for the NoSQL backend here), then the domain object access layer needs to be extended, i.e., by implementing the CRUD calls for domain data models such as FindTypeById, CreateArtifact ...
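A minimal sketch of that idea, assuming a plain key-value backend (here just a Python dict): the CRUD calls are written directly against keys instead of SQL. Only the names FindTypeById and CreateArtifact come from the comment above; the class, the `create_type` helper, the key layout, and the record fields are hypothetical.

```python
from itertools import count

class KeyValueAccessObject:
    """Hypothetical domain object access layer over a key-value store."""

    def __init__(self, kv=None):
        # `kv` stands in for a NoSQL client; a dict is enough for the sketch.
        self._kv = {} if kv is None else kv
        self._ids = count(1)

    def create_type(self, name):
        type_id = next(self._ids)
        self._kv[("type", type_id)] = {"id": type_id, "name": name}
        return type_id

    def find_type_by_id(self, type_id):
        # Returns None when the type does not exist.
        return self._kv.get(("type", type_id))

    def create_artifact(self, type_id, uri):
        # Without foreign keys, the access layer must check references itself.
        if self.find_type_by_id(type_id) is None:
            raise KeyError(f"unknown type id {type_id}")
        artifact_id = next(self._ids)
        self._kv[("artifact", artifact_id)] = {
            "id": artifact_id,
            "type_id": type_id,
            "uri": uri,
        }
        return artifact_id

access = KeyValueAccessObject()
model_type = access.create_type("Model")
artifact_id = access.create_artifact(model_type, "/tmp/model")
print(access.find_type_by_id(model_type)["name"])  # Model
```

Note that the existence check and the write in `create_artifact` are two separate operations, so integrity checks a relational backend would enforce have to be handled, and raced against, in this layer.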

If the backend primitives and data organization do not fit well with the list of calls in the domain object access layer, then the APIs need to be reimplemented by extending the high-level store interface. By doing so, you can at least reuse the tests, the SWIG wrapping for client libraries, the gRPC server, and the release scripts.

Back to the specific case of a NoSQL backend without transaction support: as illustrated above, it is possible to extend the domain object access layer, or to extend the store directly, and drop the atomicity guarantee of the MLMD APIs. One concern is that this may hurt utility because of dirty or partial metadata ingested into the store. It then becomes the metadata ingestion clients' problem to ensure data consistency and clean up when needed. As @zhitaoli mentioned, one use of MLMD is as a backend for the distributed components of TFX pipelines. Ingestion happens during pipeline runs, and the ingested artifacts and execution states need to be consistent for the orchestrator to behave correctly. If you also intend to use MLMD together with TFX or other workflow orchestrators, the backend's lack of transaction capability should be considered beforehand.
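To make the "clients' problem" concrete, here is a small sketch of what partial ingestion and client-side cleanup can look like when the backend offers no transactions. The in-memory store, the `ingest` and `remove_orphan_artifacts` helpers, and the rule "an artifact with no event is an orphan" are all assumptions chosen for illustration, not MLMD behavior.

```python
def new_store():
    # Stand-in for a non-transactional backend: three independent collections.
    return {"artifact": [], "execution": [], "event": []}

def ingest(store, artifact, execution, fail_midway=False):
    """Three separate writes; a crash in between leaves a partial record."""
    store["artifact"].append(artifact)
    if fail_midway:
        raise RuntimeError("simulated crash between writes")
    store["execution"].append(execution)
    store["event"].append((artifact, execution))

def remove_orphan_artifacts(store):
    """Client-side cleanup: drop artifacts that no event refers to."""
    linked = {artifact for artifact, _ in store["event"]}
    orphans = [a for a in store["artifact"] if a not in linked]
    store["artifact"] = [a for a in store["artifact"] if a in linked]
    return orphans

store = new_store()
ingest(store, "model", "trainer")
try:
    ingest(store, "model-v2", "trainer-v2", fail_midway=True)
except RuntimeError:
    pass
print(store["artifact"])               # ['model', 'model-v2']  (dirty state)
print(remove_orphan_artifacts(store))  # ['model-v2']
print(store["artifact"])               # ['model']
```

Even this simple cleanup is fragile: a writer that is still between its first and last write looks identical to a crashed one, so a real client would need some extra signal (timestamps, leases, or a completion marker) before deleting anything.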