opendatahub-io-contrib / data-mesh-pattern

Data Mesh Pattern
https://opendatahub-io-contrib.github.io/data-mesh-pattern
Apache License 2.0

Foundation Models Integration - Data Contributions #71

Open caldeirav opened 1 year ago

caldeirav commented 1 year ago

The Data Mesh pattern should provide a way for data product owners to contribute curated data for LLM training. A good reference is the datalake approach used by gpt4all:

https://github.com/nomic-ai/gpt4all-datalake

neoxu999 commented 1 year ago

The gpt4all-datalake provides an API for contributing data: https://api.gpt4all.io/v1/ingest/chat

{
  "source": "gpt4all-chat",
  "submitter_id": "EliteHacker#42",
  "agent_id": "gpt4all-j-v1.2-jazzy",
  "ingest_id": "string",
  "conversation": [
    {
      "content": "Hello, how can I assist you today?",
      "role": "assistant",
      "rating": "negative",
      "edited_content": "Hello, how may I assist you today?"
    },
    {
      "content": "Write me python code to contribute data to the GPT4All Datalake!",
      "role": "user"
    }
  ],
  "prompt_template": "string"
}
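As a minimal sketch of how a data product owner might assemble and send the request body quoted in this comment, the Python below builds the same fields and posts them as JSON. The function names `build_chat_payload` and `submit_chat` are hypothetical, the check that each turn carries `content` and `role` is an assumption rather than a documented rule, and whether the endpoint is still live or needs authentication is not confirmed here.

```python
import json
import urllib.request

# Endpoint as quoted in the comment above; availability is not verified.
INGEST_URL = "https://api.gpt4all.io/v1/ingest/chat"


def build_chat_payload(submitter_id, agent_id, conversation,
                       ingest_id, prompt_template):
    """Assemble a request body with the same fields as the example above."""
    for turn in conversation:
        # Assumption: every turn needs at least "content" and "role".
        if "content" not in turn or "role" not in turn:
            raise ValueError("each turn needs 'content' and 'role'")
    return {
        "source": "gpt4all-chat",
        "submitter_id": submitter_id,
        "agent_id": agent_id,
        "ingest_id": ingest_id,
        "conversation": list(conversation),
        "prompt_template": prompt_template,
    }


def submit_chat(payload, url=INGEST_URL, timeout=10):
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the payload builder separate from the network call makes it possible to validate contributed records inside a pipeline before anything leaves the cluster.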

I compared several vector databases: Weaviate, Pinecone and Chroma. Weaviate has a native REST API for creating objects, which is very convenient and worth trying: https://weaviate.io/developers/weaviate/api/rest/batch
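To illustrate the batch REST API mentioned here, the sketch below wraps plain records in the body shape accepted by Weaviate's `POST /v1/batch/objects` endpoint. The base URL, the helper names `build_batch` and `send_batch`, and the class name used in the usage note are assumptions for illustration; a real deployment would also need a schema and possibly an API key.

```python
import json
import urllib.request

# Assumption: a Weaviate instance reachable at this address.
WEAVIATE_URL = "http://localhost:8080"


def build_batch(class_name, records):
    """Wrap plain dicts in the body shape used by POST /v1/batch/objects."""
    return {
        "objects": [
            {"class": class_name, "properties": dict(record)}
            for record in records
        ]
    }


def send_batch(batch, base_url=WEAVIATE_URL, timeout=10):
    """POST a batch to Weaviate and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/batch/objects",
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `build_batch("DataProductRecord", [{"title": "Q1 sales", "owner": "finance"}])` yields a single-object batch for a hypothetical `DataProductRecord` class, which `send_batch` would then submit.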

For search, Weaviate's GraphQL API is very useful for integration: https://weaviate.io/developers/weaviate/api/graphql
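A rough sketch of what a search through the GraphQL API could look like: the code builds a `Get` query with a `nearVector` filter and posts it to `/v1/graphql`. The helper names, the base URL, and the class and field names in the usage note are hypothetical, and the query vector is assumed to come from whatever embedding model the deployment uses.

```python
import json
import urllib.request

# Assumption: a Weaviate instance reachable at this address.
WEAVIATE_URL = "http://localhost:8080"


def build_near_vector_query(class_name, fields, vector, limit=3):
    """Build a GraphQL Get query that ranks objects by vector distance."""
    vector_literal = json.dumps(list(vector))  # JSON list is a valid GraphQL list
    field_list = " ".join(fields)
    query = (
        "{ Get { %s(nearVector: {vector: %s}, limit: %d) "
        "{ %s _additional { distance } } } }"
        % (class_name, vector_literal, limit, field_list)
    )
    return {"query": query}


def run_query(body, base_url=WEAVIATE_URL, timeout=10):
    """POST the query to the GraphQL endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/graphql",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `build_near_vector_query("DataProductRecord", ["title", "owner"], [0.1, 0.2])` produces a request body that `run_query` can send; `nearVector` is used here because it does not depend on a vectorizer module being enabled.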

Data product owners can easily submit their data to the Weaviate vector database.

neoxu999 commented 1 year ago

Hi @caldeirav,

I'd like to install the Weaviate vector database on Red Hat AI and show examples of how to send data to Weaviate. What do you reckon?

Many thanks, Neo

caldeirav commented 1 year ago

@neoxu999 Weaviate looks like a good candidate. I think the key is to first ensure we can integrate the vector database with our MLOps automation; once this is successful, we can start looking at data contributions and data tracing / lineage requirements in detail.

caldeirav commented 1 year ago

@neoxu999 Do you think it is possible to introduce Weaviate into the Data Mesh pattern deployment now? As we are installing a new instance, we can then start to run simple examples such as the ones in the OpenAI Cookbook, before we introduce our own training pipeline.

Reference: https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases

neoxu999 commented 1 year ago

@caldeirav Good to know we have a new instance. Yes, I can try the OpenAI Cookbook examples before installing Weaviate on the Data Mesh Pattern.