run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License
3.44k stars 729 forks

Discuss: Support Llamaindex connector for MeltanoHub, which has 500+ open source Singer data connectors #242

Open aaronsteers opened 1 year ago

aaronsteers commented 1 year ago

A generic interface into hub.meltano.com would be great. In that paradigm, the source connectors are called "extractors" or "taps".

There are a few different ways we could create generic connection interfaces, which I can highlight below...

Generally though, each connector would need:

  1. The connector {variant}/{name} string combo, and/or pip_url of the connector.
  2. The config info for the connector, which is generally passed as a JSON file, but which could be defined by users in a Python dictionary object or an array of key-value pairs.
  3. Optionally: the stream and property selection rules, either as a glob of inclusion/selection rules, or as a Singer "catalog" JSON artifact. Perhaps not needed in a V1, but these could let users pick and choose which datasets and/or properties they are interested in.
    • A simple "V1" MVP might ask for a single stream name, or an array of stream names.
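The three inputs listed above could be bundled into a single object. The following is only a sketch: the class name `SingerTapSpec`, its field names, and the `hub_id` helper are hypothetical and not part of LlamaIndex, Meltano, or Singer.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SingerTapSpec:
    """Hypothetical container for the inputs a generic Singer loader would need."""

    name: str                          # connector name, e.g. "tap-asana"
    variant: Optional[str] = None      # MeltanoHub variant, if known
    pip_url: Optional[str] = None      # alternative: install location of the tap
    config: Dict[str, Any] = field(default_factory=dict)  # tap settings
    streams: List[str] = field(default_factory=list)      # "V1" stream selection

    def __post_init__(self) -> None:
        # The connector must be identifiable by variant/name, by pip_url, or both.
        if not (self.variant or self.pip_url):
            raise ValueError("Provide a variant, a pip_url, or both.")

    @property
    def hub_id(self) -> Optional[str]:
        """Return the '{variant}/{name}' combo, or None if no variant is set."""
        return f"{self.variant}/{self.name}" if self.variant else None
```

An empty `streams` list could be read as "all streams", leaving room for glob rules or a full Singer catalog in a later version.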

An example:

tap-asana - Meltano Hub

Connector info:

Sample config:

import os

# Credentials are read from environment variables rather than hard-coded.
asana_config = {
    "client_id": os.environ.get("TAP_ASANA_CLIENT_ID"),
    "client_secret": os.environ.get("TAP_ASANA_CLIENT_SECRET"),
    "refresh_token": os.environ.get("TAP_ASANA_REFRESH_TOKEN"),
}

Processing Singer output

Singer outputs data as a series of JSON lines, generally one record per line, which should be easy for the libraries to parse generically.
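As a rough illustration of that parsing step, the sketch below reads Singer messages line by line and keeps only the `RECORD` messages, optionally filtered by stream name. The function name `iter_singer_records` is hypothetical; the `type`, `stream`, and `record` keys follow the Singer message format.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def iter_singer_records(
    lines: Iterable[str],
    streams: Optional[Iterable[str]] = None,
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (stream_name, record) pairs from Singer JSON-lines output."""
    wanted = set(streams) if streams else None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        message = json.loads(line)
        if message.get("type") != "RECORD":
            continue  # skip SCHEMA, STATE, and any other message types
        stream = message["stream"]
        if wanted is not None and stream not in wanted:
            continue  # stream was not selected
        yield stream, message["record"]


# Example input: made-up messages, not real tap output.
sample_output = [
    '{"type": "SCHEMA", "stream": "tasks", "schema": {"properties": {}}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "tasks", "record": {"id": 1, "name": "Write docs"}}',
    '{"type": "STATE", "value": {}}',
]

for stream_name, record in iter_singer_records(sample_output, streams=["tasks"]):
    print(stream_name, record)
```

Each yielded record is a plain dictionary, so a loader could turn it into a document by serializing it or by picking out specific text fields.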

List of connectors:

https://hub.meltano.com/extractors

This isn't a full list, since many are being created that aren't already on the Hub, but it gives a good idea of the existing depth and breadth of the ecosystem.

How to list on LlamaHub

To avoid spamming the index, we could list this as a single item on LlamaHub: either as "MeltanoHub Singer Taps", or "Singer Extractors" generically, or similar.

Thinking about the "right" abstraction layer

I think this could be really powerful, since it could plug LlamaIndex, LangChain, and other GPT-like applications into a broad ecosystem of already existing connectors.

Since the vast majority of Singer connectors are already pip installable, this should fit well with existing paradigms that LlamaIndex is using.
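To make the pip-installable point concrete, here is a minimal sketch of how a loader might invoke an installed tap as a subprocess. It assumes the tap exposes a command-line entry point that accepts the conventional Singer `--config <file>` argument; the function name `run_tap` is hypothetical, and installing the tap itself (for example into its own virtual environment) is left out.

```python
import json
import os
import subprocess
import tempfile
from typing import Any, Dict, Iterator, List


def run_tap(command: List[str], config: Dict[str, Any]) -> Iterator[str]:
    """Run a Singer tap and yield its stdout one line at a time.

    `command` is the executable plus any fixed arguments, e.g. ["tap-asana"].
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Singer taps conventionally read their settings from a JSON file.
        config_path = os.path.join(tmp_dir, "config.json")
        with open(config_path, "w", encoding="utf-8") as config_file:
            json.dump(config, config_file)

        process = subprocess.Popen(
            command + ["--config", config_path],
            stdout=subprocess.PIPE,
            text=True,
        )
        assert process.stdout is not None
        for line in process.stdout:
            yield line.rstrip("\n")

        # Surface tap failures instead of silently returning partial data.
        if process.wait() != 0:
            raise RuntimeError(f"Tap exited with code {process.returncode}")
```

Running each tap in a separate process keeps its dependencies isolated from the host application, which matters because different taps may pin conflicting library versions.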

I may have some cycles to contribute to this integration but I first wanted to log this issue here to assess interest level, and discuss if there are any potential pitfalls or "gotchas" that others might see.

tayloramurphy commented 1 year ago

We'd be eager to chat about how we could help with this. The wider Singer community has put a lot of work into extractors and if we could make it easier to use them for LLM applications that'd be a huge win for everyone!