h2oai / h2ogpt

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

SQL DB #357

Open pseudotensor opened 1 year ago

pseudotensor commented 1 year ago

https://arxiv.org/abs/2306.03341 https://arxiv.org/abs/2212.14024

https://bird-bench.github.io/

https://dev.to/ngonidzashe/speak-your-queries-how-langchain-lets-you-chat-with-your-database-p62 https://github.com/imartinez/privateGPT/issues/616

https://musings.yasyf.com/compressgpt-decrease-token-usage-by-70/

https://github.com/vnk8071/E2E-AI-Chatbot

https://github.com/dorianbrown/rank_bm25

https://github.com/ocrmypdf/OCRmyPDF

https://github.com/h2oai/helium/issues/8

https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings

https://github.com/styczynski/chatdb https://github.com/chat2db/Chat2DB

https://arxiv.org/abs/2306.03901

https://github.com/questdb/questdb

pseudotensor commented 1 year ago

https://h2oai.slack.com/archives/C050PKJ6GAX/p1689022427466269

For SQL, we can try flan-t5-xxl or Flan-UL2 and fine-tune them on the Bird dataset. Those models are known to do surprisingly well compared to much larger models; even t5-3B does reasonably:

https://bird-bench.github.io/
https://declare-lab.net/instruct-eval/
https://medium.com/@bnjmn_marie/behind-the-hype-models-based-on-t5-2019-still-better-than-vicuna-alpaca-mpt-and-dolly-6c4f1139f39e

Note that flan models have a 2048-token input context and a 512-token output context for sequence-to-sequence generation. That should be enough for many cases, although fine-tuning with a different output context length is possible.
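A minimal sketch of how those context limits would shape the fine-tuning data: each Bird-style record (schema, question, SQL) becomes one seq2seq pair, with the prompt clipped to the encoder context and over-long targets dropped. The record field names and prompt template here are assumptions, and token counts are approximated by whitespace splitting; a real run would use the flan-t5 tokenizer.

```python
# Sketch only: whitespace "tokens" stand in for real tokenizer counts,
# and the prompt template is an assumption, not the project's format.

MAX_INPUT_TOKENS = 2048   # flan-t5 encoder context
MAX_OUTPUT_TOKENS = 512   # flan-t5 decoder context

def build_example(schema: str, question: str, sql: str):
    """Format one text-to-SQL pair for seq2seq fine-tuning."""
    prompt = (
        "Translate the question to SQL.\n"
        f"Schema: {schema}\n"
        f"Question: {question}"
    )
    # Rough length guard; replace with tokenizer-based truncation in practice.
    prompt_tokens = prompt.split()
    if len(prompt_tokens) > MAX_INPUT_TOKENS:
        prompt = " ".join(prompt_tokens[:MAX_INPUT_TOKENS])
    # Targets that cannot fit the decoder context are skipped entirely,
    # since truncated SQL would train the model on broken queries.
    if len(sql.split()) > MAX_OUTPUT_TOKENS:
        return None
    return {"input": prompt, "target": sql}

example = build_example(
    "CREATE TABLE users (id INT, name TEXT)",
    "How many users are there?",
    "SELECT COUNT(*) FROM users",
)
```

Dropping (rather than truncating) over-long targets is a design choice: a clipped SQL string is never a valid training label, whereas a clipped schema prompt can still be informative.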

https://huggingface.co/datasets/wikisql https://paperswithcode.com/dataset/kaggledbqa

pseudotensor commented 1 year ago

SQL: https://huggingface.co/datasets/wikisql SQL: https://paperswithcode.com/dataset/kaggledbqa (https://github.com/chiahsuan156/KaggleDBQA/blob/main/KaggleDBQA_tables.json) SQL: https://huggingface.co/datasets/NumbersStation/NSText2SQL SQL: https://huggingface.co/datasets/sede

NidhiMehta commented 1 year ago

https://huggingface.co/datasets/b-mc2/sql-create-context

NidhiMehta commented 1 year ago

https://huggingface.co/datasets/spider ref : https://www.anyscale.com/blog/fine-tuning-llama-2-a-comprehensive-case-study-for-tailoring-models-to-unique-applications

pseudotensor commented 1 year ago

https://github.com/defog-ai/sqlcoder

pseudotensor commented 1 year ago

https://medium.com/dataherald/fine-tuning-gpt-3-5-turbo-for-natural-language-to-sql-4445c1d37f7c

rkeshwani commented 1 year ago

Uhh, I realise this issue is a link dump for enabling this functionality, but seeing as h2ogpt integrates with LangChain, I thought I would post this here.

https://python.langchain.com/docs/use_cases/qa_structured/sql

Are there any plans to implement any functionality to support SQL databases? From my understanding, LangChain integrates with SQLAlchemy, so you could support various databases and let the user supply the connection string, or credentials and host?
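The flow described above (user supplies a connection string, the chain inspects the schema, generates SQL, and executes it) can be sketched with the stdlib alone. This is not h2ogpt's or LangChain's implementation; `generate_sql` is a hypothetical stand-in for the LLM call, and sqlite3 stands in for the SQLAlchemy engine that a real integration would build from the connection string.

```python
import sqlite3

def generate_sql(schema: str, question: str) -> str:
    """Hypothetical placeholder for the LLM call that turns
    (schema, question) into a SQL query."""
    return "SELECT COUNT(*) FROM users"

def answer_over_db(conn_string: str, question: str) -> list:
    """Sketch of the SQL Q&A loop: inspect schema, ask the model
    for SQL, execute it, return the rows for the answer step."""
    conn = sqlite3.connect(conn_string)
    try:
        # 1. Gather the schema so the model's SQL is grounded in real tables.
        schema = "\n".join(
            row[0]
            for row in conn.execute(
                "SELECT sql FROM sqlite_master WHERE type='table'"
            )
        )
        # 2. Model turns (schema, question) into SQL -- stubbed here.
        sql = generate_sql(schema, question)
        # 3. Execute the generated query and hand the rows back.
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

In a real integration the connection string would go to `sqlalchemy.create_engine` (which is what LangChain's SQL support wraps), so any SQLAlchemy-supported backend would work without extra per-database code.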

pseudotensor commented 1 year ago

There are no immediate plans to enable, but a PR is welcome :)

There is also a WIP PR for Elasticsearch that seems to function; it just needs to be exposed in the UI, etc. https://github.com/h2oai/h2ogpt/pull/656