vanna-ai / vanna

🤖 Chat with your SQL database 📊. Accurate Text-to-SQL Generation via LLMs using RAG 🔄.
https://vanna.ai/docs/
MIT License
9.97k stars · 737 forks

made training call more robust. #370

Closed thisismygitrepo closed 1 month ago

thisismygitrepo commented 3 months ago

If you have an extremely large number of SQL statements / docs / plans / examples etc. (typically above a thousand), the probability of hitting this error becomes very high (practically inevitable):

HTTPSConnectionPool(host='ask.vanna.ai', port=443): Max retries exceeded with url: /rpc (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1006)')))

I solved this problem by making the call robust. There are libraries for this, but rather than adding dependencies and complexity to the project, I added my own implementation.

Note: I fixed only `train(plan=plan)`. The same needs to be done for all other sections of the `train` method (i.e. wherever there is a loop of API calls).
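For illustration, a robust call wrapper along these lines could look like the following sketch (the function name `robust_call` and its parameters are hypothetical; the actual implementation in the PR may differ, and in practice you would catch the specific `requests` exceptions rather than bare `Exception`):

```python
import time

def robust_call(fn, *args, max_retries=5, backoff=1.0, **kwargs):
    """Call fn, retrying with exponential backoff on transient failures.

    Sketch only: in a real client you would narrow the except clause to
    transient network errors (e.g. requests.exceptions.ConnectionError,
    requests.exceptions.SSLError) instead of catching Exception.
    """
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            time.sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ... by default

# Example: a flaky function that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

result = robust_call(flaky, max_retries=5, backoff=0.0)
```

Wrapping each per-item API call inside the training loops with something like this keeps one transient SSL/EOF failure from aborting an hours-long training run.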

thisismygitrepo commented 3 months ago

While we're at the for loops, one should consider adding a progress visualizer. With thousands or more items, the user has no clue whether the app is hanging or making progress in training. I recommend tqdm unless there is something simpler.
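tqdm is the obvious choice, but if avoiding dependencies is a goal (as with the retry logic above), a minimal stdlib progress indicator is only a few lines. This is a hedged sketch, not code from the PR:

```python
import sys

def progress(iterable, total=None, width=40):
    """Yield items from iterable while drawing a simple text progress bar.

    Minimal stdlib stand-in for tqdm; tqdm adds rate, ETA, and nesting.
    """
    items = list(iterable) if total is None else iterable
    total = total if total is not None else len(items)
    for i, item in enumerate(items, 1):
        filled = width * i // total
        bar = "#" * filled + "." * (width - filled)
        sys.stderr.write(f"\r[{bar}] {i}/{total}")
        sys.stderr.flush()
        yield item
    sys.stderr.write("\n")

# Usage: wrap the training loop's iterable
results = [x * x for x in progress(range(100))]
```

Either way, the user sees at a glance that the training loop is advancing rather than hung on a stalled connection.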

zainhoda commented 3 months ago

Thanks for this -- I think you're the first user to experience this. I'd be curious how your experience was after you trained? In most other cases we usually recommend that people "start small" with a specific subset of data and then expand gradually as the accuracy improves.

thisismygitrepo commented 3 months ago

I came to the same conclusion; it seems I'm the first one to try this out on a massive SQL database. For context, I have a state-wide department of health database with 4k tables that is a spaghetti monster, and the provided train methods all fail with the max-retries error due to the large number of calls.

To your question: it worked on simple queries, but for seriously complex stuff that involves a significant amount of corporate knowledge (e.g. how many patients with dxg insulin results exceeded that level, provided they went to service x over the past three months in facility y), this is when it starts to crack (using GPT-4 Turbo). I'm thinking a larger context window would improve it, judging by the simple errors it's making (like referencing a column that doesn't exist).

I'm not sure if you are hinting that more data may reduce accuracy.

zainhoda commented 1 month ago

After looking at this again -- nobody else has come across this issue. I think the right place for this would potentially be not in VannaBase but rather in the specific implementation of the vector database interaction.

Closing for now. Feel free to reopen if the issue comes up again.