gptscript-ai / knowledge

Knowledge for GPTScript
https://gptscript-ai.github.io/knowledge/
Apache License 2.0

Feat: Improve ingestion speed by running more goroutines #25

Closed StrongMonkey closed 3 months ago

StrongMonkey commented 3 months ago

We should run more goroutines to parse documents and create embeddings for them, which speeds up ingestion. Most of the operations are not CPU-bound. In particular, when creating embeddings for documents, we don't need to be constrained by the number of cores the system has, because we are mostly waiting on OpenAI API calls and are not spending CPU or I/O resources locally.

This also relies on https://github.com/iwilltry42/langchaingo/pull/1.

Tested with the new changes: ingestion time for a 3000-page PDF dropped from 400 seconds to 38 seconds, roughly a 10x speedup.

I will re-run the e2e tests, but this should not impact our ingestion quality.

StrongMonkey commented 3 months ago

@iwilltry42 Also, we need to have this PR merged first: https://github.com/iwilltry42/langchaingo/pull/1. I'm not sure what your plans are for merging that upstream.

iwilltry42 commented 3 months ago

@StrongMonkey once I'm back, I'll create another branch and change bases, so I can have my upstream PR and both of our changes in another branch 👍

iwilltry42 commented 3 months ago

Or faster - let's use your fork for now 🤔

StrongMonkey commented 3 months ago

Ok... once you approve, I will change the go.mod to point to my branch (at least for now).