i-am-bee / bee-api

API backend for Bee
Apache License 2.0

feat(extraction): Docling #49

Open pilartomas opened 2 weeks ago

pilartomas commented 2 weeks ago

Use Docling to perform document extraction.

  1. Take the Unstructured PoC and replace Unstructured with Docling.
  2. Store the DoclingDocument in its raw form, markdown form, and chunked form in S3.
  3. Update extraction utilities to use these representations.
  4. Update the Dockerfile and make sure the bootstrap data (loaded by Docling on first use) is stored in the image.

pilartomas commented 2 weeks ago

Note that I will further modify the Unstructured PoC to support polymorphism over extraction outputs. That way the Docling extraction backend will be able to coexist with the others (WDU, Unstructured).
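
To make the idea concrete, here is a minimal sketch of what polymorphism over extraction outputs could look like on the Python worker side. It is not the actual bee-api design: the class names, the `backend` discriminator, the `text_key` helper, and the S3 key layout under `extraction/<file_id>/` are all assumptions made for illustration. The only parts taken from this issue are the three Docling representations (raw, markdown, chunked) and the list of backends.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class DoclingExtraction:
    """Docling output: three representations, each stored under its own S3 key."""
    document_key: str  # raw DoclingDocument, serialized
    markdown_key: str  # markdown export
    chunks_key: str    # chunked form
    backend: Literal["docling"] = "docling"


@dataclass(frozen=True)
class UnstructuredExtraction:
    """Hypothetical shape of the Unstructured output: one stored artifact."""
    elements_key: str
    backend: Literal["unstructured"] = "unstructured"


@dataclass(frozen=True)
class WduExtraction:
    """Hypothetical shape of the WDU output: one stored artifact."""
    result_key: str
    backend: Literal["wdu"] = "wdu"


# Consumers accept any backend's output and branch on the concrete type.
Extraction = Union[DoclingExtraction, UnstructuredExtraction, WduExtraction]


def docling_extraction(file_id: str) -> DoclingExtraction:
    """Build the Docling record for a file, using an assumed key layout."""
    prefix = f"extraction/{file_id}"
    return DoclingExtraction(
        document_key=f"{prefix}/document.json",
        markdown_key=f"{prefix}/document.md",
        chunks_key=f"{prefix}/chunks.json",
    )


def text_key(extraction: Extraction) -> str:
    """Return the S3 key an extraction utility would read text from."""
    if isinstance(extraction, DoclingExtraction):
        return extraction.markdown_key
    if isinstance(extraction, UnstructuredExtraction):
        return extraction.elements_key
    if isinstance(extraction, WduExtraction):
        return extraction.result_key
    raise TypeError(f"unsupported extraction output: {extraction!r}")
```

With this shape, `text_key(docling_extraction("file_123"))` resolves to the markdown key, while the other backends keep their own single artifact. Adding a backend means adding one dataclass and one branch.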

PeterStaar-IBM commented 2 weeks ago

@pilartomas As discussed, we will add Docling by updating:

  1. src/files/entities/helpers.ts
  2. workers/python/python/extraction

pilartomas commented 2 weeks ago

When trying out Docling, I noticed that it downloaded several resources on the first run after installation. If that is the case, please make sure the download happens while building the Docker image.

PeterStaar-IBM commented 2 weeks ago

Yes, we will do that. It is the same as what we do for other customers.