Indexing: plaintext + PDF

untitled-pit-group / foxhound

PIFS standard backend

BSD Zero Clause License

Indexing: plaintext + PDF #7

Open paulsnar opened 2 years ago

paulsnar commented 2 years ago

This background job downloads a requested file from GCS, OCRs it with Tesseract/Leptonica if it isn't plaintext already, then submits the plaintext for indexing. Once indexing is done, the plaintext can be discarded.

Checkpoint progress (downloading, OCRing, indexing) should be reported via job infra that's yet to be defined.
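A rough sketch of how the three stages and their checkpoints could fit together, purely for discussion. Nothing here reflects the actual foxhound code: `IndexJob`, `Stage`, and the four injected callables (`download`, `ocr`, `index`, `report`) are hypothetical names, and the real GCS, Tesseract, and indexing calls are deliberately left behind those callables since the job infra isn't defined yet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Stage(Enum):
    # The three checkpoints named in this issue.
    DOWNLOADING = "downloading"
    OCRING = "ocring"
    INDEXING = "indexing"


@dataclass
class IndexJob:
    # All four callables are hypothetical stand-ins (assumptions):
    # download: fetches the object bytes (e.g. from GCS)
    # ocr:      turns non-plaintext bytes into text (e.g. via Tesseract)
    # index:    submits (object name, plaintext) to the indexer
    # report:   records a checkpoint with whatever job infra ends up existing
    download: Callable[[str], bytes]
    ocr: Callable[[bytes], str]
    index: Callable[[str, str], None]
    report: Callable[[str, Stage], None]

    def run(self, object_name: str, is_plaintext: bool) -> None:
        self.report(object_name, Stage.DOWNLOADING)
        data = self.download(object_name)

        if is_plaintext:
            # Plaintext skips OCR entirely; assumes UTF-8 input.
            text = data.decode("utf-8", errors="replace")
        else:
            self.report(object_name, Stage.OCRING)
            text = self.ocr(data)

        self.report(object_name, Stage.INDEXING)
        self.index(object_name, text)
        # `text` goes out of scope here, so the plaintext is not retained.
```

Passing the stages in as callables keeps the job testable without GCS or Tesseract, and the `report` hook can be swapped for the real progress mechanism once it exists.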

paulsnar commented 2 years ago

Plaintext done in e9be656ce6d86c9e25e4232a7c2ea008bc900a4b.