instructlab / ui

Place to hack on UI for InstructLab
Apache License 2.0

POC: Deep Search PDF to MD file conversion #33

Open nerdalert opened 5 days ago

nerdalert commented 5 days ago

WIP/POC for PDF -> MD file conversion using Deep Search

Issue #2

While the real endpoint is being established, this PR takes the client-side TypeScript library and adds a Go gin server that mocks the file conversion API responses.

All of this code infers the API from the client TypeScript lib in src/lib/api/deepsearch/index.ts, so it will certainly require some tuning once the endpoint is stood up. I had to make one adjustment to the getDocumentHashes DS4SD library code: removing the leading / before api/xxx in the path api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/transactions/${transactionId}, to prevent a double slash and the resulting 404. Possibly a typo in the library, TBD.
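For reviewers, here is a minimal sketch of how a double slash like this can arise and one way to guard against it. It assumes the base URL ends with a slash, which I have not confirmed in the DS4SD library; joinUrl is a hypothetical helper and is not part of that library or of this PR.

```typescript
// Hypothetical helper (not part of DS4SD): joins a base URL and a path with
// exactly one slash between them, regardless of how each side is written.
function joinUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop trailing slashes
  const trimmedPath = path.replace(/^\/+/, ""); // drop leading slashes
  return `${trimmedBase}/${trimmedPath}`;
}

// Assumed example values, taken from the mock server setup in this PR.
const base = "http://localhost:8080/";
const path =
  "/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/transactions/mock-transaction-id";

// Naive concatenation keeps both slashes: "http://localhost:8080//api/..."
const naive = base + path;

// The helper collapses them into a single slash.
const joined = joinUrl(base, path);
```

Dropping the leading slash in the library path, as done in this PR, has the same effect for this one call; a helper like this would only matter if the same pattern shows up in other paths.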

The UI components call the mock Deep Search API for PDF to Markdown conversion server-side in NextJS, using the operations listed in the curl commands below.

To start the Go mock Deep Search API server, run:

cd mock/go-deepseed/
go run deepseed-mock.go

Here are the curl commands for functional testing and for validating the API calls. They are also listed to clarify the operations in the src/app/api/conversion/route.ts code in this PR.

  1. Authenticate
curl -X POST http://localhost:8080/api/cps/user/v1/user/token \
    -H "Content-Type: application/json" \
    -H "Authorization: Basic $(echo -n 'your-username:your-api-key' | base64)"
  2. POST a PDF to convert
curl -X POST http://localhost:8080/api/cps/public/v1/project/mockProject/data_indices/mockIndex/actions/ccs_convert_upload \
    -H "Authorization: mock-token" \
    -H "Content-Type: application/json" \
    -d '{
          "file_url": ["data:application/pdf;base64,<your-base64-pdf>"]
        }'
  3. Wait for the task to complete (poll by task ID)
curl -X GET "http://localhost:8080/api/cps/public/v2/project/mockProject/celery_tasks/mock-task-id?wait=10" \
    -H "Authorization: mock-token"
  4. Get Document Hashes
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/transactions/mock-transaction-id \
    -H "Authorization: mock-token"
  5. Get Document Artifacts
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/mock-document-hash/artifacts \
    -H "Authorization: mock-token"
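To make the mapping between these curl calls and the TypeScript side easier to follow, here is a rough sketch of the same five endpoints as URL builders, plus the upload request body. The paths and the file_url shape are copied from the commands above; the names MockConfig, endpoints, and uploadBody are made up for illustration and are not taken from route.ts or the DS4SD library. Response shapes are left out on purpose, since they are still inferred.

```typescript
// Hypothetical config type; the field names are illustrative only.
interface MockConfig {
  baseUrl: string; // e.g. "http://localhost:8080" (no trailing slash)
  projKey: string;
  indexKey: string;
}

// Builds the URLs used by the five curl commands, in the same order.
function endpoints(cfg: MockConfig) {
  const v1 = `${cfg.baseUrl}/api/cps/public/v1/project/${cfg.projKey}`;
  const v2 = `${cfg.baseUrl}/api/cps/public/v2/project/${cfg.projKey}`;
  return {
    // 1. Authenticate
    token: `${cfg.baseUrl}/api/cps/user/v1/user/token`,
    // 2. POST a PDF to convert
    upload: `${v1}/data_indices/${cfg.indexKey}/actions/ccs_convert_upload`,
    // 3. Poll the conversion task
    task: (taskId: string, wait: number): string =>
      `${v2}/celery_tasks/${taskId}?wait=${wait}`,
    // 4. Get document hashes for a transaction
    hashes: (transactionId: string): string =>
      `${v2}/data_indices/${cfg.indexKey}/documents/transactions/${transactionId}`,
    // 5. Get the artifacts for one document
    artifacts: (documentHash: string): string =>
      `${v2}/data_indices/${cfg.indexKey}/documents/${documentHash}/artifacts`,
  };
}

// Request body for step 2: the PDF is sent as a base64 data URL.
function uploadBody(pdfBase64: string): { file_url: string[] } {
  return { file_url: [`data:application/pdf;base64,${pdfBase64}`] };
}

// Example using the mock values from the curl commands.
const api = endpoints({
  baseUrl: "http://localhost:8080",
  projKey: "mockProject",
  indexKey: "mockIndex",
});
const pollUrl = api.task("mock-task-id", 10);
```

Keeping the URL construction in one place like this should make it cheaper to adjust once the real endpoint is up and the inferred paths can be checked.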

Here is a screencap of a basic upload. Once we get a feel for the conversion times, we can consider integrating this directly into the knowledge submission form: the upload would convert the PDF doc to MD in place, then kick off the process of posting it to a repo and supplying the location and SHA of the docs.

https://github.com/instructlab/ui/assets/1711674/ed347bd2-2c2f-4c76-95a2-da0340237aeb

nerdalert commented 9 hours ago

Latest demo screen cap https://drive.google.com/file/d/1dal2nQHWr3ye1E_zCpIJzyqaOEy84Id5/view