WIP/POC for PDF -> MD file conversion using Deep Search
Issue #2
While the real endpoint is being stood up, this PR takes the client-side TypeScript library and adds a Go gin server that mocks the file conversion API responses.
All of this code infers the API from the client TypeScript lib in src/lib/api/deepsearch/index.ts, so it will certainly require some tuning once the endpoint gets stood up. I had to make one adjustment to the getDocumentHashes DS4SD library code: removing the / before api/xxx in path: `api/cps/public/v2/project/${projKey}/data_indices/${indexKey}/documents/transactions/${transactionId}` to prevent a double slash and a 404 from being returned. Possibly a typo, TBD.
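For illustration, here is a minimal sketch of one way to guard against the double slash regardless of how the path is written. The joinUrl helper is hypothetical and is not part of the DS4SD library or this PR; the base URL is the mock server address used in the curl commands.

```typescript
// Hypothetical helper (not from the DS4SD library): joins a base URL and a
// path with exactly one slash between them.
function joinUrl(base: string, path: string): string {
  const trimmedBase = base.replace(/\/+$/, ""); // drop trailing slashes from the base
  const trimmedPath = path.replace(/^\/+/, ""); // drop leading slashes from the path
  return `${trimmedBase}/${trimmedPath}`;
}

// Assumed base URL: the local mock server.
const mockBase = "http://localhost:8080/";

// Both spellings of the path resolve to the same URL.
const withLeadingSlash = joinUrl(mockBase, "/api/cps/public/v2/project/mockProject");
const withoutLeadingSlash = joinUrl(mockBase, "api/cps/public/v2/project/mockProject");
```

Normalizing at the join point would make the client tolerant of either path style, at the cost of hiding what may be a genuine typo upstream.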
The UI components call the mock Deep Search API for PDF to Markdown conversion via Next.js server-side rendering (SSR), with the following operations:
The upload component on the client side sends the file to the server side.
Implement a POST endpoint to handle PDF to Markdown conversion
Authenticate the user with the username and API key
Launch the conversion task and wait for its completion with the status/wait API call. The Go mock API server binary adds a 15-second delay, long enough for the client side to send a second GET v2/project/mockProject/celery_tasks/mock-task-id.
Retrieve the transaction ID from the completed task
Fetch document hashes using the transaction ID
Obtain document artifacts using the document hash
Return the document artifacts as the API response to the Next.js client side, which renders the results.
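The steps above can be sketched roughly as follows. This is only an outline under stated assumptions, not the actual route.ts code: apart from getDocumentHashes, every interface, method name, status value, and return shape here is hypothetical, and the real client in src/lib/api/deepsearch/index.ts may look quite different. The fake client at the bottom stands in for the mock server and reports the task as pending once before it succeeds, so the wait call is issued twice.

```typescript
// Hypothetical shapes; only getDocumentHashes is a name taken from the client lib.
interface TaskResult {
  status: "PENDING" | "SUCCESS" | "FAILURE"; // assumed status values
  transactionId?: string;
}

interface DeepSearchClient {
  authenticate(username: string, apiKey: string): Promise<string>; // returns a token
  launchConversion(token: string, file: Uint8Array): Promise<string>; // returns a task ID
  waitForTask(token: string, taskId: string): Promise<TaskResult>;
  getDocumentHashes(token: string, transactionId: string): Promise<string[]>;
  getDocumentArtifacts(token: string, documentHash: string): Promise<unknown>;
}

async function convertPdf(
  client: DeepSearchClient,
  username: string,
  apiKey: string,
  file: Uint8Array,
  maxPolls = 5,
): Promise<unknown> {
  const token = await client.authenticate(username, apiKey);
  const taskId = await client.launchConversion(token, file);

  // Re-issue the status/wait call while the task is still pending.
  let result = await client.waitForTask(token, taskId);
  for (let polls = 1; result.status === "PENDING" && polls < maxPolls; polls++) {
    result = await client.waitForTask(token, taskId);
  }
  if (result.status !== "SUCCESS" || result.transactionId === undefined) {
    throw new Error(`conversion task ${taskId} did not complete: ${result.status}`);
  }

  const [firstHash] = await client.getDocumentHashes(token, result.transactionId);
  if (firstHash === undefined) {
    throw new Error("no document hashes returned");
  }
  return client.getDocumentArtifacts(token, firstHash);
}

// Fake in-memory client with placeholder values, for demonstration only.
let waitCalls = 0;
const fakeClient: DeepSearchClient = {
  authenticate: async () => "mock-token",
  launchConversion: async () => "mock-task-id",
  waitForTask: async (): Promise<TaskResult> => {
    waitCalls += 1;
    return waitCalls < 2
      ? { status: "PENDING" }
      : { status: "SUCCESS", transactionId: "mock-transaction-id" };
  },
  getDocumentHashes: async () => ["mock-document-hash"],
  getDocumentArtifacts: async (_token, documentHash) => ({ documentHash }),
};

const demo = convertPdf(fakeClient, "mock-user", "mock-api-key", new Uint8Array());
```

Passing the client in as a parameter keeps the orchestration testable without a running server; the real implementation may instead call the library functions directly.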
To start the Go mock Deep Search API server, run the following:
cd mock/go-deepseed/
go run deepseed-mock.go
Here are the curl commands for functional testing and for validating the API calls. They are also listed to clarify the operations in the src/app/api/conversion/route.ts code in this PR.
Wait for Task Completion
curl -X GET "http://localhost:8080/api/cps/public/v2/project/mockProject/celery_tasks/mock-task-id?wait=10" \
-H "Authorization: mock-token"
Get Document Hashes
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/transactions/mock-transaction-id \
-H "Authorization: mock-token"
Get Document Artifacts
curl -X GET http://localhost:8080/api/cps/public/v2/project/mockProject/data_indices/mockIndex/documents/mock-document-hash/artifacts \
-H "Authorization: mock-token"
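For comparison with the route.ts side, here is a small sketch of how the same three URLs could be assembled in TypeScript. The function names and the MOCK_API_BASE constant are hypothetical and only mirror the curl commands above; the real paths are built inside the DS4SD client library.

```typescript
// Assumed base for the local mock server, taken from the curl commands.
const MOCK_API_BASE = "http://localhost:8080/api/cps/public/v2";

// Hypothetical builders; each segment is URL-encoded before interpolation.
function taskStatusUrl(projKey: string, taskId: string, waitSeconds: number): string {
  return `${MOCK_API_BASE}/project/${encodeURIComponent(projKey)}` +
    `/celery_tasks/${encodeURIComponent(taskId)}?wait=${waitSeconds}`;
}

function documentHashesUrl(projKey: string, indexKey: string, transactionId: string): string {
  return `${MOCK_API_BASE}/project/${encodeURIComponent(projKey)}` +
    `/data_indices/${encodeURIComponent(indexKey)}` +
    `/documents/transactions/${encodeURIComponent(transactionId)}`;
}

function documentArtifactsUrl(projKey: string, indexKey: string, documentHash: string): string {
  return `${MOCK_API_BASE}/project/${encodeURIComponent(projKey)}` +
    `/data_indices/${encodeURIComponent(indexKey)}` +
    `/documents/${encodeURIComponent(documentHash)}/artifacts`;
}

// The same header the curl commands send.
const mockHeaders = { Authorization: "mock-token" };

const statusUrl = taskStatusUrl("mockProject", "mock-task-id", 10);
const hashesUrl = documentHashesUrl("mockProject", "mockIndex", "mock-transaction-id");
const artifactsUrl = documentArtifactsUrl("mockProject", "mockIndex", "mock-document-hash");
```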
Here is a screencap of a basic upload. Once we get a feel for the conversion times, we can consider integrating this directly into the knowledge submission form, so that an uploaded PDF doc is rendered to MD right in the upload step, which then kicks off the process of posting it to a repo and supplying the location and SHA of the docs.
https://github.com/instructlab/ui/assets/1711674/ed347bd2-2c2f-4c76-95a2-da0340237aeb