microsoft / genaiscript

Generative AI Scripting
https://microsoft.github.io/genaiscript/
MIT License

Pdf generated from web page isn't parsed correctly. #293

Open bzorn opened 5 months ago

bzorn commented 5 months ago

[Low priority]

I generated the attached PDF file from the GenAIScript transparency note page in the Edge browser. It renders correctly, but when I run the sample script "summarize-pdf.genai.js", the trace view shows no page contents for the input PDF file:

image

genaiscript-transparency-note.pdf

pelikhan commented 5 months ago

can you attach the pdf?

pelikhan commented 4 months ago

It seems that Edge rasterizes the entire page, so the PDF contains only images and there is no text to extract. One would have to run OCR on the PDF.
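
To make the failure mode concrete, here is a minimal sketch of how a script might detect it: given the per-page text returned by whatever PDF parser is in use, flag the pages that have no visible characters as OCR candidates. The helper name `pagesNeedingOcr` and the `minChars` threshold are hypothetical and are not part of GenAIScript or any PDF library.

```javascript
// Hypothetical helper, for illustration only.
// pageTexts: array of strings, one entry per page, as produced by a text extractor.
// Returns the 1-based numbers of pages whose text layer looks empty.
function pagesNeedingOcr(pageTexts, minChars = 1) {
  const result = [];
  pageTexts.forEach((text, index) => {
    // Ignore whitespace so pages containing only spaces or newlines count as empty.
    const visible = (text ?? "").replace(/\s+/g, "");
    if (visible.length < minChars) result.push(index + 1);
  });
  return result;
}

// Example: the first page has text, the other two do not.
console.log(pagesNeedingOcr(["Transparency note", "", "  \n "]));
```

For a fully rasterized PDF like the one attached here, every page would presumably be flagged.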

image

pelikhan commented 4 months ago

Add OCR library to support this scenario.

pelikhan commented 4 months ago

https://github.com/naptha/tesseract.js#tesseractjs

pelikhan commented 4 months ago

https://www.npmjs.com/package/pdf-to-img
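
A rough sketch of how the two linked packages could be combined: render each PDF page to an image, then pass each image to an OCR step. This is not the GenAIScript implementation. The function `ocrPages` is hypothetical, and the rendering and recognition steps are injected as parameters so the flow can be read without either package installed. The wiring shown in the trailing comment assumes the APIs described in the packages' READMEs (`pdf` from pdf-to-img, `createWorker` and `worker.recognize` from tesseract.js) and should be checked against the installed versions.

```javascript
// Hypothetical helper, for illustration only.
// pageImages: async iterable of page images (for example PNG buffers).
// recognize:  async function (image) => recognized text for that page.
async function ocrPages(pageImages, recognize) {
  const texts = [];
  for await (const image of pageImages) {
    // Pages are processed sequentially to keep memory use predictable.
    texts.push(await recognize(image));
  }
  return texts;
}

// Possible wiring (assumption: APIs as documented in each package's README):
//
//   import { pdf } from "pdf-to-img";
//   import { createWorker } from "tesseract.js";
//
//   const worker = await createWorker("eng");
//   const pages = await pdf("genaiscript-transparency-note.pdf", { scale: 3 });
//   const texts = await ocrPages(pages, async (image) => {
//     const { data } = await worker.recognize(image);
//     return data.text;
//   });
//   await worker.terminate();
```

Keeping the OCR step behind a function parameter also leaves room to swap in a different engine later without changing the page loop.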

pelikhan commented 4 months ago

Use gpt4-v as the OCR engine.