swissspidy / media-experiments

WordPress media experiments
GNU General Public License v2.0

OCR #403

Open swissspidy opened 7 months ago

swissspidy commented 7 months ago

PDF.js could be combined with Tesseract.js (https://tesseract.projectnaptha.com/) to run OCR on uploaded PDFs. It just needs a good use case.

Apparently the underlying Tesseract models haven't been updated in a while, so we may need to find alternatives.

swissspidy commented 3 months ago

PDF.js can actually extract text from PDFs already, so OCR might be more useful for images.
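
For reference, a rough sketch of what the PDF.js side could look like. PDF.js exposes `page.getTextContent()`, which resolves to an object whose `items` carry a `str` field and a `hasEOL` flag. The helper name `joinTextItems` and the `TextItemLike` type are made up for illustration, and joining items without extra separators is an assumption that may not suit every PDF.

```typescript
// Minimal shape of the entries PDF.js returns from page.getTextContent().
// Only the fields used here are declared; marked-content entries have no `str`.
interface TextItemLike {
  str?: string;
  hasEOL?: boolean;
}

// Hypothetical helper: joins text items into plain text,
// starting a new line wherever PDF.js flags an end of line.
function joinTextItems(items: TextItemLike[]): string {
  let text = '';
  for (const item of items) {
    if (typeof item.str !== 'string') {
      continue; // skip entries that carry no text
    }
    text += item.str;
    if (item.hasEOL) {
      text += '\n';
    }
  }
  return text.trim();
}
```

In practice the items would come from something like `(await page.getTextContent()).items` for each page of the loaded document.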

swissspidy commented 3 months ago

For images, it could be interesting to extract text during upload and then store it as metadata. That would be useful for searching the media library.
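
A minimal sketch of that flow, assuming the OCR engine is passed in as a function so it can be swapped out later. The meta key `mexp_ocr_text` and the names `Recognizer`, `normalizeOcrText`, and `buildOcrMeta` are hypothetical and not part of the plugin.

```typescript
// Hypothetical meta key; the plugin may store this differently.
const OCR_META_KEY = 'mexp_ocr_text';

// Any OCR backend: takes an image and resolves to the recognized raw text.
type Recognizer<T> = (image: T) => Promise<string>;

// Collapses runs of whitespace so the stored text is compact and easier to search.
function normalizeOcrText(raw: string): string {
  return raw.replace(/\s+/g, ' ').trim();
}

// Runs OCR on an image before upload and returns metadata to send along with it.
// Returns an empty object when no text was recognized.
async function buildOcrMeta<T>(
  image: T,
  recognize: Recognizer<T>
): Promise<Record<string, string>> {
  const text = normalizeOcrText(await recognize(image));
  return text ? { [OCR_META_KEY]: text } : {};
}
```

With Tesseract.js, the recognizer could be something along the lines of `async (image) => (await Tesseract.recognize(image, 'eng')).data.text`, but any other engine would fit the same signature.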

swissspidy commented 2 months ago

Related: #647