scambier / obsidian-text-extractor

A (companion) plugin to facilitate the extraction of text from images (OCR) and PDFs.
GNU General Public License v3.0
346 stars 19 forks

[Feature request] how to batch many .png files? #14

Open ccchan234 opened 1 year ago

ccchan234 commented 1 year ago

Is your feature request related to a problem? Please describe.

I have tons of files, and right now Text Extractor (TE) has to be run one file at a time.

Describe the solution you'd like

Select several files, right-click, and choose "extract to separate files" so that each one is extracted to its own file. (Some people may also want to extract all of them into one single file; in that case, please add each filename to that single document. Thanks.)

Describe alternatives you've considered

The same thing in the form of a command.

Additional context

ccchan234 commented 1 year ago

I have to say TE is very accurate for me, with screenshots taken of Pastest MCQ questions.

Thanks.

danielo515 commented 10 months ago

I also find it a bit confusing how to use this plugin. I was expecting some command to scan all the images and generate cache from them, or, as this issue states, a whole folder. Is this even possible?

scambier commented 10 months ago

Text Extractor was first and foremost built as a sort of "plugin's plugin". The idea was to provide a few basic helper functions for developers to build or expand their own plugin on top of it. Though to my knowledge, it's not used by anything other than Omnisearch.

I was expecting some command to scan all the images and generate cache from them

What is your use case?

danielo515 commented 10 months ago

My use case is to make all the text in my images available for search with Omnisearch. I want to process them all so I can leverage the cache on mobile.

scambier commented 10 months ago

OK, so you just need to enable image and PDF indexing in the Omnisearch settings on a desktop PC. Omnisearch will ask Text Extractor to get the text for all those files, and that will generate the cache 👍

danielo515 commented 10 months ago

OK, thanks. I think I have that enabled, but I will double-check.

paulpall commented 5 months ago

OK, so you just need to enable image and PDF indexing in the Omnisearch settings on a desktop PC. Omnisearch will ask Text Extractor to get the text for all those files, and that will generate the cache 👍

I'm not sure if I have missed anything but I can't seem to get this to work with images either. PDF content seems to have been indexed, but with images I have to manually right-click and extract text to clipboard for each image to show up in search.

I had a look at the logs and there were a lot of Text Extractor - OCR Worker timeout _imagename eval @ plugin:text-extractor:5068 messages... I'm on an ARM macOS laptop, perhaps there's some conflict stemming from that?

Perhaps a workaround could be a button in the settings to ignore timeouts and have it index all the images automatically? Even if it does take hours, as long as there's a way to keep an eye on the progress, I wouldn't mind.

scambier commented 5 months ago

@paulpall

Perhaps a workaround could be a button in the settings to ignore timeouts and have it index all the images automatically? Even if it does take hours

That's what already happens when Omnisearch uses Text Extractor, as long as this is enabled: [image]

But if you have many images that cause a timeout (maybe they're particularly large or too complex for the OCR library), the worker is effectively blocked for 120 seconds on a single image, then blocked again on the next image, and so on.

Eventually it will get through all of them though, as each image is only processed once, even when it times out.
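
To make the arithmetic concrete, here is a small simulation of the behaviour described above. It is only a sketch and not Text Extractor's actual code: the names `simulateRun`, `QueueItem` and `seen` are hypothetical, and the only figure taken from this thread is the 120-second timeout. It assumes images are handled one after another, that a slow image costs the full timeout, and that every attempted image is remembered whether or not it succeeded.

```typescript
// Hypothetical model of the queue behaviour; NOT the plugin's real implementation.
const TIMEOUT_SECONDS = 120; // per-image limit mentioned in this thread

// ocrSeconds: how long OCR would need for this image if it were never interrupted
type QueueItem = { path: string; ocrSeconds: number };

interface RunResult {
  elapsedSeconds: number; // total time the worker was busy during this run
  timedOut: string[];     // images that hit the timeout during this run
}

// Processes images one at a time. `seen` persists between runs, so an image
// is attempted only once, whether it succeeded or timed out.
function simulateRun(queue: QueueItem[], seen: Set<string>): RunResult {
  let elapsedSeconds = 0;
  const timedOut: string[] = [];
  for (const item of queue) {
    if (seen.has(item.path)) continue; // already attempted in an earlier run
    seen.add(item.path);
    if (item.ocrSeconds > TIMEOUT_SECONDS) {
      elapsedSeconds += TIMEOUT_SECONDS; // worker is blocked for the full timeout
      timedOut.push(item.path);
    } else {
      elapsedSeconds += item.ocrSeconds;
    }
  }
  return { elapsedSeconds, timedOut };
}

const seen = new Set<string>();
const queue: QueueItem[] = [
  { path: "a.png", ocrSeconds: 3 },
  { path: "huge.png", ocrSeconds: 500 },
  { path: "b.png", ocrSeconds: 5 },
];

const first = simulateRun(queue, seen);  // 3 + 120 + 5 = 128 seconds, huge.png times out
const second = simulateRun(queue, seen); // nothing left to do: 0 seconds
console.log(first, second);
```

In this model, a vault with many problematic images spends up to 120 seconds on each of them during the first pass, which would explain a long run of timeout messages in the log, but a second pass costs nothing because every path is already in `seen`.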