Consider integrating OCR support

laurent22 commented 6 years ago

It seems possible to add support for OCR content in Joplin via the Tesseract library: http://tesseract.projectnaptha.com

A first step would be to assess the feasibility of this project by integrating the lib in the desktop app and trying to OCR an image.

Is the image correctly OCRed?
Does it work with non-English text?
How slow/fast is it? Test with a very large image to be sure. It should not freeze the app while processing an image.

If everything works well, we can add the feature to the app.

Specification

On desktop app: Create a service that runs in the background and processes the resources that need to be OCRed.
When a document is OCRed: Append a block to the end of the note that contains the extracted plain text
When attaching a resource, ask what the user wants to do:
- Always OCR all files
- Never OCR any file
- Always OCR files with extension ".ext"
- Never OCR files with extension ".ext"
Can be changed in settings
Right-click resource, or note, to OCR content
Add ocr_status to the resource table. Can be: none, todo, processing, done
Add ocr_text to resource: must include detailed coordinates, and a way to get plain text back
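
The attach-time rules and the ocr_status values listed above could be modelled roughly as follows. This is only a sketch: the names OcrSettings, initialOcrStatus and canTransition are made up for illustration and are not part of Joplin, and the allowed transitions are an assumption, since the list only names the four states.

```typescript
// Hypothetical model of the proposed ocr_status column.
type OcrStatus = 'none' | 'todo' | 'processing' | 'done';
type OcrAction = 'always' | 'never';

// Hypothetical settings shape: one global default plus per-extension overrides.
interface OcrSettings {
  defaultAction: OcrAction;
  // Keys are lowercase extensions without the dot, e.g. "png".
  byExtension: Map<string, OcrAction>;
}

// Returns the lowercase extension without the dot, or '' if there is none.
function fileExtension(filename: string): string {
  const dot = filename.lastIndexOf('.');
  return dot <= 0 ? '' : filename.slice(dot + 1).toLowerCase();
}

// Decides the status a newly attached resource starts with.
function initialOcrStatus(filename: string, settings: OcrSettings): OcrStatus {
  const rule = settings.byExtension.get(fileExtension(filename)) ?? settings.defaultAction;
  return rule === 'always' ? 'todo' : 'none';
}

// Assumed transitions: a right-click "OCR content" moves none/done back to todo,
// and a failed run moves processing back to todo so it can be retried.
const nextStatuses: Record<OcrStatus, OcrStatus[]> = {
  none: ['todo'],
  todo: ['processing'],
  processing: ['done', 'todo'],
  done: ['todo'],
};

function canTransition(from: OcrStatus, to: OcrStatus): boolean {
  return nextStatuses[from].indexOf(to) !== -1;
}
```

A background service would then only pick up resources whose status is todo, which keeps the decision (made at attach time or by right-click) separate from the processing itself.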

Advantages of doing it that way:

Search engine just works - no need for special indexing of OCR content since it is inside the note directly
Will work with all clients (mobile, desktop, terminal)
When a note is exported to Markdown, it will include the OCR content

Format of OCR text block

<!-- autogen-ocr :resource.id -->
* * *

**:resource.title**

:resource.ocr_text
<!-- autogen-ocr :resource.id -->

For example, for a resource called "TrainTicket.png":

<!-- autogen-ocr 2ee4eec909734f7197654a9a040dfba7 -->
* * *

**TrainTicket.png**

From: London
To: Paris
Date: 01/12/2019
Time: 15:00
...etc.
<!-- autogen-ocr 2ee4eec909734f7197654a9a040dfba7 -->

The advantage of this format is that it will render nicely in the viewer, and it will still be clearly identified as OCR content, which means later we can identify these blocks, update them, remove them, etc.
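
As a rough illustration of how such blocks could be generated, found again and replaced, here is a sketch. It assumes that both the opening and the closing marker carry the resource ID; the function names (renderOcrBlock, removeOcrBlock, upsertOcrBlock) and the OcrResource shape are hypothetical and not existing Joplin code.

```typescript
// Hypothetical shape holding only the fields the block format needs.
interface OcrResource {
  id: string;
  title: string;
  ocrText: string;
}

// The marker comment that delimits a block on both sides (assumption: same text at start and end).
function ocrMarker(resourceId: string): string {
  return `<!-- autogen-ocr ${resourceId} -->`;
}

// Builds the block in the format described above.
function renderOcrBlock(r: OcrResource): string {
  const marker = ocrMarker(r.id);
  return [marker, '* * *', '', `**${r.title}**`, '', r.ocrText, marker].join('\n');
}

// Removes the block for one resource; returns the body unchanged if no complete block is found.
function removeOcrBlock(noteBody: string, resourceId: string): string {
  const marker = ocrMarker(resourceId);
  const start = noteBody.indexOf(marker);
  if (start < 0) return noteBody;
  const end = noteBody.indexOf(marker, start + marker.length);
  if (end < 0) return noteBody;
  return noteBody.slice(0, start) + noteBody.slice(end + marker.length);
}

// Replaces any existing block for the resource and appends the new one at the end of the note.
function upsertOcrBlock(noteBody: string, r: OcrResource): string {
  const stripped = removeOcrBlock(noteBody, r.id).replace(/\n+$/, '');
  const block = renderOcrBlock(r);
  return stripped === '' ? `${block}\n` : `${stripped}\n\n${block}\n`;
}
```

Because the markers are HTML comments, they stay invisible in the rendered note while still letting the app locate the block by resource ID.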

Later

Support PDF files - for example by converting each page to an image first, then passing it to Tesseract.
Make ocr_text searchable
Display search results directly on the document, i.e. if it's an image, highlight the parts of the image that contain the search text.
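
To make the idea of highlighting concrete, here is a sketch of what an ocr_text value with coordinates might look like, together with a way to get the plain text back and to find the regions matching a search term. The types (OcrWord, OcrLine, BoundingBox) and the x0/y0/x1/y1 layout are assumptions for illustration only, not a defined Joplin or Tesseract format.

```typescript
// Hypothetical coordinate format: pixel corners of a rectangle on the source image.
interface BoundingBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface OcrWord {
  text: string;
  bbox: BoundingBox;
}

interface OcrLine {
  words: OcrWord[];
}

// Recovers plain text from the structured result: words joined by spaces, lines by newlines.
function toPlainText(lines: OcrLine[]): string {
  return lines.map(line => line.words.map(w => w.text).join(' ')).join('\n');
}

// Returns the boxes of all words containing the query (case-insensitive, single-word queries only).
function matchingBoxes(lines: OcrLine[], query: string): BoundingBox[] {
  const needle = query.trim().toLowerCase();
  if (needle === '') return [];
  const boxes: BoundingBox[] = [];
  for (const line of lines) {
    for (const word of line.words) {
      if (word.text.toLowerCase().indexOf(needle) !== -1) boxes.push(word.bbox);
    }
  }
  return boxes;
}
```

A viewer could draw a translucent rectangle over each returned box; multi-word queries would need matching across neighbouring words, which this sketch leaves out.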

More considerations

Don't add the text directly to the note, that's going to be messy if there is a lot of text
Instead save it only to ocr_text
For searching, when building the notes_normalized table, append the content of any attached resource text, so that it's indexed by the search engine
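
A minimal sketch of that indexing step, under stated assumptions: resources are referenced from the note body with ":/" followed by a 32-character hex ID, NoteRow and ResourceRow are simplified hypothetical shapes, and lowercasing stands in for whatever normalization the real notes_normalized table applies.

```typescript
// Simplified, hypothetical row shapes; the real tables have many more columns.
interface NoteRow {
  id: string;
  title: string;
  body: string;
}

interface ResourceRow {
  id: string;
  ocr_text: string;
}

// Collects the distinct resource IDs linked from a note body (assumed link form ":/<32 hex chars>").
function linkedResourceIds(body: string): string[] {
  const ids = new Set<string>();
  const re = /:\/([0-9a-f]{32})/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) ids.add(m[1]);
  return Array.from(ids);
}

// Builds the text to index: title, body, then the OCR text of every attached resource.
function normalizedSearchText(note: NoteRow, resources: Map<string, ResourceRow>): string {
  const parts = [note.title, note.body];
  for (const id of linkedResourceIds(note.body)) {
    const res = resources.get(id);
    if (res && res.ocr_text) parts.push(res.ocr_text);
  }
  // Lowercasing is a placeholder for the real normalization step.
  return parts.join('\n').toLowerCase();
}
```

With this approach the note body itself stays untouched, and a search for text that only appears inside an attachment still finds the note.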

bmix commented 6 years ago

I would recommend against it, because adding major features like this requires a lot of development time that may be better spent on the core features of the app.

There are already capable applications implementing Tesseract on all major platforms:

gImageReader - can export to hOCR, which is a microformat for OCR export via HTML, see also git repository and obligatory Wikipedia entry. An importer would need to be written, since this HTML is very structured and contains data that is redundant for Joplin.
Then, while not OCR but PDF extraction, there are also:
- Apache's pdfbox, which is available on all major platforms and can extract to HTML (I think even XHTML!)
- and Poppler, also available on all major platforms; this exports to XML or HTML, I do not remember which. Again, just importers would need to be written.

(I use all three programs on Windows and they are very mature and powerful)

EDIT: Isn't the main feature of Evernote's OCR the recognition of handwriting? AFAIK Tesseract doesn't do that. It's still considered some magic wizardry, which may need performant systems and (patented?) algorithms. I wouldn't be surprised, if Evernote did such OCR on their servers in the cloud.

bernd-wechner commented 6 years ago

I concur with bmix. Not that OCR is a BAD idea, only that it seems a distraction when so many other things are calling and not a priority in the least given how many OCR options I have outside of Joplin. There are many features I'd prefer to see before this one!

laurent22 commented 6 years ago

If it can be easily added using the Tesseract.js library it would be a good addition. The point is to be able to search attached documents, PDFs in particular, in the app itself. So for this external tools can't really help.

Most likely Evernote do this on their server, yes, but maybe it's possible to get the desktop app to do the job in the background. It's only under consideration at this point and if it cannot be done in a reliable way it won't be added.

StrilGit commented 6 years ago

I would pay a lot to get handwriting recognition and OCR. For me, that is the most critical feature to make Joplin a real replacement for Evernote, OneNote, etc. I am sadly not able to add this...

I do not want to use a keyboard in meetings. Using a pencil is better accepted...

mkrauser commented 5 years ago

OCR is one of the Evernote features I enjoy the most! With it I can simply search scanned PDFs. In my case, handwriting recognition is not that important. I would love to see this in Joplin.

theredspoon commented 5 years ago

> If it can be easily added using the Tesseract.js library it would be a good addition. The point is to be able to search attached documents, PDFs in particular, in the app itself. So for this external tools can't really help.
>
> Most likely Evernote do this on their server, yes, but maybe it's possible to get the desktop app to do the job in the background. It's only under consideration at this point and if it cannot be done in a reliable way it won't be added.

After transitioning from Evernote to Joplin, I'm missing this feature the most. @laurent22 any suggestions on ways to proceed in evaluating whether Tesseract.js can do the work?

lumogas commented 5 years ago

For what it's worth, I would find it very useful. I think it's a very powerful feature of Evernote's search, and it would make the transition to Joplin a lot easier for me.

StoltHD commented 5 years ago

Annotation extraction from PDFs can be done with the pdf.js library; the ZotFile plugin for Zotero uses it... But please keep it more up to date than ZotFile...

It would be great to have OCR, or at least reading of OCR-enabled PDFs, in Joplin. Some people use Joplin or other notebooks for old historical research and add old newspapers, typewritten patent documents, census sheets (difficult to OCR, but can easily be annotated with comments)...

If you make it an add-on/plug-in, people can decide whether they want to enable it or not, and you could actually offer the plug-ins as a separate download, with a customizable path to a plug-in folder that Joplin reads, so that all plug-ins in that folder are selectable in the plug-in area of Joplin...

That way Joplin CORE will not be bloated for those who need a small footprint. (Same could be done with reports, make them add-ons, and make a simple howto so users can make their own reports if they want to...).

stale[bot] commented 5 years ago

Hey there, it looks like there has been no activity on this issue recently. Has the issue been fixed, or does it still require the community's attention? This issue may be closed if no further activity occurs. You may also label this issue as "backlog" and I will leave it open. Thank you for your contributions.

lumogas commented 5 years ago

I still think this feature is worth looking into. Any news?

steve28 commented 4 years ago

Oh please! If Joplin wants to be an Evernote replacement, this needs to be added. This was my main use of evernote - archiving receipts, printed manuals, snail mail that needs to be saved for reference, etc. I rely on this being searchable.

Shamp0o commented 4 years ago

FYI somebody made a script that monitors a folder and uploads new files to Joplin. It also does a Tesseract OCR scan of the files and attaches the text as a comment in the Markdown code of the note to make it searchable. Works quite well with printed text but is terrible for handwritten text in my experience. If you're willing to put some effort in, you have quite a few options to customize Tesseract to fit your needs.

philip-peterson commented 4 years ago

Found a promising handwriting recognition engine: https://github.com/githubharald/SimpleHTR

philip-peterson commented 4 years ago

Seems related to #582

dfwdraco76 commented 4 years ago

I'm abandoning OneNote since there is no Linux native client, and OCR of PDF, images, documents, etc. is very high on my list of requirements. I hope this is still under consideration! Thank you.

traffas commented 4 years ago

I just learned of Joplin and was really excited about moving away from Evernote until I learned OCR isn't yet implemented. Without the ability to quickly search through PDFs - mainly 10 years of business receipts for vendor name, account numbers and dollar amounts - Joplin isn't really useful to me. If someone was able to write a script to monitor a folder, could such a script be created to parse Joplin notes to see any that hadn't yet been scanned and run the OCR and just dump the plain text in the note below the picture or PDF? That approach seems like it would handle a large majority of OCR needs, but I don't know how feasible it would be.

llorracc commented 3 years ago

Lack of this feature is the only reason I haven’t converted from Evernote. I’d be happy enough if an import from EN just appended, after the PDF, the already-OCR’d text, or the results of pdf2text.

StoltHD commented 3 years ago

I had to stop using Joplin for my research because of the lack of this...

chongchonghe commented 3 years ago

> I had to stop using Joplin for my research because of the lack of this...

My workaround is to take conference notes with Notability, export them to images/PDFs, and insert them into Joplin.

nicolasdarmanthe commented 3 years ago

OCR search is a great feature of Evernote, and a major reason why I am reluctant to go all in on Joplin. I find it super useful in my medical studies, because I can screenshot anatomy diagrams, tables out of lecture slides, etc. and paste them directly into my notes. Given that I have 100s of notes (organized in notebooks and with tags), the search function is God to me.

Pavan-Bellam commented 3 years ago

I am working on this. Thank you!

StoltHD commented 3 years ago

> I am working on this. Thank you!

Hopefully you will work on both full-text search in PDFs and also annotation, and extraction of annotations, in PDFs (maybe using pdf.js or a similar library)?

Pavan-Bellam commented 3 years ago

> > I am working on this. Thank you!
>
> Hopefully you will work on both full-text search in PDFs and also annotation, and extraction of annotations, in PDFs (maybe using pdf.js or a similar library)?

Thank you for suggesting. I will try to implement it as well.

StoltHD commented 3 years ago

> I will try to implement it as well.

That would be a great feature :-)

darkcheftar commented 3 years ago

I want to work on this feature.

I will try my best to show some progress on it.

darkcheftar commented 3 years ago

I'm sorry for overpromising and underdelivering.

I'm stuck in the process of building a plugin.
I would appreciate it if anyone wants to help me.
I will try my best. So sorry for the inconvenience.

roman-r-m commented 3 years ago

@darkcheftar you may want to ask on the forum - more people will see it there.

darkcheftar commented 3 years ago

Sure, @roman-r-m. Thanks for the suggestion. I will surely try that.

darkcheftar commented 3 years ago

I'm pleased to say that @ylc395 has developed a Joplin plugin for the OCR feature. He notes that:

> This plugin is still in development stage. Everything may change, but some features are available now.

Please check it out.

I just learned about this from Daeraxa when I posted the issue on the Joplin forum.

Thank you.

robe070 commented 1 year ago

I am a long-time Evernote user. But current versions of Evernote have lost major functionality and are extremely bug-ridden. I'm still using the 'Legacy' version from 3 years ago. So 3 years of development of this new single-code platform of Evernote and it's still not working well. It's over for me. But it's mandatory for me that I have:

Searchable text in images (and imported notes are also searchable).
Searchable text in PDFs (and imported notes are also searchable).

I'm impressed with Joplin. I really like its UI and editor and that the import from Evernote 'just works' (with a glitch that was fixed very quickly - active development is a BIG tick in and of itself). I'd move to Joplin as soon as it has these 2 features added. And I'm multi-platform, so I'd be a paying customer to get the Cloud sync. OneNote is a competitor, but it does not have these features for imported notes. And of course it does not have proper Evernote-style tag support. For now I'm waiting. I hope it's soon. I don't want to pay the astronomical increase in Evernote fees.

robe070 commented 1 year ago

What work has been done on this? What barriers to implementation were discovered?

darkcheftar commented 1 year ago

Hey @robe070, Thanks for your patience.

I tried developing this long ago, but I was too immature then and could not figure some stuff out, like:

how to use the Tesseract library inside a plugin,
how to load trained data files into it,
where to store the files.

If you really need it, I will be honoured to give it one more try. Thanks for asking!

robe070 commented 1 year ago

That would be great, and not just for myself; I think it's an almost essential feature for everyone. You don't realize its significance until you've used it (in Evernote) and now it's missing.

robe070 commented 1 year ago

@darkcheftar Have you started on this? How is it going?

darkcheftar commented 1 year ago

Hey, I have started on this, working on it in my free time. But have you checked joplin-plugin-ocr?

timj commented 11 months ago

I am having the same trouble finding an app to replace Evernote. I am yet to find one that will extract text from a JPEG/PNG and I relied on this in Evernote. Furthermore, even though ENEX files have the OCR hints in them (the recoData field) Joplin doesn't have anywhere to put this information so after importing all my Evernote notes I can't find any of the content any more. (EagleFiler does import these hints from the ENEX files).

llorracc commented 11 months ago

I've been waiting so long for PDF for Joplin that I'm tempted to invest some time in figuring out whether Obsidian does what Joplin does but has PDF capabilities. I've got a strong preference for open source, but also a necessity for being able to digest and search PDFs by their textual content.

nicolasdarmanthe commented 11 months ago

Obsidian can definitely do it with the Omnisearch plugin. It can OCR images and PDFs too (I think Omnisearch relies on another plugin for this, but either way it's very straightforward to set up).

I finally switched from Evernote to Obsidian a few weeks ago and I've been loving every second of it, no regrets.

I actually used Joplin to convert my .enex to Markdown (with frontmatter metadata to preserve tags), as I found it did the best job of this out of a few other tools I tried.

Joplin is very neat software but unfortunately this important search capability has been missing for years. Luckily there's Obsidian now and the community there is amazing!

wh201906 commented 11 months ago

Is this fixed in #8975?

laurent22 commented 10 months ago

Done in bce94f17753b8b6de117e36d3a8b2022cd28f74c

laurent22 / joplin

Consider integrating OCR support #807
