laurent22 closed this issue 10 months ago
I would recommend against it, because adding major features like these requires a lot of development time that may be better spent on the core features of the app.
There are already capable applications implementing Tesseract on all major platforms:
gImageReader - can export to hOCR, which is a microformat for OCR export via HTML; see also the git repository and the obligatory Wikipedia entry. An importer would need to be written, since this HTML is very structured and contains data that is redundant for Joplin.
then, while not OCR, but PDF extraction, there is also
(I use all three programs on Windows and they are very mature and powerful)
EDIT: Isn't the main feature of Evernote's OCR the recognition of handwriting? AFAIK Tesseract doesn't do that. It's still considered some magic wizardry, which may need performant systems and (patented?) algorithms. I wouldn't be surprised, if Evernote did such OCR on their servers in the cloud.
I concur with bmix. Not that OCR is a BAD idea, only that it seems a distraction when so many other things are calling and not a priority in the least given how many OCR options I have outside of Joplin. There are many features I'd prefer to see before this one!
If it can be easily added using the Tesseract.js library it would be a good addition. The point is to be able to search attached documents, PDFs in particular, in the app itself. So for this external tools can't really help.
Most likely Evernote do this on their server, yes, but maybe it's possible to get the desktop app to do the job in the background. It's only under consideration at this point and if it cannot be done in a reliable way it won't be added.
I would pay a lot to get handwriting recognition and OCR. For me, that is the most critical feature to make Joplin a real replacement for Evernote, OneNote, etc. Sadly, I am not able to add this myself...
I do not want to use a keyboard in meetings. Using a pencil is better accepted...
OCR is one of the Evernote features I enjoy the most! With it I can simply search scanned PDFs. In my case, handwriting recognition is not that important. I would love to see this in Joplin.
After transitioning from Evernote to Joplin, I'm missing this feature the most. @laurent22 any suggestions on ways to proceed in evaluating whether Tesseract.js can do the work?
For what it's worth, I would find it very useful. I think it's a very powerful feature of Evernote's search, and it would make the transition to Joplin a lot easier for me.
Annotation extraction from PDFs can be done with the pdf.js library; the ZotFile plugin for Zotero uses this... But please keep it more up to date than ZotFile...
It would be great to have OCR, or at least reading of OCR-enabled PDFs, in Joplin. Some people use Joplin or other notebooks for historical research and add old newspapers, typewritten patent documents, census sheets (difficult to OCR, but easily annotated with comments)...
If you make it an add-on/plug-in, people can decide whether they want to enable it or not. You could offer the plug-ins as separate downloads and make the path to a plug-in folder configurable; Joplin would read that folder, and all plug-ins in it would be selectable in the plug-in area of Joplin...
That way the Joplin core will not be bloated for those who need a small footprint. (The same could be done with reports: make them add-ons, and write a simple how-to so users can make their own reports if they want to...)
Hey there, it looks like there has been no activity on this issue recently. Has the issue been fixed, or does it still require the community's attention? This issue may be closed if no further activity occurs. You may also label this issue as "backlog" and I will leave it open. Thank you for your contributions.
I still think this feature is worth looking into. Any news?
Oh please! If Joplin wants to be an Evernote replacement, this needs to be added. This was my main use of evernote - archiving receipts, printed manuals, snail mail that needs to be saved for reference, etc. I rely on this being searchable.
FYI somebody made a script that monitors a folder and uploads new files to Joplin. It also runs a Tesseract OCR scan of the files and attaches the text as a comment in the markdown code of the note to make it searchable. In my experience it works quite well with printed text but is terrible for handwritten text. If you're willing to put some effort in, you have quite a few options to customize Tesseract to fit your needs.
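To make the "text as a comment in the markdown" idea concrete, here is a minimal sketch of how such a script might attach OCR text to a note body. It is not taken from the script mentioned above: the `<!-- ocr:KEY ... -->` marker, and the names `has_ocr_block` and `upsert_ocr_block`, are assumptions made up for illustration. The OCR text itself would come from Tesseract; here it is just a string.

```python
import re


def _block_pattern(key):
    # Hypothetical marker format (an assumption, not a Joplin convention):
    #   <!-- ocr:KEY
    #   recognised text
    #   -->
    return re.compile(r"\n*<!-- ocr:" + re.escape(key) + r"\n.*?\n-->", re.DOTALL)


def has_ocr_block(body, key):
    """Return True if the note body already carries OCR text for this key."""
    return _block_pattern(key).search(body) is not None


def upsert_ocr_block(body, key, ocr_text):
    """Append the OCR text as an HTML comment, replacing any earlier block."""
    # "-->" inside the OCR text would end the comment early, so break it up.
    safe = ocr_text.strip().replace("-->", "-- >")
    block = f"<!-- ocr:{key}\n{safe}\n-->"
    # Drop a previous block for the same key so re-scans do not pile up.
    stripped = _block_pattern(key).sub("", body).rstrip("\n")
    return f"{stripped}\n\n{block}\n"
```

Because the text sits in an HTML comment it stays out of the rendered note but remains part of the note body, which is what makes it searchable; the key in the marker lets a later run find and replace the block.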
Found a promising handwriting recognition engine: https://github.com/githubharald/SimpleHTR
Seems related to #582
I'm abandoning OneNote since there is no Linux native client, and OCR of PDF, images, documents, etc. is very high on my list of requirements. I hope this is still under consideration! Thank you.
I just learned of Joplin and was really excited about moving away from Evernote until I learned OCR isn't yet implemented. Without the ability to quickly search through PDFs - mainly 10 years of business receipts for vendor name, account numbers and dollar amounts - Joplin isn't really useful to me. If someone was able to write a script to monitor a folder, could such a script be created to parse Joplin notes to see any that hadn't yet been scanned and run the OCR and just dump the plain text in the note below the picture or PDF? That approach seems like it would handle a large majority of OCR needs, but I don't know how feasible it would be.
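As a rough feasibility sketch for the question above: finding attachments that have not been scanned yet is mostly a matter of comparing the resource links in a note body with whatever marker the OCR step leaves behind. The sketch below assumes Joplin's `:/` followed by a 32-character hex ID link form, and a hypothetical `<!-- ocr:ID` marker written by the OCR step; the marker and the function name `unscanned_resources` are assumptions for illustration only.

```python
import re

# Resource links in a note body, e.g. ![receipt.png](:/0123...cdef)
RESOURCE_LINK = re.compile(r"\]\(:/([0-9a-f]{32})\)")
# Hypothetical marker left behind once a resource has been OCR'd (assumption).
OCR_MARKER = re.compile(r"<!-- ocr:([0-9a-f]{32})\b")


def unscanned_resources(body):
    """Return IDs of linked resources with no OCR marker, in document order."""
    scanned = set(OCR_MARKER.findall(body))
    seen = set()
    pending = []
    for resource_id in RESOURCE_LINK.findall(body):
        if resource_id not in scanned and resource_id not in seen:
            seen.add(resource_id)
            pending.append(resource_id)
    return pending
```

A script could run this over every note, OCR only the pending resources, and write the text plus marker back below the picture or PDF.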
Lack of this feature is the only reason I haven’t converted from Evernote. I’d be happy enough if an import from EN just appended, after the PDF, the already-OCR’d text, or the results of pdf2text.
I had to stop using Joplin for my research because of the lack of this...
My work-around is to take conference notes with Notability, export them to images/PDFs, and insert them to Joplin.
OCR search is a great feature of Evernote, and a major reason why I am reluctant to go all in on Joplin. I find it super useful in my medical studies, because I can screenshot anatomy diagrams, tables out of lecture slides, etc and paste them directly into my notes. Given that I have 100s of notes (organized in notebooks and with tags), the search function is God to me
I am working on this. Thank you!
Hopefully you are working not only on full-text search in PDFs, but also on annotating PDFs and extracting their annotations (maybe using pdf.js or a similar library)?
Thank you for the suggestion. I will try to implement it as well.
That would be a great feature :-)
I want to work on this feature.
I'm sorry for overpromising and underdelivering.
I'm stuck in the process of building a plugin.
I would appreciate it if anyone wants to help me.
I will try my best. So sorry for the inconvenience.
@darkcheftar you may want to ask on the forum - more people will see it there.
Sure, @roman-r-m, thanks for the suggestion. I will certainly try that.
I'm pleased to say that @ylc395 has developed a Joplin plugin for the OCR feature. He states:
This plugin is still in development stage. Everything may change, but some features are available now.
Please check it out.
I just learned about this from Daeraxa when I posted the issue on the Joplin forum.
Thank you.
I am a long-time Evernote user. But current versions of Evernote have lost major functionality and are extremely bug-ridden. I'm still using the 'Legacy' version from 3 years ago. So that is 3 years of development on this new single-code-platform Evernote, and it's still not working well. It's over for me. But it's mandatory for me that I have:
I'm impressed with Joplin. I really like its UI and editor, and that the import from Evernote 'just works' (with a glitch that was fixed very quickly - active development is a BIG tick in and of itself). I'd move to Joplin as soon as it has these 2 features added. And I'm multi-platform, so I'd be a paying customer to get the cloud sync. OneNote is a competitor, but it does not have these features for imported notes, and of course it does not have proper Evernote-style tag support. For now I'm waiting. I hope it's soon. I don't want to pay the astronomical increase in Evernote fees.
What work has been done on this? What barriers to implementation were discovered?
Hey @robe070, Thanks for your patience.
I tried developing this long ago, but I was too inexperienced then and could not figure some things out, like
If you really need it, very well. I will be honoured to give it one more try. Thanks for asking!
That would be great, and not just for myself; I think it's an almost essential feature for everyone. You don't realize its significance until you've used it (in Evernote) and now it's missing.
@darkcheftar Have you started on this? How is it going?
Hey, I have started on this and am working on it in my free time. But have you checked joplin-plugin-ocr?
I am having the same trouble finding an app to replace Evernote. I am yet to find one that will extract text from a JPEG/PNG and I relied on this in Evernote. Furthermore, even though ENEX files have the OCR hints in them (the recoData field) Joplin doesn't have anywhere to put this information so after importing all my Evernote notes I can't find any of the content any more. (EagleFiler does import these hints from the ENEX files).
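For anyone wanting to experiment with those hints, here is a small sketch of pulling searchable words out of Evernote's recognition data. It assumes the data is a `recoIndex` XML document in which each `<item>` region holds one or more `<t w="...">` candidates, with `w` as a confidence weight; that layout is stated from memory and should be checked against a real ENEX export. The function name `best_candidates` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def best_candidates(reco_xml):
    """Return the highest-weighted candidate text for each recognised region."""
    root = ET.fromstring(reco_xml)
    words = []
    for item in root.iter("item"):
        # Each <t> is one guess for the region; "w" is assumed to be its weight.
        candidates = [t for t in item.findall("t") if t.text]
        if not candidates:
            continue
        best = max(candidates, key=lambda t: int(t.get("w", "0")))
        words.append(best.text.strip())
    return words
```

An importer could join these words and store them alongside the resource, so the imported notes remain findable by the text in their images.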
I've been waiting so long for PDF for Joplin that I'm tempted to invest some time in figuring out whether Obsidian does what Joplin does but has PDF capabilities. I've got a strong preference for open source, but also a necessity for being able to digest and search PDFs by their textual content.
Obsidian can definitely do it with the Omnisearch plugin. It can OCR images and PDFs too (I think Omnisearch relies on another plugin for this, but either way it's very straightforward to set up).
I finally switched from Evernote to Obsidian a few weeks ago and I've been loving every second of it, no regrets.
I actually used Joplin to convert my .enex to markdown (with frontmatter metadata to preserve tags), as I found it did the best job of this out of a few other tools I tried.
Joplin is very neat software but unfortunately this important search capability has been missing for years. Luckily there's Obsidian now and the community there is amazing!
Is this fixed in #8975?
Done in bce94f17753b8b6de117e36d3a8b2022cd28f74c
It seems possible to add support for OCR content in Joplin via the Tesseract library: http://tesseract.projectnaptha.com
A first step would be to assess the feasibility of this project by integrating the lib in the desktop app and trying to OCR an image.
If everything works well, we can add the feature to the app.
Specification
Advantage of doing it that way:
Format of OCR text block
For example, for a resource called "TrainTicket.png":
The advantage of this format is that it will render nicely in the viewer, and it will still be clearly identified as OCR content, which means later we can identify these blocks, update them, remove them, etc.
Later
More considerations
notes_normalized table: append the content of any attached resource text, so that it's indexed by the search engine
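The indexing idea in the last point can be sketched as follows. This is only an illustration under assumptions: the normalisation shown (lower-casing, stripping accents, collapsing whitespace) is a stand-in and not necessarily what Joplin does for notes_normalized, and the names `normalize` and `normalized_note_text` are hypothetical.

```python
import unicodedata


def normalize(text):
    """Lower-case, strip accents and collapse whitespace (assumed normalisation)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (category "Mn") are what remains of accents after NFD.
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return " ".join(no_marks.lower().split())


def normalized_note_text(body, resource_texts):
    """Build the text to index: the note body followed by each resource's OCR text."""
    parts = [body] + [t for t in resource_texts if t.strip()]
    return normalize("\n".join(parts))
```

With this shape, a search for a word that only appears inside an attached image or PDF still matches the note, because the OCR text is part of the indexed row.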