lucasrla / remarks

Extract annotations (highlights and scribbles) from PDF, EPUB, and notebooks marked with reMarkable tablets. Export to Markdown, PDF, PNG, SVG
GNU General Public License v3.0

Handwritten annotations searchable in output PDF #13

Open akelai opened 3 years ago

akelai commented 3 years ago

(This is far from trivial, but I'm proposing it here as an enhancement because, to my knowledge, this is the project nearest to it.)

It would be very useful to have HWR texts (obtained in some way* from text handwritten on the reMarkable, either on blank pages or as annotations on a PDF) placed on each page of the output PDF as invisible (searchable) text, so that the search tool also covers the handwritten notes.

*The HWR texts used as input could be extracted by processing an email received from my@remarkable.com when using the HWR feature on the reMarkable, or perhaps by using the myscript.com service.

lucasrla commented 3 years ago

Hey @akelai, thanks for reaching out.

I have never used reMarkable's handwriting conversion, so it's hard for me to evaluate all the details now.

Do you plan to contribute PRs? I am open to reviewing and merging PRs related to handwriting conversion, but I guess it is best to break what you are proposing down into smaller pieces and start with the simplest (yet useful) part.

Thanks

akelai commented 3 years ago

I'm willing, but I'm not sure I'll be able to contribute PRs in a reasonably short amount of time. Feel free to close this issue for now, unless you agree it is worth keeping open for visibility so that someone else can jump in and help with PRs.

Broken down into smaller pieces, it would be:

  1. get the handwriting conversion
  2. use the hocrtransform.py approach to put the text of each page into the PDF, on the right page, as invisible text
  3. write the output PDF, obtaining a somewhat searchable PDF
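As an illustration of step 2, here is a minimal sketch of what "invisible text" means at the PDF level. It only builds the content-stream operators; it does not write a PDF and it is not how hocrtransform.py is implemented. The function names are hypothetical, and the font resource name `F1` is an assumption: a font with that name would have to exist in the page's resources.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(text: str, x: float, y: float,
                       font: str = "F1", size: float = 8) -> bytes:
    """Build content-stream operators placing `text` at (x, y) invisibly.

    `3 Tr` selects text rendering mode 3 (neither fill nor stroke): the
    glyphs are not painted, but the text can still be extracted and searched.
    """
    # Collapse newlines and repeated spaces: this sketch emits a single line.
    single_line = " ".join(text.split())
    escaped = escape_pdf_string(single_line)
    ops = (
        f"BT /{font} {size:.2f} Tf 3 Tr "
        f"{x:.2f} {y:.2f} Td ({escaped}) Tj ET"
    )
    # Latin-1 only here; other scripts need an embedded font and encoding.
    return ops.encode("latin-1")
```

In practice a PDF library would append these operators to the page's content stream; the sketch is only meant to show that the searchable layer is ordinary text drawn with rendering mode 3.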

Point 1 could be done using two different approaches:

The second approach could be to check an IMAP mailbox for a message from my@remarkable.com with a subject starting with Document from my reMarkable: (probably with something like this). The tool would take the text from that message for point 2 and add it to the pages of the input PDF, which would have to be obtained in other ways (e.g. the reMarkable companion desktop app, SSH, etc.).

One problem with this approach is that the email from reMarkable has an HTML body, though that can be parsed to extract the plain text. More importantly, pages with no converted text (e.g. because they are blank or contain a drawing) are skipped. That breaks the page-to-page association between the conversion and the input PDF: the "pages" of the conversion are separated by the separator • • •, so they are easy to identify, but a page without converted text is simply omitted rather than marked with, e.g., two consecutive separators.

This can be worked around by writing at least a small cross on every page on the reMarkable before calling "Convert to text and send". Each such page then becomes a page containing just an x, X, /, or \ (depending on how the HWR handles it), so the tool has all the pages for the association. This would be the price of having a "somewhat searchable handwritten PDF".
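A rough sketch of the parsing half of this approach, using only the Python standard library. It assumes the raw message has already been fetched (for example with `imaplib`) and that the separator appears in the body exactly as `• • •`; the real email may use different whitespace, so that constant is an assumption. The function names are hypothetical.

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser

SEPARATOR = "• • •"  # assumed to match the separator in the email body


class _TextExtractor(HTMLParser):
    """Collect text nodes, adding line breaks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "div"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Reduce an HTML fragment to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def pages_from_email(raw: bytes) -> list[str]:
    """Return one string per converted page found in the email's HTML body."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    if body is None:
        raise ValueError("no HTML body found")
    text = html_to_text(body.get_content())
    return [page.strip() for page in text.split(SEPARATOR)]
```

The returned list can only be matched to PDF pages by position, so it inherits the limitation described above: a page that was skipped in the conversion shifts every page after it.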

The first approach wouldn't have this "write an x on every page" requirement, but may be less practical because it requires a developer account on the MyScript service, whereas any reMarkable owner can freely use the handwriting recognition service from the device. [EDIT: on second thought, speaking of what's "practical", this approach is probably better, also thanks to the large number of free requests per month available with the MyScript service.]

lucasrla commented 3 years ago

Thank you for the update!

Re (1), do you know if reMarkable is doing their conversion on device? Or is it server-side? The support note states that WiFi and login are required, so I guess it happens on servers.

I will have a look at MyScript (they give away 2k requests / month for free) and reMarkable's own handwriting scheme some time next week.

I would be much happier if we could find a good handwriting recognition (pre-trained on English?!) model that is open source, but I am not sure if that exists. Maybe @meijieru/crnn.pytorch, maybe someone that competed in ICDAR competitions (e.g., 2019, 2021, etc).

Following your suggestion, I will leave this open for now. Hopefully other people will read and eventually chime in.

Thanks

akelai commented 3 years ago

reMarkable uses the MyScript.com service to do the conversion; they have a license.

Considering the number of free requests per month on MyScript, maybe from a practical point of view the best approach is actually the first bullet point in my post above.

I don't know whether there are open source alternatives that could provide useful output (and I'm not optimistic about this: handwriting recognition is hard); if not, they will probably come in the near future. So I'd say a "plugin" architecture could be better for this feature: one can use the HWR plugin that best fits the aim (e.g. MyScript, a CRNN model trained on the local machine, another service, etc.), and remarks just expects the plugin to return the plain text for each page and places it as invisible text in some corner of the corresponding page of the output PDF.
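To make the plugin idea concrete, here is one possible shape for such an interface, sketched in Python. Nothing here exists in remarks: `HWRBackend`, `register`, `recognize_document`, and the `echo` backend are all hypothetical names, and passing each page as raw bytes is an assumption about what a backend would receive.

```python
from typing import Callable, Dict, List, Protocol


class HWRBackend(Protocol):
    """Anything that turns one page of handwriting data into plain text."""

    def recognize_page(self, page_data: bytes) -> str: ...


# Registry mapping a backend name to a factory that creates the backend.
BACKENDS: Dict[str, Callable[[], HWRBackend]] = {}


def register(name: str):
    """Class decorator that adds a backend to the registry under `name`."""
    def decorator(factory):
        BACKENDS[name] = factory
        return factory
    return decorator


@register("echo")
class EchoBackend:
    """Placeholder backend: treats the page data as UTF-8 text."""

    def recognize_page(self, page_data: bytes) -> str:
        return page_data.decode("utf-8")


def recognize_document(pages: List[bytes], backend_name: str) -> List[str]:
    """Return exactly one string per input page, in page order."""
    backend = BACKENDS[backend_name]()
    return [backend.recognize_page(page) for page in pages]
```

Because the caller iterates over the pages itself, a blank page yields an empty string instead of disappearing, which keeps the text aligned with the pages of the output PDF.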

Maybe this one could be a good starting point, because it tries to circumvent the problem of the lack of training data: https://github.com/vinojjayasundara/textcaps

Side note: a nice project would be an AI engine that could be fine-tuned on the user's personal handwriting by asking them to copy some known texts, a bit like that project that imitated one's voice by asking the user to read some known sentences. Also, doing this on the reMarkable would make it possible to take advantage of the actual (vector) information saved by reMarkable, which is probably more informative than a raster image of the writing.

lucasrla commented 3 years ago

Oh, that's very informative. Didn't know reMarkable was relying on MyScript.

For the short term, I am currently leaning towards a simple wrapper (in Python) that calls MyScript's REST API. This no-UI example (in JavaScript) from their iinkJS repository seems like a great starting point.

I wouldn't call such a thing a plugin for now (remarks is way too simple!). But it would be similar in spirit to what we currently do with OCRmyPDF.

On a very high level:

  1. User makes their MyScript developer keys available to remarks (say, via a dot file)
  2. Whenever there is handwriting to be converted, remarks uses those keys to call the API (observing some kind of requests quota) and waits for the conversion to happen
  3. ...
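A minimal sketch of steps 1 and 2, with several assumptions stated up front. The dot-file layout (an INI section named `myscript`), the function names, and the quota class are hypothetical. The signing scheme (HMAC-SHA512 over the request body, keyed with the application key concatenated with the HMAC key) and the header names `applicationKey` and `hmac` reflect my understanding of MyScript's REST API and should be checked against their documentation. The sketch builds a request but does not send it, and the payload format is left to the caller.

```python
import hashlib
import hmac
import json
from configparser import ConfigParser

MONTHLY_QUOTA = 2000  # free requests per month mentioned in this thread


class QuotaExceeded(RuntimeError):
    """Raised when the request budget has been used up."""


class RequestBudget:
    """Counts requests against a fixed limit (no monthly reset here)."""

    def __init__(self, limit: int = MONTHLY_QUOTA, used: int = 0):
        self.limit = limit
        self.used = used

    def spend(self) -> None:
        if self.used >= self.limit:
            raise QuotaExceeded(f"{self.used}/{self.limit} requests used")
        self.used += 1


def parse_keys(dotfile_text: str) -> tuple[str, str]:
    """Read (application_key, hmac_key) from INI text (hypothetical layout)."""
    parser = ConfigParser()
    parser.read_string(dotfile_text)
    section = parser["myscript"]
    return section["application_key"], section["hmac_key"]


def sign(body: str, application_key: str, hmac_key: str) -> str:
    """HMAC-SHA512 of the body; the key scheme is an assumption, see above."""
    key = (application_key + hmac_key).encode("utf-8")
    return hmac.new(key, body.encode("utf-8"), hashlib.sha512).hexdigest()


def build_request(payload: dict, application_key: str, hmac_key: str,
                  budget: RequestBudget) -> dict:
    """Spend one request from the budget and return headers plus body."""
    budget.spend()
    body = json.dumps(payload)
    headers = {
        "Content-Type": "application/json",
        "applicationKey": application_key,  # assumed header name
        "hmac": sign(body, application_key, hmac_key),  # assumed header name
    }
    return {"headers": headers, "body": body}
```

A real wrapper would also persist the request counter between runs and reset it each month; both are left out to keep the sketch short.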

Are you aware of anyone in the reMarkable open source community working on anything similar to this?

This strikes me as something that someone must have worked on already (or at least discussed publicly). It would be great to learn what has already been tried (and eventually build on their efforts).

Thanks

PS: As for the open source handwriting models, despite being the ideal solution, they seem much trickier and I would put them on hold for the short term...

lucasrla commented 3 years ago

Oh, I somehow missed your link to @ddvk/rmapi-hwr! Sorry about that! I am not familiar with Go, but will definitely have a look at their code.

akelai commented 3 years ago

> Oh, I somehow missed your link to @ddvk/rmapi-hwr! Sorry about that! I am not familiar with Go, but will definitely have a look at their code.

My bad, or GitHub's CSS's: the link was nearly invisible on such a short word.

akelai commented 3 years ago

> For the short term, I am currently leaning towards a simple wrapper (in Python) that calls MyScript's REST API

That would be great.

> Are you aware of anyone in the reMarkable open source community working on anything similar to this?

Besides the one I already linked, there don't seem to be other attempts at integrating HWR into a reMarkable workflow at the moment, or if they exist, they are not easy to find.