marcus-crane / october

A simple GUI for retrieving Kobo highlights and syncing them with Readwise
https://october.utf9k.net
MIT License
171 stars 10 forks

Implement support for extracting metadata from OPF manifests #71

Open marcus-crane opened 1 year ago

marcus-crane commented 1 year ago

This will play a solid part in resolving https://github.com/marcus-crane/october/issues/46 but in general, we can get a lot of useful information (including proper cover art) by digging around in the underlying epubs if they still exist on the device by the time highlights are being uploaded.
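As a rough illustration of the kind of information an OPF package document exposes, here is a minimal Python sketch. It is not October's implementation: the function name `read_opf_metadata` and the sample document are made up for the example, and it assumes the OPF XML has already been read out of the epub as a string. It looks for the cover in two common places: a manifest item carrying the `cover-image` property (EPUB 3) and a `<meta name="cover">` entry pointing at a manifest id (a widespread EPUB 2 convention).

```python
import xml.etree.ElementTree as ET

# Standard namespaces used inside OPF package documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_opf_metadata(opf_xml):
    """Return title, creator and cover href from an OPF document (hypothetical helper)."""
    root = ET.fromstring(opf_xml)
    items = root.findall("opf:manifest/opf:item", NS)

    # EPUB 3: the cover is the manifest item with the "cover-image" property.
    cover_href = next(
        (i.get("href") for i in items
         if "cover-image" in (i.get("properties") or "").split()),
        None,
    )

    # EPUB 2 convention: <meta name="cover" content="<manifest id>"/>.
    if cover_href is None:
        meta = root.find("opf:metadata/opf:meta[@name='cover']", NS)
        if meta is not None:
            cover_href = next(
                (i.get("href") for i in items if i.get("id") == meta.get("content")),
                None,
            )

    return {
        "title": root.findtext("opf:metadata/dc:title", namespaces=NS),
        "creator": root.findtext("opf:metadata/dc:creator", namespaces=NS),
        "cover_href": cover_href,
    }


# Invented sample document, only for demonstration.
METADATA_SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:creator>Example Author</dc:creator>
    <meta name="cover" content="cover-img"/>
  </metadata>
  <manifest>
    <item id="cover-img" href="images/cover.jpg" media-type="image/jpeg"/>
    <item id="c1" href="part001.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
</package>"""

print(read_opf_metadata(METADATA_SAMPLE))
```

Note that the returned `href` is relative to the location of the OPF file inside the epub, so it still has to be resolved against that path before the image can be read.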

marcus-crane commented 1 year ago

Not much to look at yet but I've got a small program that parses epubs and tries to find the files that correlate with highlights.

From there, it should be possible to scan the affected chapters to determine how far into each chapter the highlights reside (#46) and come up with a number to represent that.

It probably won't be possible to actually determine a page number (the "page count" is relative, depending on font size and so on) but we just need some sort of numbering system to consistently order highlights, relative to the amount of content in an epub.

It's sort of like, we can't know ahead of time what will be highlighted, so we need to figure out an absolute position for each highlight and use that as an identifier. Any future highlights should then arrange themselves so they appear in the same order as they appear in the book.

It would be easier of course if we could just use the database but we have no way to know what each file actually corresponds to when it comes to book ordering.

As a human, I can infer that "part002" is probably chapter 2 but at a programmatic level, we have no way to determine where that file falls in the book (which is why we parse the epub, as the files have to be arranged in the same order as they would be while reading said book).
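The reading order mentioned here comes from the `<spine>` of the OPF package document, which lists manifest ids in the order they are meant to be read. A minimal Python sketch of recovering that order follows. The function name `spine_order` and the sample document are invented for the example, and it assumes the OPF XML has already been extracted from the epub as a string.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_order(opf_xml):
    """Return content file hrefs in reading order (hypothetical helper)."""
    root = ET.fromstring(opf_xml)

    # The manifest maps ids to files, in no particular order.
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }

    # The spine lists ids in reading order; resolve each one to its file.
    return [
        manifest[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.get("idref") in manifest
    ]


# Invented sample: the manifest is deliberately out of reading order.
SPINE_SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="c2" href="part002.xhtml" media-type="application/xhtml+xml"/>
    <item id="intro" href="intro.xhtml" media-type="application/xhtml+xml"/>
    <item id="c1" href="part001.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="intro"/>
    <itemref idref="c1"/>
    <itemref idref="c2"/>
  </spine>
</package>"""

print(spine_order(SPINE_SAMPLE))
```

The position of a file in this list is the "where does this file fall in the book" number that the file name alone cannot give us.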

In that case, the ordering number would represent how many files in (out of the total files) as well as how many paragraphs down and then characters indented the highlighted phrase is. We don't really care about when it ends (I dunno if you can technically have a highlight that is situated within a wider ranging highlight), just when it starts as that is enough to determine our ordering number.
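One way to picture that ordering number is as a tuple compared field by field: file position first, then paragraph, then character offset. The Python sketch below is only an illustration of that idea. The `HighlightStart` type, its field names and the sample values are all hypothetical and do not reflect how the Kobo database stores highlights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HighlightStart:
    """Where a highlight begins (hypothetical structure for illustration)."""
    file_index: int       # position of the content file in reading order
    paragraph_index: int  # how many paragraphs down within that file
    char_offset: int      # how many characters into that paragraph
    text: str


def ordering_key(h):
    # Tuples compare element by element, so an earlier file always wins,
    # then an earlier paragraph, then an earlier character offset.
    return (h.file_index, h.paragraph_index, h.char_offset)


# Invented highlights, listed in the order they might have been made.
highlights = [
    HighlightStart(3, 12, 40, "made first, but late in the book"),
    HighlightStart(1, 5, 0, "made second, early in the book"),
    HighlightStart(3, 12, 7, "made third, same paragraph as the first"),
]

ordered = sorted(highlights, key=ordering_key)
for h in ordered:
    print(ordering_key(h), h.text)
```

Because only the start position feeds into the key, a highlight added later slots into the right place without renumbering the existing ones.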

[Screenshot: CleanShot 2023-01-04 at 22 57 16]

KoboCowboy commented 1 year ago

Hi, my apologies for just opening an issue on highlights and their location as I now see that you have been hard at work trying to find a workaround. It does seem that the Kobo reader itself can order the highlights in terms of their correct order in the book - it will even insert highlights I have tagged with the .h1, .h2 syntax to create headings in Readwise so, not being a programmer in any way, it does suggest that the information is there somewhere.

marcus-crane commented 1 year ago

Hi @KoboCowboy,

Yeah, so the Kobo Reader itself knows where the highlights are because you're selecting a section of a book to highlight. Now, the problem is: As a program that isn't the Kobo, we have these references to parts of a book but we don't actually have any context of what they mean without the book itself.

For example, we might know that a given highlight was in Sentence 52, Paragraph 31 of Chapter 9. That might seem useful, in that we know the highlight is, say, earlier than a highlight in Chapter 10, but we need to reduce that down to something meaningful like "This takes place 58.323% of the way through the book", for example.
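To make the "percentage of the way through the book" idea concrete, here is a small Python sketch. It assumes we already know the text length of every content file in reading order and where in its file a highlight starts. The function name and the numbers are invented for the example, and counting characters is just one possible measure of "amount of content".

```python
def percent_through_book(file_lengths, file_index, offset_in_file):
    """Convert a (file, offset) position into a percentage of the whole book.

    file_lengths: character counts of each content file, in reading order.
    file_index: which file the highlight starts in (0-based).
    offset_in_file: how many characters into that file the highlight starts.
    """
    total = sum(file_lengths)
    if total == 0:
        raise ValueError("book has no content")

    # Everything in earlier files comes before the highlight.
    absolute = sum(file_lengths[:file_index]) + offset_in_file
    return 100.0 * absolute / total


# Invented book: three files of 1000, 3000 and 6000 characters.
lengths = [1000, 3000, 6000]

# A highlight starting 1500 characters into the third file.
print(percent_through_book(lengths, 2, 1500))  # prints 55.0
```

Without the book itself, the per-file lengths are unknown, which is why a position inside a single chapter cannot be turned into a book-wide number on its own.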

Anyway, I've started work on a rework of October that parses the underlying books so we can figure out the precise ordering of the highlights. I've got a very small prototype that I can start to build from.

I just got back from a 2-week vacation a couple of hours ago, so nothing has progressed this month while I've been away, but I'll be getting back to looking at this shortly.

KoboCowboy commented 1 year ago

Hi Marcus, thanks so much for getting back to me and for your helpful explanation, makes a lot of sense. If you would like me to try anything or provide any other information that can help just let me know.