Identity of publications, replacing a publication

llemeurfr commented 6 years ago

In readium/r2-testapp-swift#90, we try to get a user-friendly message when a document from an OPDS feed is added twice. But we can add the same publication twice if it is imported from a .lcpl license. And we can sometimes do the same (there's a bug there) when we import an unprotected .epub file.

So we need to define in general: 1/ what makes the "identity" of a publication 2/ what shall the app do when we try to import the "same" publication twice.

Note that in the iOS app, the identity of a publication is defined by its file name, which is reaaaally naive. Note also that publications may be corrected, therefore a user should be able to overwrite a previous version.

JayPanoz commented 6 years ago

> Note that in the iOS app, the identity of a publication is defined by its file name, which is reaaaally naive.

As far as I can tell, at least one iElephant in the iRoom does the same here. When debugging files, because of the cache, authors will often rename the file so that the iSoftware-for-ebooks will treat it as another publication. Don’t know if this is desirable or not, but it’s at least worth mentioning, given ebook debugging is the hugest pita I know of.

Note that this strategy works with pretty much all other ebook apps you can test so they’re probably naive too.

> Note also that publications may be corrected, therefore a user should be able to overwrite a previous version.

Something tells me this is why such iMetadata exists.

HadrienGardeur commented 6 years ago

Well, to identify each publication we do have an identifier in EPUB (but nothing in CBZ).

That said, without a database to handle the bookshelf, it might be difficult to do anything better than filename.

llemeurfr commented 6 years ago

We should be able to a/ import the file with a temporary name (e.g. *-dwnld) b/ parse it, extract the unique identifier and the last modified timestamp c/ rename the file accordingly.

E.g. a publication whose .opf contains

<dc:identifier id="id">http://www.gutenberg.org/ebooks/25545</dc:identifier>
<meta property="dcterms:modified">2010-02-17T04:39:13Z</meta>

... could be renamed to something like http---www.gutenberg.org-ebooks-25545-2010-02-17T04-39-13Z.epub.

This would make file names unique with a very good probability, and different versions would be allowed to coexist (a solution for overwriting an old version would still have to be developed at the app level).

Wouldn't it be a correct solution?
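To make the renaming step concrete, here is a minimal Python sketch of it. The function name `storage_name`, the fallback to the first dc:identifier, and the rule "replace every character outside letters, digits, dot, underscore and hyphen with a hyphen" are assumptions for illustration; the thread only gives the example file name.

```python
import re
import xml.etree.ElementTree as ET

# Standard OPF and Dublin Core namespaces used in EPUB package documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def storage_name(opf_xml, extension=".epub"):
    """Build a file name from the unique identifier and dcterms:modified (hypothetical helper)."""
    root = ET.fromstring(opf_xml)
    wanted_id = root.get("unique-identifier")
    identifiers = root.findall(".//dc:identifier", NS)
    # Prefer the identifier referenced by package@unique-identifier.
    primary = next((el for el in identifiers if el.get("id") == wanted_id), None)
    if primary is None and identifiers:
        primary = identifiers[0]  # assumption: fall back to the first one
    identifier = (primary.text or "").strip() if primary is not None else ""
    if not identifier:
        raise ValueError("no dc:identifier found in the package document")
    modified = next(
        (
            (el.text or "").strip()
            for el in root.findall(".//opf:meta", NS)
            if el.get("property") == "dcterms:modified"
        ),
        "",
    )
    stem = "-".join(part for part in (identifier, modified) if part)
    # Replace anything that is unsafe in a file name with a hyphen.
    return re.sub(r"[^A-Za-z0-9._-]", "-", stem) + extension


SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="id" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">http://www.gutenberg.org/ebooks/25545</dc:identifier>
    <meta property="dcterms:modified">2010-02-17T04:39:13Z</meta>
  </metadata>
</package>"""

print(storage_name(SAMPLE_OPF))
# http---www.gutenberg.org-ebooks-25545-2010-02-17T04-39-13Z.epub
```

Two different identifiers can collapse to the same sanitized name (for instance `a/b` and `a:b`), which is why uniqueness holds only "with a very good probability".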

llemeurfr commented 6 years ago

@HadrienGardeur I agree that for file formats which have no unique identification solution, like CBZ, we can't do much.

HadrienGardeur commented 6 years ago

@llemeurfr I would recommend adopting SQLite in the app and using auto-incremented IDs if you need to name assets.

What you're describing is IMO unnecessarily complex for handling our assets (covers + EPUB files) and won't scale anyway once a user has plenty of books in the app.

HadrienGardeur commented 6 years ago

Also, this feels like an issue that doesn't belong on this repo. It should probably be moved to the iOS or Android repos (the Kotlin test app already has a SQLite DB).

llemeurfr commented 6 years ago

@HadrienGardeur, it is not the issue that doesn't belong in this repo, it is the proposed solution. Sure, keeping the catalog in a DB is part of a solution, but it does nothing to avoid filename clashes in the storage folder.

llemeurfr commented 6 years ago

For closing the issue, we just need to check consensus on:
a/ the identity of an EPUB file is crafted from its unique identifier plus last modified datetime, something we find in the EPUB best practices (maybe even in the spec).
b/ a new version of a publication has the same id but an updated last modified datetime, therefore the app is able to ask the user if the new version will replace the older one or live side-by-side.
c/ if the publication is "the same", the app is able to ask the user if the new file should replace the older one (this can solve an issue with a badly formed file that one wants to overwrite).
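The three cases can be sketched as a small classification step run at import time. This is only an illustration in Python: the names `Identity`, `ImportCase` and `classify` are hypothetical, and the last modified value is compared as an opaque string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class ImportCase(Enum):
    NEW = "new publication"
    DIFFERENT_VERSION = "same id, different last modified datetime"
    SAME = "same id, same last modified datetime"


@dataclass(frozen=True)
class Identity:
    identifier: str
    modified: Optional[str]  # dcterms:modified as found in the file, if any


def classify(incoming: Identity, shelf: Iterable[Identity]) -> ImportCase:
    """Compare an incoming publication with what is already on the bookshelf."""
    same_id = [item for item in shelf if item.identifier == incoming.identifier]
    if not same_id:
        return ImportCase.NEW
    if any(item.modified == incoming.modified for item in same_id):
        return ImportCase.SAME  # case c/: offer to overwrite
    return ImportCase.DIFFERENT_VERSION  # case b/: replace or keep side-by-side


shelf = [Identity("http://www.gutenberg.org/ebooks/25545", "2010-02-17T04:39:13Z")]
print(classify(Identity("http://www.gutenberg.org/ebooks/25545", "2012-01-01T00:00:00Z"), shelf))
# ImportCase.DIFFERENT_VERSION
```

The app would then map each case to a prompt; what the prompt says, and whether there is one at all, is a UI decision outside this sketch.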

danielweck commented 6 years ago

For what it's worth (just as a point of comparison), the Readium "1" Chrome App UX is:
1) user imports (in the library / bookshelf) a publication with iD="xxx" (given by OPF package dc:identifier metadata)
2) user imports another publication with the same iD => application creates a new entry for the publication (duplicate). This is the default behaviour.
3) advanced user knows how to activate a special "replace by default" mode (no UI, must use web console CLI to perform the switch) => application will overwrite existing entries with the same iD. Once again, no UI user prompt at each import (e.g. okay/cancel dialog), just a persistent choice of default behaviour vs. special opt-in one.

Note that the original publication filename is irrelevant, and not preserved in the internal bookshelf database.
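As a rough model of that behaviour (not the Chrome App's actual code), the bookshelf can be pictured as a map from an internal key to the dc:identifier, with a persistent switch deciding between duplicating and overwriting. The class and attribute names below are hypothetical, and reusing the existing key on overwrite is an assumption.

```python
import uuid


class Bookshelf:
    """Toy model: entries are keyed by an internal UUID, never by file name."""

    def __init__(self, replace_by_default=False):
        # Persistent choice of behaviour; no prompt at import time.
        self.replace_by_default = replace_by_default
        self.entries = {}  # internal UUID -> dc:identifier

    def import_publication(self, identifier):
        if self.replace_by_default:
            for key, existing in self.entries.items():
                if existing == identifier:
                    # Assumption: the existing entry is overwritten in place.
                    return key
        key = str(uuid.uuid4())
        self.entries[key] = identifier  # default: a new (duplicate) entry
        return key


default_shelf = Bookshelf()
default_shelf.import_publication("xxx")
default_shelf.import_publication("xxx")
print(len(default_shelf.entries))  # 2

replacing_shelf = Bookshelf(replace_by_default=True)
replacing_shelf.import_publication("xxx")
replacing_shelf.import_publication("xxx")
print(len(replacing_shelf.entries))  # 1
```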

llemeurfr commented 6 years ago

How is the filename generated in R1?

danielweck commented 6 years ago

Filename? Behind the scenes, there is an auto-generated UUID for each publication imported into the Readium Chrome App's bookshelf, if I remember correctly. That is basically the root of the filesystem entry for an unzipped EPUB (looks like /PATH/TO/PRIVATE/FILESYSTEM/SPACE/{UUID}/META-INF/container.xml (etc.)).

I believe the Readium2 Electron desktop app has a similar design for storing its data on the "private" filesystem (reserved for the application), except EPUB files are not exploded / unzipped (something like /PATH/TO/PRIVATE/FILESYSTEM/SPACE/{UUID}/book.epub).

llemeurfr commented 6 years ago

@dweck But if it is auto-generated, this UUID can't be the same when the "same" publication is re-imported, can it?

danielweck commented 6 years ago

In the current implementation, the filename is totally ignored / insignificant. The dc:identifier OPF metadata is used to detect publications with the same "identity", and by default "identical" publications are duplicated in the bookshelf, each with their own UUID under the hood (consequently: different persistent / saved reading location, bookmarks, etc.).

llemeurfr commented 5 years ago

It appears that this issue is still open: we didn't conclude and therefore didn't spec a solution.

Proposals:

at the level of the RWPM, the identifier becomes a MUST (it is a SHOULD currently).
minting a unique identifier is the work of a parser. For EPUB, this is easy, as the dc:identifier is a MUST. For packaged audiobooks and divina, as they contain a RWPM, the identifier becomes a MUST so no problem either. For CBZ (and maybe other formats later), there is no unique id so we must find a heuristic.

aferditamuriqi commented 5 years ago

Kotlin:

we use the publication.metadata.identifier to determine if a publication is unique
when importing a publication that already is in the library that is what we use to distinguish if it could be a duplicate
for CBZ, since we have no identifier, we use a generated random UUID (therefore a CBZ can actually never be considered a duplicate since the UUID is generated)
audiobooks and divina no issue, we have the identifier

Swift:

same as Kotlin, with the exception of CBZ, which currently is still using the file name but will be changed to match Kotlin. Using the file name could cause issues when a completely different CBZ with an existing filename is imported or deleted.
not to forget PDF: currently, if there is no identifier, then the rootPath to the file is used.

llemeurfr commented 5 years ago

Note: for formats with no defined identifier (like CBZ) we may define a publication identifier via a UUID at the parser level, as done in Kotlin today. We may also try to find more subtle heuristics, like a hash on the concatenation of all file names found in the archive.

mickael-menu-mantano commented 5 years ago

We could use an MD5 checksum of the file for CBZ. This is pretty safe, handles renaming and filename conflicts, and doesn't need to store anything in the database.
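A minimal Python sketch of such a whole-file checksum, assuming the file is read in chunks so that a large CBZ is never loaded into memory at once. The function name `file_md5` and the 1 MiB chunk size are illustrative choices, not something taken from the test apps.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of the file at `path`, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because only the bytes are hashed, renaming the file does not change the result, and two different files that happen to share a name get different values.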

HadrienGardeur commented 5 years ago

A checksum is a much better approach than a random UUID (especially one that is not consistent between sessions and devices) but we must be careful about the cost. For a single publication it's fine, but when you need to calculate this across a large number of publications, this can be significant on lower-end devices.

aferditamuriqi commented 5 years ago

For CBZ, I did a quick test to see if I can use a checksum, and it actually works well. What I tested is creating a checksum on the file itself, not necessarily on all the files in the archive. If the same archive is created again, though, it won't work. I think this is good enough though, and for sure better than a UUID, since this way if the file is the same it can be recognized as already existing.

llemeurfr commented 5 years ago

Therefore, the proposal becomes:

the RWPM identifier becomes a MUST (it is a SHOULD currently).
minting a RWPM identifier is the work of a parser.
a/ EPUB: the RWPM identifier is the dc:identifier (required).
b/ Audiobooks and Divina: the RWPM identifier (now required) is already there.
c/ CBZ and any other format lacking an identifier: the RWPM identifier is an MD5 checksum calculated on the container file at the time of import.

Note: the date of last modification of the publication is sometimes used as a versioning mechanism by the publisher. It can also be used by the UA to show a more detailed message when a publication is imported with the same id as an already stored publication (e.g. "An older publication is already present on the bookshelf. Do you want to replace it?")

llemeurfr commented 4 years ago

Status report 13/11: implementations are not what we talked about before.

On Android, at import time, for CBZ, an MD5 is computed on the overall package file.

On Desktop, we scan the zip directory, quickly aggregate the CRC32 of each entry (dir item), create a SHA1 digest, store this in the DB as a text field. Same goes with LCP from the publication content after the license is injected inside the zip.
Cause: many people use a bunch of test EPUB files with the same id.
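The Desktop approach can be approximated in Python as follows. This is a sketch of the idea, not the Desktop app's code: mixing the entry names into the digest and sorting the entries by name are assumptions added here so that the result does not depend on entry order.

```python
import hashlib
import zipfile


def zip_directory_digest(source):
    """SHA-1 over the CRC32 values listed in the zip central directory.

    `source` is a path or a binary file-like object. No entry is decompressed,
    which keeps the cost low even for large archives.
    """
    sha1 = hashlib.sha1()
    with zipfile.ZipFile(source) as archive:
        for info in sorted(archive.infolist(), key=lambda item: item.filename):
            if info.is_dir():
                continue  # directory entries carry no content
            sha1.update(info.filename.encode("utf-8"))
            sha1.update(info.CRC.to_bytes(4, "big"))
    return sha1.hexdigest()
```

Unlike a checksum of the whole container file, this value stays the same when the same content is zipped again with a different compression level or timestamps, because it depends only on the names and CRC32 values of the entries.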

On iOS, sometimes we don't get the primary id but another id; we'll have to review the parsing.

It's an "identity" more than an "identifier" (as this term is overloaded). The primary id is a URI, but this "identity" isn't; it is complementary to the id (if it exists) and only used for detecting duplicates. It is app-level info, therefore not represented in the RWPM model.

readium / architecture

Identity of publications, replacing a publication #68