shermp / Kobo-UNCaGED

UNCaGED, for Kobo devices
GNU Affero General Public License v3.0

Musings around book addition #22

Open shermp opened 5 years ago

shermp commented 5 years ago

KU currently follows the conservative Calibre approach of letting Nickel import books into the DB and then modifying the metadata afterwards. This approach works, but it is time-consuming and a battery hog.

I'm currently mulling over the idea of creating the book record(s) myself, bypassing Nickel's "Importing Content" stage. This would have the big advantage of doing everything in one step: it removes a partition mount/unmount cycle, the need to use FBInk to try to detect the end of the import process, etc.

The main downside would be the potential to "stuff it up", and it would probably introduce more firmware compatibility constraints.

@NiLuJe and @geek1011, do you have any comments and/or potential concerns over taking this approach? I would probably add it as an alternative path, rather than a replacement.

NiLuJe commented 5 years ago

There's rather a lot of stuff happening behind the scenes during the import process, AFAICT (you can peek at it via the SQL debug logs). So, why not, but that'd be a rather imposing task, with the risk of severely screwing things up if it goes wrong ;).

NiLuJe commented 5 years ago

I also have no idea how to get Nickel to actually show the new content without a USBMS session (or an sdcard scan)...

shermp commented 5 years ago

I'd still do everything in USBMS mode, so that hopefully Nickel will sync with the DB when it leaves USBMS mode.

I did a quick sqldiff on a DB pre-sideload and post-sideload, to see what Nickel was creating. It seemed to add two records to the content table, one to the AnalyticsEvents table, and one to the volume_shortcovers table.

I would definitely have to investigate further what Nickel does SQL-wise, to see how difficult it is.

Thankfully, I now have a spare H2O to test with, without messing up my main reading device :)

EDIT: And the AnalyticsEvents record appears to be related to the USBMS session itself, so I may not need to worry about that.

NiLuJe commented 5 years ago

Oh, right ;).

You should be able to forget the Analytics stuff, obviously ;).

But what's mainly worrying me is the crapload of stuff done to the content table to handle chapters or whatever ;).

(Which I thought might have been KePub only, but recent Dropbox experiments showed that it happens with ePubs, too).

shermp commented 5 years ago

When you first sideload a book, Nickel only seems to add one "chapter" record, at least for my test book. I suspect the rest might be added upon first opening, although I would have to test more books to be sure.

EDIT: My test book added a ....#(0) record.

shermp commented 5 years ago

Aww crap. Tested with a different book, and yeah, Nickel does add the chapters to the content table :(

EDIT: And it's not as simple as Nickel adding all the spine elements from the opf as-is either :(
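
For illustration, a rough Go sketch of listing the chapter rows Nickel created for one book. It assumes the github.com/mattn/go-sqlite3 driver and that chapter rows in content reference their parent book through a BookID column; both are assumptions about KoboReader.sqlite, not something verified here, and the example ContentID is made up:

// Rough sketch: list the rows Nickel created for one sideloaded book.
// BookID linking chapter rows to their parent book is an assumption.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "/mnt/onboard/.kobo/KoboReader.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Illustrative ContentID of the parent book.
	const bookID = "file:///mnt/onboard/example.epub"

	rows, err := db.Query(
		`SELECT ContentID, ContentType, COALESCE(Title, '') FROM content WHERE BookID = ?`,
		bookID,
	)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, ctype, title string
		if err := rows.Scan(&id, &ctype, &title); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("type=%s  %s  (%s)\n", ctype, title, id)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}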

pgaskin commented 5 years ago

Yes, I've been looking at this every now and then, and have successfully imported a few kepubs manually (I haven't tried with plain epubs or any other format, and my kepubs are all manually fixed to be well-formed). I haven't written an automated tool for this, and if I do, I'll probably look into how libnickel does it. Firmware compatibility isn't too bad, though.

@davidfor might have some comments about this.

shermp commented 5 years ago

Hmm... Looks like it might be using the ncx TOC file to generate the records. With a (well-formed) test ebook, that's what matches what Nickel added to the DB.
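
If that theory holds, the entries could be pulled straight out of the NCX. Here's a minimal Go sketch using encoding/xml, assuming the usual navMap/navPoint structure; reading toc.ncx from the working directory rather than out of the EPUB zip is a simplification:

// Rough sketch: flatten the navMap of an NCX file into (label, target) pairs.
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"os"
)

type navContent struct {
	Src string `xml:"src,attr"`
}

type navPoint struct {
	Label   string     `xml:"navLabel>text"`
	Content navContent `xml:"content"`
	Points  []navPoint `xml:"navPoint"` // nested chapters
}

type ncx struct {
	NavPoints []navPoint `xml:"navMap>navPoint"`
}

// flatten walks nested navPoints depth-first into a single list.
func flatten(points []navPoint, out *[]navPoint) {
	for _, p := range points {
		*out = append(*out, p)
		flatten(p.Points, out)
	}
}

func main() {
	f, err := os.Open("toc.ncx") // in practice this comes out of the EPUB zip
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var doc ncx
	if err := xml.NewDecoder(f).Decode(&doc); err != nil {
		log.Fatal(err)
	}

	var flat []navPoint
	flatten(doc.NavPoints, &flat)
	for i, p := range flat {
		fmt.Printf("%d: %q -> %s\n", i, p.Label, p.Content.Src)
	}
}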

davidfor commented 5 years ago

For each book, there is a row added to the content table for the book itself. Then there are rows for the ToC entries or internal files, depending on the format.

For epubs and PDFs, each entry in the ToC gets a row in the content and content_shortcover tables.

For kepubs, there is a row for each spine entry in both content and content_shortcover. Then there is another row in content for each ToC entry.

I've looked at this a few times, and just don't think it is worth the hassle. And even if you do it, it is going to take time and use battery. To do this, you need to extract the appropriate file from the book, process it, and insert rows into the database. Nickel can probably do it more efficiently, as Kobo uses the Adobe RMSDK to handle epubs and PDFs; kepubs must be their own code.

I've looked at doing this from calibre when sending a book, and I just don't think the advantage is worth the effort or risk. I can only see two advantages. One is avoiding the import, but that just changes when the work is done. The other is being able to set the series info during the send. I don't think either is worth it.

Having said that, I do have code for updating the epub ToC when replacing a book. When I have time, I'll be cleaning that up and adding kepub support. When I do that, it will be added to my Kobo Utilities plugin, not the driver.

shermp commented 5 years ago

Thanks for the input @davidfor

Regarding the battery, it's the whole sequence of polling the screen over and over to try to detect that the import has finished, then re-entering USBMS a second time, remounting the partition, updating the DB, unmounting it again, etc. that would be great to avoid.

Also, time. The above rigmarole takes quite a while to complete. I would love it if I could decrease that time.

To be honest though, it's probably the WiFi that's the biggest battery hog. Not a lot that can be done about that.

pgaskin commented 5 years ago

Oh, and just so you know, the fastest time I've been able to scan an EPUB file in Go on the Kobo has been around 1.3s each (using goroutines for concurrency, not validating paths, and reading the zip in memory). You might be able to get it down to 1s with a faster XML parser, but not much faster than that. This includes the time it takes to write a batched insert to the db.
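
For illustration, a rough sketch of that in-memory approach, not necessarily how the timing above was measured: each EPUB is read fully into RAM, opened as a zip from a bytes.Reader, and its OPF decoded, with one goroutine per file. Taking the first .opf instead of resolving META-INF/container.xml, and the opfMeta fields, are simplifications:

// Rough sketch: scan EPUBs concurrently, reading each zip into memory.
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"fmt"
	"os"
	"strings"
	"sync"
)

type opfMeta struct {
	Title   string `xml:"metadata>title"`
	Creator string `xml:"metadata>creator"`
}

func scanEPUB(path string) (opfMeta, error) {
	var meta opfMeta

	raw, err := os.ReadFile(path)
	if err != nil {
		return meta, err
	}
	zr, err := zip.NewReader(bytes.NewReader(raw), int64(len(raw)))
	if err != nil {
		return meta, err
	}

	// Simplification: take the first .opf we find instead of resolving
	// META-INF/container.xml as a real importer would.
	for _, f := range zr.File {
		if !strings.HasSuffix(f.Name, ".opf") {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return meta, err
		}
		err = xml.NewDecoder(rc).Decode(&meta)
		rc.Close()
		return meta, err
	}
	return meta, fmt.Errorf("%s: no OPF found", path)
}

func main() {
	var wg sync.WaitGroup
	for _, path := range os.Args[1:] {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			meta, err := scanEPUB(p)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				return
			}
			fmt.Printf("%s: %q by %q\n", p, meta.Title, meta.Creator)
		}(path)
	}
	wg.Wait()
}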

shermp commented 5 years ago

> Oh, and just so you know, the fastest time I've been able to scan an EPUB file in Go on the Kobo has been around 1.3s each (using goroutines for concurrency, not validating paths, and reading the zip in memory). You might be able to get it down to 1s with a faster XML parser, but not much faster than that. This includes the time it takes to write a batched insert to the db.

To be fair, Nickel's import process doesn't appear to be much quicker, if at all.

shermp commented 5 years ago

Just had another, different idea. It's a variation on what I've thought about previously involving SQL triggers (@davidfor may wish to look away now), although the new idea is a bit more... refined. Still a dirty hack though.

How about extending the DB schema to add a new table like ku_metadata_update? The table could have a contentid column and a set of metadata columns. Adding a book adds a record to this table. Then, define an AFTER INSERT trigger that updates the book record after Nickel adds it to the DB and deletes the corresponding row from the ku_metadata_update table.

Assuming Nickel isn't bothered by an extra unknown table of course...

EDIT: This assumes SQLite is flexible enough to do this, of course...
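
For illustration, a minimal sketch of what the KU side of that could look like: stage the metadata for a book that has just been copied over USBMS. The table name ku_metadata_update comes from the idea above; the columns other than contentid, the mattn/go-sqlite3 driver, and the example paths are assumptions, and the matching AFTER INSERT trigger on content would be installed separately (see the SQL below):

// Minimal sketch: stage metadata in the proposed ku_metadata_update table.
// Column names other than contentid, and all paths, are illustrative.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func stageMetadata(db *sql.DB, contentID, title, series, seriesNum string) error {
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS ku_metadata_update (
		contentid    TEXT NOT NULL PRIMARY KEY,
		title        TEXT,
		series       TEXT,
		seriesnumber TEXT
	)`); err != nil {
		return err
	}
	// A companion AFTER INSERT trigger on content would copy these values
	// across once Nickel imports the book, then delete this row.
	_, err := db.Exec(
		`INSERT OR REPLACE INTO ku_metadata_update (contentid, title, series, seriesnumber)
		 VALUES (?, ?, ?, ?)`,
		contentID, title, series, seriesNum,
	)
	return err
}

func main() {
	db, err := sql.Open("sqlite3", "/mnt/onboard/.kobo/KoboReader.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Illustrative ContentID for a sideloaded book.
	err = stageMetadata(db, "file:///mnt/onboard/example.epub",
		"Example Title", "Example Series", "1")
	if err != nil {
		log.Fatal(err)
	}
}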

pgaskin commented 5 years ago

@shermp, that's actually a great idea! I'll play around with it sometime this week. I'll probably make an experimental version of seriesmeta which uses this trick.

pgaskin commented 5 years ago

Here's a quick SQL thing I put together just now. I've done some manual testing with it, but I haven't tried with an actual import yet:

CREATE TABLE IF NOT EXISTS _seriesmeta (
  ImageId      TEXT NOT NULL UNIQUE,
  Series       TEXT,
  SeriesNumber INTEGER,
  PRIMARY KEY(ImageId)
);

DROP TRIGGER IF EXISTS _seriesmeta_insert;
DROP TRIGGER IF EXISTS _seriesmeta_update;
DROP TRIGGER IF EXISTS _seriesmeta_delete;

CREATE TRIGGER _seriesmeta_insert
  AFTER INSERT ON content WHEN
    /*(new.Series IS NULL) AND*/
    (new.ImageId LIKE "file____mnt_onboard_%") AND
    (SELECT count() FROM _seriesmeta WHERE ImageId = new.ImageId)
  BEGIN
    UPDATE content
    SET
      Series       = (SELECT Series       FROM _seriesmeta WHERE ImageId = new.ImageId),
      SeriesNumber = (SELECT SeriesNumber FROM _seriesmeta WHERE ImageId = new.ImageId)
    WHERE ImageId = new.ImageId;
    /*DELETE FROM _seriesmeta WHERE ImageId = new.ImageId;*/
  END;

CREATE TRIGGER _seriesmeta_update
  AFTER UPDATE ON content WHEN
    /*(new.Series IS NULL) AND*/
    (new.ImageId LIKE "file____mnt_onboard_%") AND
    (SELECT count() FROM _seriesmeta WHERE ImageId = new.ImageId)
  BEGIN
    UPDATE content
    SET
      Series       = (SELECT Series       FROM _seriesmeta WHERE ImageId = new.ImageId),
      SeriesNumber = (SELECT SeriesNumber FROM _seriesmeta WHERE ImageId = new.ImageId)
    WHERE ImageId = new.ImageId;
    /*DELETE FROM _seriesmeta WHERE ImageId = new.ImageId;*/
  END;

CREATE TRIGGER _seriesmeta_delete
  AFTER DELETE ON content
  BEGIN
    DELETE FROM _seriesmeta WHERE ImageId = old.ImageId;
  END;

pgaskin commented 5 years ago

Here's a better version, which also puts the metadata directly into the content table if the book has already been imported:

CREATE TABLE IF NOT EXISTS _seriesmeta (
    ImageId      TEXT NOT NULL UNIQUE,
    Series       TEXT,
    SeriesNumber TEXT,
    PRIMARY KEY(ImageId)
);

/* Adding series metadata on import */

DROP TRIGGER IF EXISTS _seriesmeta_content_insert;
CREATE TRIGGER _seriesmeta_content_insert
    AFTER INSERT ON content WHEN
        /*(new.Series IS NULL) AND*/
        (new.ImageId LIKE "file____mnt_onboard_%") AND
        (SELECT count() FROM _seriesmeta WHERE ImageId = new.ImageId)
    BEGIN
        UPDATE content
        SET
            Series       = (SELECT Series       FROM _seriesmeta WHERE ImageId = new.ImageId),
            SeriesNumber = (SELECT SeriesNumber FROM _seriesmeta WHERE ImageId = new.ImageId)
        WHERE ImageId = new.ImageId;
        /*DELETE FROM _seriesmeta WHERE ImageId = new.ImageId;*/
    END;

DROP TRIGGER IF EXISTS _seriesmeta_content_update;
CREATE TRIGGER _seriesmeta_content_update
    AFTER UPDATE ON content WHEN
        /*(new.Series IS NULL) AND*/
        (new.ImageId LIKE "file____mnt_onboard_%") AND
        (SELECT count() FROM _seriesmeta WHERE ImageId = new.ImageId)
    BEGIN
        UPDATE content
        SET
            Series       = (SELECT Series       FROM _seriesmeta WHERE ImageId = new.ImageId),
            SeriesNumber = (SELECT SeriesNumber FROM _seriesmeta WHERE ImageId = new.ImageId)
        WHERE ImageId = new.ImageId;
        /*DELETE FROM _seriesmeta WHERE ImageId = new.ImageId;*/
    END;

DROP TRIGGER IF EXISTS _seriesmeta_content_delete;
CREATE TRIGGER _seriesmeta_content_delete
    AFTER DELETE ON content
    BEGIN
        DELETE FROM _seriesmeta WHERE ImageId = old.ImageId;
    END;

/* Adding series metadata directly when already imported */

DROP TRIGGER IF EXISTS _seriesmeta_seriesmeta_insert;
CREATE TRIGGER _seriesmeta_seriesmeta_insert
    AFTER INSERT ON _seriesmeta WHEN
        (SELECT count() FROM content WHERE ImageId = new.ImageId)
        /*AND ((SELECT Series FROM content WHERE ImageId = new.ImageId) IS NULL)*/
    BEGIN
        UPDATE content
        SET
            Series       = new.Series,
            SeriesNumber = new.SeriesNumber
        WHERE ImageId = new.ImageId;
        /*DELETE FROM _seriesmeta WHERE ImageId = new.ImageId;*/
    END;

DROP TRIGGER IF EXISTS _seriesmeta_seriesmeta_update;
CREATE TRIGGER _seriesmeta_seriesmeta_update
    AFTER UPDATE ON _seriesmeta WHEN
        (SELECT count() FROM content WHERE ImageId = new.ImageId)
        /*AND ((SELECT Series FROM content WHERE ImageId = new.ImageId) IS NULL)*/
    BEGIN
        UPDATE content
        SET
            Series       = new.Series,
            SeriesNumber = new.SeriesNumber
        WHERE ImageId = new.ImageId;
        /*DELETE FROM _seriesmeta WHERE ImageId = new.ImageId;*/
    END;

You can uncomment the first commented-out condition in each trigger so that existing metadata is not replaced, and uncomment the last commented-out statement so that the metadata is only applied once.
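
As a usage sketch (not how seriesmeta necessarily does it), here is roughly how a Go tool could install the script above and stage a row. The mattn/go-sqlite3 driver, the file names, and the underscore-mangled ImageId implied by the LIKE pattern are all assumptions:

// Usage sketch: install the trigger script above (saved here as seriesmeta.sql)
// and stage series metadata for one book. Driver, paths, and the ImageId value
// are illustrative assumptions.
package main

import (
	"database/sql"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	script, err := os.ReadFile("seriesmeta.sql") // the SQL posted above
	if err != nil {
		log.Fatal(err)
	}

	db, err := sql.Open("sqlite3", "/mnt/onboard/.kobo/KoboReader.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Create the _seriesmeta table and (re)create the triggers.
	if _, err := db.Exec(string(script)); err != nil {
		log.Fatal(err)
	}

	// Stage metadata; the triggers apply it when Nickel imports the book,
	// or immediately if the book is already in content.
	_, err = db.Exec(
		`INSERT OR REPLACE INTO _seriesmeta (ImageId, Series, SeriesNumber) VALUES (?, ?, ?)`,
		"file____mnt_onboard_example_epub", // assumed mangled form implied by the LIKE pattern
		"Example Series", "1",
	)
	if err != nil {
		log.Fatal(err)
	}
}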

shermp commented 5 years ago

I was going to look into this myself, but if someone else is willing to do all the hard (cough SQL cough) stuff...

If it's an idea you think we should pursue further, perhaps we could spin it off as a standalone specification that could be used by other applications.

pgaskin commented 5 years ago

> If it's an idea you think we should pursue further, perhaps we could spin it off as a standalone specification that could be used by other applications.

I do think it is. I'm already working on adding this to seriesmeta, and I'll test it on my Kobo itself tomorrow. The biggest thing is there will need to be a way to prevent conflicts between multiple applications using this (which can be partly solved by uncommenting the lines I commented).

shermp commented 5 years ago

It might be enough to have a well-defined schema that applications agree to adhere to. And yeah, I think deleting the row once the update has been made might be a good idea.

pgaskin commented 5 years ago

Oops, wrong issue, see https://github.com/geek1011/kepubify/pull/43#issuecomment-538651599.