Some notes about thornyreader crengine fork

poire-z commented 5 years ago

New active fork of crengine found by @pazos at https://github.com/koreader/koreader/issues/4745#issuecomment-470086270:

https://github.com/irrationalunzen/thornyreader thornyreader

Spent the evening reading git log -p... so sharing some observations:

There had been some huge reformatting (indent, brackets...), renames (like tinyNodeCollection renamed to CrDomBase), reordering of sections of code, removal (many source files or code they didn't need was removed, the most surprising is all the crengine caching that I find is essential). All this makes the possibly interesting patches unappliable without much work. I've often been tempted to do that, for easier walking into the code - but never did, and I'm glad I didn't, now :)
They have a CreBrigde.cpp that acts a bit like our cre.cpp.
They don't seem to use much of crengine like we do - and it feels like it's used mosly as a parser and renderer to lay out text, and they pick up "boxes" (hitbox) of text like may be they do for the other engines (mupdf, djvu), and have some common code do more stuff with that - unlike us that delegate more stuff to crengine (like finding the nodes that is rendered at x/y full coordinantes). It seems they work page per page: render 1 or 2 pages, get the bitmap and some boxes, and that's nearly all.
They do a lot more dedicated XML re-parsing - and many checks for specific tags in the XML parsing code (like, if it's a "span", ignore it, if it's a "p", remove some last space) - which seems very ugly. It really feels like they didn't get the full picture of crengine, and did some small hacks here and there to quickly solve some localized problems. Or may be it's all their frontend/common code (that I didn't look at) that required that.
I didn't seen any real fix to the initial codebase (that was the main motivation for going thru all their commits - but it was huge, so I may have missed them). Their upstream starting commit (and possibly last sync) was https://github.com/buggins/coolreader/commit/1e07d15.

Some noticable stuff they added/changed (in bold, the things that could be of interest to us): (the commit links I linked to is usually the most interesting one in the set - but the features come each with many many commits and later fixes).

book loading with explicite doc format parsing request (not really needed for us)
CSS specificity base on declaration order (wrong, we implemented that the right way I think)
smart txt detection (?)
tag filtering (style, tables) - which seems like it limits what crengine can do
https://github.com/irrationalunzen/thornyreader/commit/d1427538b2d5486777a8935dd9465df4bf91c438 CSS selector parent class support (we didn't need that, it seems it works for us?)
https://github.com/irrationalunzen/thornyreader/commit/a3c43ea0c62e80d4682410eca77e9fa702ead9ac (and many more commits) multiple fallbacks font support (only for inspiration, we could probably do that simpler if needed)
https://github.com/irrationalunzen/thornyreader/commit/22ccd662596923985e479bc76be08728bc5d6eac getRect() refactored (which looks cleaner - but without all our fixes, extended, support for padding and borders)
https://github.com/irrationalunzen/thornyreader/commit/0dd89110d4f5e35ba7fee67aac4fd68c414eeae5 home made support for DOCX
https://github.com/irrationalunzen/thornyreader/commit/0127faa12526546b799df5b7612602047829ccb2 DOCx footnotes (all the commits mentionning footnotes are about docx, or their need to extract footnotes content (by still reparsing the EPUB xml...) to draw them themselves on the final page - so no interest for us
https://github.com/irrationalunzen/thornyreader/commit/debf43b8f3bf2ba9f0dd1b2c09db411108e311b4 https://github.com/irrationalunzen/thornyreader/commit/469b3590a399a5a5246b924cb2aa31fecdf2e870 in-page footnotes for EPUB (that we have) but done at EPUB parsing time (!)
https://github.com/irrationalunzen/thornyreader/commit/88691b80c6faaa7f122752d41ca6205132fd8ae0 alternative TOC (like we have)
https://github.com/irrationalunzen/thornyreader/commit/269d077834abd7876ebc5bff3296c18b43b27557 multiple authors in meta (like we have)
https://github.com/irrationalunzen/thornyreader/commit/6fb399702519a4e79578d25eaf726e98167817fb https://github.com/irrationalunzen/thornyreader/commit/cdad19d021fe0f26d2de7567da751e67e035ad43 Epub3 series metadata support (whatever that is)
https://github.com/irrationalunzen/thornyreader/pull/50 https://github.com/irrationalunzen/thornyreader/pull/51 Embedded styles support (?!) with a new parser/object , "class" oriented (@frankyifei 's multi classnames fixes might have avoided us the need for that kind of stuff)
https://github.com/irrationalunzen/thornyreader/pull/49 RTL right-to-left / BIDI support possibly the most interesting - it started with DrawTextString(... bool reverse) but things ended up with https://github.com/irrationalunzen/thornyreader/commit/59e8058954fccf560f1e42b2ed501282c98d3601 "Moved most RTL-related features to a separate file", so it's in include/rtlhandler.h and src/rtlhandler.cpp - and some use of that at parsing time and in lvtextfm.cpp. Probably not usable as is - and there are lots of arabic char codes, so they do the positionning/sizing/rendering manually - while we could probably have harfbuzz do some of that, as it already renders/draw individual words in rtl. But it's definitely something to read to see how we could start to go at that.

(so, in bold, the things that could be of interest to us - even if I'm not particularly interested by any of them :) but I'll surely be reading that RTL stuff - just to see how easy or risky it would be in regards to all the LTR text fixes we added).

Frenzie commented 5 years ago

That was fast. :-)

smart txt detection (?)

Not sure what you mean by that but also see our #39.

Probably not usable as is - and there are lots of arabic char codes, so they do the positionning/sizing/rendering manually - while we could probably have harfbuzz do some of that, as it already renders/draw individual words in rtl. But it's definitely something to read to see how we could start to go at that.

Yeah, plus HarfBuzz does it for Persian, etc., etc.

poire-z commented 5 years ago

smart txt detection (?)

Not sure, not my words :) CONFIG_CRENGINE_TXT_SMART_FORMAT and DOC_FLAG_TXT_NO_SMART_FORMAT were added in their first commit, and possibly disappeared later. Stuff about txt content parsing, checking more stuff like nb chars per line, some kind of reflowing, wrapping to 80 columns...

Didn't mention it above, but they did indeed many updates/fixes to the little document formats, like mobi (they included another library to do that), txt, rtf, docx... Which might explain why they didn't care much about the more complex CSS/HTML renderings like tables, etc...

Frenzie commented 5 years ago

@poire-z The repo seems to have vanished. O_o Do you still have a local copy?

xelxebar commented 5 years ago

FWIW, it looks like the author's GitHub account is still active. I just tried sending an email, requesting temporary access to the repo.

Frenzie commented 5 years ago

@xelxebar Besides giving some indications as to which part of the code to look at it may not have been super helpful for complex rendering changes like bidi or vertical judging by @poire-z's analysis above, but it seemed nice to have a relevant example in any case.

poire-z commented 5 years ago

Yes, I do have a local copy. I dunno what I'm allowed to do with it... If the original author has removed it, it would feel awkward to make it available again. I could share here as attachment the 2 or 3 files related to RTL (they would probably miss some stuff in other files - but that code should really not be used as is anyway).

link2xt commented 5 years ago

Here is another copy of thornyreader: https://github.com/readera/thornyreader It is used in the ReadEra EReader, see https://readera.ru/en/reader-book or https://readera.ru/reader-book for direct APK links.

@poire-z You are allowed to make it available again, since Cool Reader is released under the GPL license. By the way, thornyreader has license problems, LICENSE files contains GNU AGPL which is not compatible with the GPL license of Cool Reader. And the source code of ReadEra is not published.

link2xt commented 5 years ago

@poire-z You may want to write the author first anyway, of course, maybe he has reasons to try to remove this repository. I guess it was just moved to readera organization without announcement.

poire-z commented 5 years ago

Here is another copy of thornyreader: https://github.com/readera/thornyreader

Thanks for finding that one. But it's actually lagging by 669 commits from the repo I cloned some months ago. But that's good, as I guess I can share a patch file with the 516 commits involving the crengine/ subdirectory, that the really interested person can apply on their clone of that readera/thornyreader repo: thornyreader_crengine_after_20180207_af19707.patch.zip (Made with git format-patch --stdout af19707 -- crengine if that matters)

xelxebar commented 5 years ago

FWIW, I did contact the original author and got a reply. Apparently, they were developing thornyreader for their employer and got strong-armed into taking the repo down. It seems like they personally would like the repo to be available, but it's unclear to me whether the patchset is legally usable. The GPL only guarantees source availability once the binaries have been distributed, no?

link2xt commented 5 years ago

@xelxebar

The GPL only guarantees source availability once the binaries have been distributed, no?

Yes. But the binaries are distributed both on https://readera.ru/reader-book and https://play.google.com/store/apps/details?id=org.readera

ReadEra uses Cool Reader based engine, which is licensed under GPL. GPL does not require authors to publish the source, but any user can demand the source code from them. And since binaries are freely distributed, you can easily become the user.

EDIT: I would not contact the ReadEra with these demands, though. Better contact the authors directly to avoid any consequences for them (though not following the GPL is obviously their employer's fault).

link2xt commented 5 years ago

It seems like they personally would like the repo to be available, but it's unclear to me whether the patchset is legally usable.

It may be not legally usable in one corner case: if the authors took some code that was licensed under the GPL-incompatible license and used it in their GPL software fork. That would be illegal in the first place so the GPL cannot make this combined codebase legal even if the users were not aware of the origin of the code.

poire-z commented 5 years ago

As I mentionned in the first post, there is really not much we'd want to borrow, or even could borrow (because their use of crengine looked so different than ours). So, no real need to get answers to these legal questions. The patches are just something one can have a look at to see how these features have been done on this other planet, like we would probably look at some other software written in a different language - some kind of tourism. (The only thing that felt like something we could possibly pick is the DOCX support - but I don't think there's that much interest in having it.)

poire-z commented 4 years ago

Now that we got DocX support from upstream, and our own RTL/Bidi support, nothing much to pick from that fork. Just cut and pasting the "Epub3 series metadata support" logic - in case we some day need to look at these other fields and properties:

//epub3 series metadata
if(series.empty()) {
    struct titleitem { lString16 title; lString16 id; };
    LVArray<titleitem> titles;
    for (int i = 1; i < 20; i++) {
        lString16 xpath = lString16("package/metadata/title[") << fmt::decimal(i) << "]";
        ldomNode *item = dom->nodeFromXPath(xpath);
        if (!item)
            break;
        titleitem curr;
        curr.id = item->getAttributeValue("id");
        curr.title = item->getText().trim();
        titles.add(curr);
    }
    lString16 series_id;
    for (int i = 1; i < 20; i++) {
        lString16 xpath = lString16("package/metadata/meta[") << fmt::decimal(i) << "]";
        ldomNode *item = dom->nodeFromXPath(xpath);
        if (!item)
            break;
        lString16 property = item->getAttributeValue("property");
        lString16 content = item->getText().trim();
        if (property == "collection-type" && content == "series") {
            series_id = item->getAttributeValue("refines");
            if (series_id.startsWith("#"))
                series_id = series_id.substr(1, series_id.length() - 1);
            break;
        }
    }
    if (!series_id.empty()) {
        for (int i = 0; i < titles.length(); i++) {
            titleitem curr = titles.get(i);
            if (curr.id == series_id) {
                series = curr.title;
                break;
            }
        }
        for (int i = 1; i < 20; i++) {
            lString16 xpath = lString16("package/metadata/meta[") << fmt::decimal(i) << "]";
            ldomNode *item = dom->nodeFromXPath(xpath);
            if (!item)
                break;
            lString16 refines = item->getAttributeValue("refines");
            lString16 property = item->getAttributeValue("property");
            lString16 content = item->getText().trim();
            if (refines == series_id && property == "group-positon") {
                series_number = content.atoi();
                break;
            }
        }
    }
}
//if series still empty: last attempt, for property="belongs-to-collection"
if(series.empty()) {
    lString16 series_id;
    for (int i = 1; i < 20; i++) {
        lString16 xpath = lString16("package/metadata/meta[") << fmt::decimal(i) << "]";
        ldomNode *item = dom->nodeFromXPath(xpath);
        if (!item)
            break;
        lString16 property = item->getAttributeValue("property");
        if (property == "belongs-to-collection") {
            series = item->getText().trim();
            series_id = item->getAttributeValue("id");
            break;
        }
    }
    if(!series_id.empty() && series_number == 0) {
        for (int i = 1; i < 20; i++) {
            lString16 xpath = lString16("package/metadata/meta[") << fmt::decimal(i) << "]";
            ldomNode *item = dom->nodeFromXPath(xpath);
            lString16 property = item->getAttributeValue("property");
            lString16 id = item->getAttributeValue("refines");
            if (id.startsWith("#"))
                id = id.substr(1, id.length() - 1);
            if (id == series_id && property == "group-position") {
                series_number = item->getText().trim().atoi();
                break;
            }
        }
    }
}

poire-z commented 4 years ago

All the interesting stuff in bold has since been implemented via some original work. So nothing more to look at for consideration in this no more available fork.

koreader / crengine

Some notes about thornyreader crengine fork #267