tfmorris opened this issue 9 years ago
Thanks for delving into this, Tom. Sounds like you have a fix in hand - hopefully the OL folks will be willing to apply it globally. I wonder if there's enough metadata in the output documents to be able to tell if the "chapter start pages" are missing from specific books quickly, though.
I'm not 100% sure I follow what you're asking, but the file with the necessary information is available. For this example, if you look at https://archive.org/download/billgalactichero00harr/billgalactichero00harr_scandata.xml or, more generally, https://archive.org/download/<identifier>/<identifier>_scandata.xml, and search for

<pageType>Chapter</pageType>

you can tell whether a particular scan will be affected by this bug. If you find an example where the corresponding _abbyy.gz file is public, I can re-run the processing to test the fix.
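For instance, a quick remote check of a single item (just a sketch using the URL pattern above; it is not part of the epub toolchain) could look like:

import urllib.request
import xml.etree.ElementTree as ET

def page_types(identifier):
    # Fetch the item's _scandata.xml and collect its pageType values.
    url = 'https://archive.org/download/{0}/{0}_scandata.xml'.format(identifier)
    with urllib.request.urlopen(url) as resp:
        return {el.text for el in ET.parse(resp).iter('pageType')}

types = page_types('billgalactichero00harr')
print(types)
print('Affected by this bug:', 'Chapter' in types)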
Just wondering if there's an easy way to determine which books need to be reprocessed. It seems pretty heavyweight to re-run every book in the collection if only a fraction have this problem. Might also be nice to tag the books with the problem until they're reprocessed, or pull the damaged epubs from the lending interface - but that's dependent on someone inside OL being willing to do that work, I suppose. Maybe just validating your fix and asking them to redo everything is the right answer.
It's not easy for me to do remotely, but it's easy to do at scale at IA. It's simply a matter of scanning every _scandata.xml file for the signature string. The files are typically 10s to 100s of KB, so it's a pretty lightweight job.
It's not a big deal to regenerate all the epubs -- and I'm looking forward to seeing how many books are going to cause print('Unexpected page type: ' + page_type) to fire.
In hindsight that code should have always been there! Good catch, Tom.
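For reference, that guard is just a fallback branch in the per-page dispatch. A minimal sketch of its shape (the handled set and function name here are illustrative, not the real code):

HANDLED_PAGE_TYPES = {'normal', 'title', 'copyright', 'cover'}  # illustrative subset

def process_page(page_type):
    if page_type not in HANDLED_PAGE_TYPES:
        print('Unexpected page type: ' + page_type)
        return  # skip pages the converter doesn't know how to handle
    # ... existing per-type handling goes here ...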
Better and better. @wumpus, just curious: what's the (approx.) total book count of the IA collection?
@wumpus If IA were to think about regenerating all epubs, I hope they'd consider fixing some of the other bugs first (e.g. there have been a number of reports of poorly performing header/footer identification code).
Oh, and the code is completely untested. Still waiting on example data...
@tfmorris if by "poorly performing" you mean that virtually every epub from OL I've looked at liberally sprays the page headers and footers in among the body text, agreed. But if it's not actually a big deal to regenerate the entire epub corpus, hopefully they'll do that whenever a significant fix hits the toolchain.
@oddhack2 The header/footer code is in common.par_is_pageno_header_footer(), which you can open a separate issue for. It's something you can debug directly with the _abbyy.gz files that you can download, since it's so common.
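For anyone who wants to dig into it, the usual shape of such a page-number heuristic looks something like this (illustrative only, not the actual common.par_is_pageno_header_footer()):

import re

def looks_like_pageno_header_footer(text):
    # Headers/footers are short: a bare page number, possibly with
    # punctuation, or a roman-numeral front-matter page number.
    text = text.strip()
    if len(text) > 20:
        return False
    if re.fullmatch(r'[-.\s]*\d{1,4}[-.\s]*', text):   # e.g. "- 42 -"
        return True
    if re.fullmatch(r'[ivxlcdm]{1,7}', text.lower()):  # e.g. "xiv"
        return True
    return False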
@tfmorris for this one, you could search the PD books for something that has a 'chapter' example: pick a collection, download 100 or 1000 _abbyy.gz files, hack the epub code to crash if 'chapter' or an unknown page_type is present, run the epub code on them, and note the crashes.
Hm, having asserted this could be done, I tried it, and it's more complicated than it looks. If I get it working before I finish this bottle of wine, I'll report back.
@wumpus I thought the abbyy.gz files were not downloadable for anyone not inside the IA team - do I misunderstand? When I try to download from the link @tfmorris gives above, I just get "Item not available: The item is not available due to issues with the item's content."
It is a general problem with trying to help out: essentially all the source data is inaccessible, and all we can access as external users is the DRMed PDFs and epubs coming out of the backend. If there's some efficient way of learning which books are public domain and do have all that data available, that would be great. I have no idea how to determine that, and starting from individual edition pages in OL and figuring out which might have the source available on IA is a big manual task I'm not particularly motivated for :-(
I agree this is a bit frustrating, but it is what it is.
Using the command-line tools, if you include '-collection:printdisabled' in your search string, that will restrict your search to books for which you can download all of the material. For this project, I just selected 1,000 books from the Americana collection, which you can download:
ia-mine --search 'collection:americana -collection:printdisabled' --itemlist | head -n 1000 > 1000.not-printdisabled
ia-mine is one of two extremely useful command-line tools (the other being ia); check out https://github.com/jjjake/internetarchive and https://github.com/jjjake/iamine
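If you'd rather drive it from Python, the same selection can be done with the internetarchive library (a sketch matching the query above; search_items streams results as you iterate):

from itertools import islice
from internetarchive import search_items

# First 1,000 americana items that are not print-disabled.
search = search_items('collection:americana -collection:printdisabled')
ids = [r['identifier'] for r in islice(search, 1000)]
with open('1000.not-printdisabled', 'w') as f:
    f.write('\n'.join(ids) + '\n')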
OK, so my search of 1000 books for an americana not-printdisabled book that has an un-handled abbyy page type came up empty. I've got some other downloads running while I sleep. (Ran out of wine.)
Examples of type "chapter"
Unexpected page type: chapter in book advancedalgebras00senk
Unexpected page type: chapter in book americanbar1979jcfi
Unexpected page type: chapter in book americanbar1986jcfi
Unexpected page type: chapter in book americanbar1993jcfi
Unexpected page type: chapter in book americanpetroleu02baco
Unexpected page type: chapter in book analysedesharns01wies
Unexpected page type: chapter in book analysedesharns02wies
Unexpected page type: chapter in book analysiscostofre00geph
Unexpected page type: chapter in book andestheworldswi00tony
Unexpected page type: chapter in book artsdecoration1314newy
Unexpected page type: chapter in book artsdecoration1415newy
Also, if you're interested, here are the other page types which are currently skipped. Some of these are probably OK ("blank tissue") and others should be fixed, but they're lower priority than chapter.
Unexpected page type: blank tissue in book christianunityi000dave
Unexpected page type: blank tissue in book guidetosuccessfu00silv
Unexpected page type: foldout in book aliensoramerican00grosrich
Unexpected page type: foldout in book deeersteschipvaa02rouf
Unexpected page type: foreword in book lessonsfromenemy00mcdi
Unexpected page type: foreword in book radiumreportofme02newy
Unexpected page type: index in book connoisseurillus01lond
Unexpected page type: index in book connoisseurillus04lond
Unexpected page type: introduction in book artofanaesthesiae2flag
Unexpected page type: introduction in book cataractseniletr00fish
Unexpected page type: preface in book agriculturalecon00nour
Unexpected page type: preface in book airservicemedica00unit
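For what it's worth, a tally like the above is a few lines of Python over locally downloaded files (a sketch; it assumes the *_scandata.xml files sit in the current directory):

import glob
from collections import Counter
import xml.etree.ElementTree as ET

# Count every pageType value across all downloaded scandata files.
counts = Counter()
for path in glob.glob('*_scandata.xml'):
    counts.update(el.text for el in ET.parse(path).iter('pageType'))
for page_type, n in counts.most_common():
    print(page_type, n)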
BTW "epub" isn't that hard to get running, here's what I had to do
sudo pip install iamine internetarchive
sudo apt-get install libxslt-dev
sudo apt-get install libz-dev
sudo pip install lxml
And then it was
cd /path/to/downloaded/files
ia download ID  # or you can download them in your browser, your choice
cd wherever-epub-repo-is  # needed for it to find its python libraries
./convert_iabook.py ID /path/to/downloaded/files/ID
You need to have downloaded at least the _abbyy.gz, _metadata.xml, _scandata.xml, and _jp2.zip or _tif.zip, whichever exists.
A bit complicated! I have more examples (392 total) if you want to do more testing.
Looks to me like this is going to be solved internally @ the archive. If you look in _scandata.xml, the addToAccessFormats true/false field was intended to be used to select what pages appear in formats like epub. "chapter" is one of the recently-added pageTypes.
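In other words, rather than whitelisting pageType values, page selection can key off that per-page flag. A minimal sketch of how such a selection might look (element names as they appear in _scandata.xml):

import xml.etree.ElementTree as ET

def accessible_pages(scandata_path):
    # Yield the <page> elements whose addToAccessFormats flag is true,
    # regardless of pageType -- future-proof against newly added types.
    for page in ET.parse(scandata_path).iter('page'):
        flag = page.findtext('addToAccessFormats', default='false')
        if flag.strip().lower() == 'true':
            yield page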
Thank you, @tfmorris, for spotting the problem!
If we'd like to work on page headers/footers (I'm game), I think we should open a separate issue.
Well, the original patch wasn't even valid Python, plus it didn't handle a bunch of the additional types besides Chapter, but PR #34 has had some limited testing and appears to handle chapter, subchapter, introduction, preface, foreword, and appendix. It also fixes a small cut & paste error where the copyright page had the title "Title Page" instead of "Copyright Page".
If you don't have access to the proprietary, commercially licensed JPEG2000 decompressor, you may want to test it in conjunction with the code that's on the openjpeg branch.
It's been tested against a handful of files: airservicemedica00unit, americanbar1993jcfi, artofanaesthesiae2flag, connoisseurillus01lond, lessonsfromenemy00mcdi
Tom, as I mentioned above, this has been fixed a different way in our internal repo already, using the future-proofed addToAccessFormats field. That fix has now been QAed and at some point we're going to rederive all epubs for the past 3 years. It's still on my plate to see if I can get the internal repo synchronized with this one.
OK, color me confused then. The current code already checks addToAccessFormats: https://github.com/internetarchive/epub/blob/master/iabook_to_epub.py#L106 but it also checks for expected page types and I don't see any commits/PRs that change that behavior. Why is this bug still open? A simple commit with "fixes #31" will close it.
Is this secret repo under a different GitHub organization than https://github.com/internetarchive/, or is it just a private repo under that org for their non-open-source changes?
What exactly is the Internet Archive's position on open source software anyway?
p.s. @gdamdam @wumpus Before wasting cycles on re-running the ePub job, you should fix some of the other egregious errors. The program was basically abandoned when it was half complete. It's a proof of concept that was put into production without even minimal QA.
cc: @bfalling
From Jon Leach on the ol-tech mailing list:
…to go on? I tried creating an account on webarchive.jira.com to see if the reported issue matched what I was seeing, but while I could create an account, I still got a "Permission violation" error page when trying the link Hank provided.
Just for comparison, though, here's an example:
https://openlibrary.org/books/OL7433769M/Bill_the_Galactic_Hero
The epub has 7 pages of scanned boilerplate (cover, title, copyright, etc.), then the first page in the 'Pages' section starts as shown in the attached screendump (which is actually a mix of the dedication page and page 2 of the text, skipping over page 1 of chapter 1 entirely). There may be (probably are but I haven't checked) other missing pages further in the text.
This is representative of what I've seen with lots of other epubs, and the reason I've given up on borrowing anything but PDFs despite the size / slow downloads. The PDF and in-browser versions are not missing pages.