sciencehistory / scihist_digicoll

Science History Institute Digital Collections

Search within work: evaluate Internet Archive BookReader #2463

Closed eddierubeiz closed 4 months ago

eddierubeiz commented 9 months ago

We now have a fair number of works with full text. It's time to turn our attention to making search results take the user to a specific page within the work (instead of to the work page).

Use case: you search for the word "owl", and for each matching work, you get a set of links to each matching page, instead of a single link to the work.
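To make the use case concrete, here is a minimal sketch of page-level matching. It assumes each work carries an array of per-page OCR text; the `findMatchingPages` name and the `{ id, pages }` data shape are hypothetical, not part of the digital collections code, and a real implementation would likely use a search index rather than a substring scan.

```javascript
// Hypothetical data shape: works = [{ id: 'abc', pages: ['page 1 text', ...] }]
// Returns one entry per matching work, listing the 1-based page numbers that match.
function findMatchingPages(works, term) {
  const needle = term.toLowerCase();
  const results = [];
  for (const work of works) {
    const pages = [];
    work.pages.forEach((text, index) => {
      // Case-insensitive substring match; note that "owl" also matches "bowl".
      if (text.toLowerCase().includes(needle)) {
        pages.push(index + 1);
      }
    });
    if (pages.length > 0) {
      results.push({ workId: work.id, pages });
    }
  }
  return results;
}
```

Each page number in the result could then be rendered as a link that opens the viewer at that page, rather than a single link to the work.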

Based on the discussion in the wiki, evaluate the Internet Archive BookReader, as a way to provide this functionality.

For now, this will not be part of the digital collections code, but a proof of concept that will stand alone, using a sample of a few pages from the digital collections.

eddierubeiz commented 7 months ago

I'm wrapping up work on this. My report is at: https://sciencehistory.atlassian.net/wiki/spaces/HDC/pages/2285371393/Internet+Archive+BookReader

jrochkind commented 7 months ago

For the record, some emails from Drini, BookReader developer.


Hi Jonathan and Eddie,

I'm glad you like the UX! And fantastic work on your archive, that seems like a great initiative -- and very well designed, too. It seems like a good use case for BookReader. We are actively maintaining BookReader for use on archive.org, so it will be getting regularly updated. There might be some bumps when integrating it into third-party sites, since we don't do that very often, but I'm happy to help iron out any issues you discover.

You are correct, the IIIF demo was very out-of-date. I spent some time and just opened a PR to fix it up. If you spin up a IIIF server, you can paste a manifest URL here to try it out: https://deploy-preview-1312--lucid-poitras-9a1249.netlify.app/bookreaderdemo/demo-iiif . IIIF is convenient since it's a standard that works with a number of different viewers, and it's used by a lot of archives and libraries, so you'd be in good company!

For zoom levels, yep it does support it! It needs the width/height of each page, and then you can specify the getPageURI option to be a custom function. This function is given the page index and the reduce factor -- i.e. how much to shrink the image. For Internet Archive, we do something like:

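     // Keep a reference to the built-in implementation so we can extend it.
     const builtInGetPageURI = BookReader.prototype.getPageURI;
     options.getPageURI = function(index, reduce, rotate) {
       let uri = builtInGetPageURI.call(this, index, reduce, rotate);
       // Append the scale parameter using the right separator.
       uri += uri.indexOf('?') > -1 ? '&' : '?';
       uri += 'scale=' + reduce;
       return uri;
     };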

And that gets it working with our API, which uses a scale url parameter.

For the text features (search, text selection, text-to-speech), that should also be possible, but will require exposing some options/API shapes which we might not have exposed before. I'll help code review any changes you want to make to BookReader core, or help massage it in any way myself as well if necessary.

TLDR: It's possible and I believe feasible, but it will take a bit of massaging to work out the wrinkles 😁 Let me know if you have any other questions!


Hi Jonathan,

Yep your interpretation of reduce is accurate. I don't believe we have a limit on the reduce, so it should be able to go larger than 4 -- I just tested with some of our books and it goes to at least 8. Are you inputting the correct/maximum width/height in the data section for your options object? That influences how much it zooms in/out. If your APIs support it, you can also switch to non-powers of two by changing the option reduceSet from pow2 to integer. That'll forward along a reduce that is any integer. But the caveat with that is that it loads more often; that's why we use pow2 as our default.
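As a rough illustration of the reduce discussion above, the sketch below shows how a reduce factor maps to delivered image dimensions, and how a pow2 set yields fewer distinct factors than an integer set. This is not BookReader's internal code: `snapReduce` and `scaledSize` are hypothetical helpers, and rounding the factor down is an assumption made for the example.

```javascript
// Illustrative only; not taken from BookReader.
// Snap a requested reduce factor to the allowed set ('pow2' or 'integer').
function snapReduce(requested, reduceSet = 'pow2') {
  const r = Math.max(1, requested);
  if (reduceSet === 'integer') {
    return Math.floor(r);
  }
  // pow2: largest power of two that does not exceed the requested factor.
  let p = 1;
  while (p * 2 <= r) {
    p *= 2;
  }
  return p;
}

// Pixel dimensions of a page image after shrinking by the reduce factor.
function scaledSize(width, height, reduce) {
  return {
    width: Math.ceil(width / reduce),
    height: Math.ceil(height / reduce)
  };
}
```

Under these assumptions, a 5000x7000 page at reduce 4 would be requested at 1250x1750. Between factors 1 and 8 the pow2 set only ever requests four sizes (1, 2, 4, 8), while the integer set can request eight, which is consistent with the caveat that integer mode loads more often.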

Yeah, BookReader doesn't currently support tiling for its images! That's something we're interested in, but isn't on a near-term road map right now since the performance has been sufficient for our needs. For reference our images are usually full colour as well, and at reduce=1 can be ~5000x7000 JPGs at ~4.5MB, and they work well. But we also self-host! You'll have to do some testing and see if it's sufficient for you.

Oh that's awesome that you got search working, fantastic! If you have your code/demo up somewhere I'd love to take a look and see if there's anything we can improve on our end to make integrations like this easier! And also just to play around with it for fun :D

Yes rotate would be a super handy feature! It's one of the more often requested ones by our users :P If you want to give it a tackle, please! I'm aiming to work on it this year, but likely not for a few months.

Text Selection is easier to integrate than search, honestly :D You just need to convert your OCR file to DjVu XML format, and specify the endpoint where BookReader can fetch a single page-worth of DjVuXML. Here's a sample DjVu XML: https://ia903103.us.archive.org/14/items/goodytwoshoes00newyiala/goodytwoshoes00newyiala_djvu.xml
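For a sense of what the OCR conversion might involve, here is a minimal sketch that serializes one page of word boxes into a DjVu-XML-style fragment. The input shape (`page.lines`, and words with `left`/`bottom`/`right`/`top`) and the `pageToDjvuXml` name are hypothetical. The element nesting and the left,bottom,right,top order of `coords` are assumptions based on a reading of Internet Archive's DjVu XML files, so check them against the sample linked above before relying on them.

```javascript
// Escape characters that are not allowed raw in XML text or attributes.
function escapeXml(s) {
  return String(s)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// page: { width, height, lines: [[{ text, left, bottom, right, top }, ...], ...] }
// Assumed nesting: OBJECT > HIDDENTEXT > PAGECOLUMN > REGION > PARAGRAPH > LINE > WORD
function pageToDjvuXml(page) {
  const lines = page.lines.map((words) => {
    const wordXml = words.map((w) =>
      // Assumed coords order: left,bottom,right,top
      `<WORD coords="${w.left},${w.bottom},${w.right},${w.top}">${escapeXml(w.text)}</WORD>`
    );
    return ['<LINE>', ...wordXml, '</LINE>'].join('\n');
  });
  return [
    `<OBJECT width="${page.width}" height="${page.height}">`,
    '<HIDDENTEXT>', '<PAGECOLUMN>', '<REGION>', '<PARAGRAPH>',
    ...lines,
    '</PARAGRAPH>', '</REGION>', '</PAGECOLUMN>', '</HIDDENTEXT>',
    '</OBJECT>'
  ].join('\n');
}
```

An endpoint serving one such fragment per page is the kind of thing the email describes; how BookReader is pointed at that endpoint is not covered here.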

Would you like to be added to our slack channel? I can more quickly answer your questions there if you have them :) Let me know if so, and I'll add your names/emails as they appear here!


jrochkind commented 6 months ago

From Drini:

Yeah the setup is the hardest part! I've opened a PR that updates the README to include the actual current setup. It's kind of annoying, but it works. https://github.com/internetarchive/bookreader/pull/1321

The webcomponent is just a wrapper shell around bookreader, so the rest of the setup is the same.
