Mediatype: texts - should use dweb instead of direct calls

mitra42 commented 6 years ago

The system currently supports text through Richard's player. At some point it needs to work with that player so that it can be decentralized.

mitra42 commented 6 years ago

Some previous notes on this: see Richard Caceres' recent Slack chat about the new version (~/git/ia_bookreader): https://github.com/internetarchive/bookreader

Brewster mentioned you were using bookreader, and I should let you know there's a new version that's much easier to use outside of IA.

---- OLD notes ----

Richard sent some links in Skype. Need to either a) use Mek's IIIF reader, or b) use the bookreader: get the JSON file (could decentralize), and then inside JSIA is a way to get the page.

view source here: https://archive.org/stream/10_PRINT_121114#page/n0/mode/2up

datafile: https://ia902603.us.archive.org/BookReader/BookReaderJSIA.php?id=10_PRINT_121114&itemPath=/4/items/10_PRINT_121114&server=ia902603.us.archive.org&format=jsonp&subPrefix=10_PRINT_121114&version=aHe9koCh&callback=jQuery11020015266682030218304_1515798833273&_=1515798833274

bookreader initialization library: https://archive.org/bookreader/BookReaderJSIA.js?v=aHe9koCh

image example: https://ia902603.us.archive.org/BookReader/BookReaderImages.php?zip=/4/items/10_PRINT_121114/10_PRINT_121114_jp2.zip&file=10_PRINT_121114_jp2/10_PRINT_121114_0000.jp2&scale=4&rotate=0 (90 kB)
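For reference, the image URL above can be split into its parts with the standard URL API. This is only a sketch: parseImageUrl is a hypothetical helper name (not part of BookReader), and it assumes the query parameters seen in this one example (zip, file, scale, rotate) are the ones that matter.

```javascript
// Sketch: split a BookReaderImages.php URL into its parts.
// parseImageUrl is a hypothetical helper, not a BookReader API.
function parseImageUrl(imageUrl) {
  const url = new URL(imageUrl);
  const params = url.searchParams;
  return {
    host: url.hostname,
    zip: params.get('zip'),       // path of the zip on the datanode
    file: params.get('file'),     // path of the page image inside the zip
    scale: params.get('scale'),   // null if the parameter is absent
    rotate: params.get('rotate'), // null if the parameter is absent
  };
}

const parsedExample = parseImageUrl(
  'https://ia902603.us.archive.org/BookReader/BookReaderImages.php' +
  '?zip=/4/items/10_PRINT_121114/10_PRINT_121114_jp2.zip' +
  '&file=10_PRINT_121114_jp2/10_PRINT_121114_0000.jp2&scale=4&rotate=0'
);
console.log(parsedExample);
```

Having zip and file as separate values is what would let a local server or a dweb transport decide where to fetch the page from.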

mitra42 commented 5 years ago

Notes from revisiting this. Open questions: [ ] How to view PDFs, and/or how to make the .jpg files

Research steps [ ] Look at https://github.com/internetarchive/bookreader BookReaderDemo/demo-simple.html and BookReaderJSSimple.js

What I've found ... The main JSON control file is: [https://openlibrary.org/query.json?type=/type/edition&*=&ocaid=zandvoort.newspapers.1992.zandvoorts.nieuwsblad&callback=jQuery110207786013323137531_1545886524531&_=1545886524532] which says it's application/javascript but is actually application/json. [Question posed to Richard] It's not clear to me how to pass this to bookreader.

It contains URLs like [https://ia802605.us.archive.org/BookReader/BookReaderImages.php?zip=/9/items/zandvoort.newspapers.1992.zandvoorts.nieuwsblad/1992.Zandvoorts.Nieuwsblad_jp2.zip&file=1992.Zandvoorts.Nieuwsblad_jp2/1992.Zandvoorts.Nieuwsblad_0000.jp2] for page 0. It's not clear to me whether these are formulaic, but it probably doesn't matter. dweb-mirror should be able to pull the zip and then edit the URLs in the control file before passing it to bookreader; dweb-archive would also have to intercept where BookReader fetches these files.

There is a strange URL [https://openlibrary.org/query.json?type=/type/edition&*=&ocaid=WillieLynchLetter1712&callback=jQuery11020018995238347655485_1545885427175&_=1545885427176] which says it's application/json but actually returns application/javascript.
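Both query.json URLs carry a callback= parameter, which is the usual sign of a JSONP request, so a body wrapped in a function call is a possibility whatever the content type says. I haven't confirmed what the server returns without that parameter. A defensive sketch that accepts either form, where parseJsonOrJsonp is a hypothetical helper name, not an Open Library or BookReader API:

```javascript
// Sketch: accept either plain JSON or a JSONP-wrapped body and return the parsed value.
// parseJsonOrJsonp is a hypothetical helper for illustration only.
function parseJsonOrJsonp(body) {
  const text = body.trim();
  // JSONP looks like: callbackName( ...json... ) with an optional trailing semicolon.
  const match = text.match(/^[\w$.]+\s*\(([\s\S]*)\)\s*;?$/);
  // If there is no wrapper, treat the whole body as JSON.
  return JSON.parse(match ? match[1] : text);
}
```

For example, parseJsonOrJsonp('cb([{"ocaid":"x"}]);') and parseJsonOrJsonp('[{"ocaid":"x"}]') should both give the same array.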

Options

Fetch PDF etc. and view in an iframe - need to figure out supported formats
Get images as files - need to figure out how to find image URLs like above, how to sync those, and then pass to bookreader
Get zipfile and JSON, edit the JSON to use local URLs and pass to bookreader, and/or intercept where bookreader pulls the files. (The latter would be hard/impossible as it is running in an iframe.)

mitra42 commented 5 years ago

I’m trying to figure out a strategy to do this in both the dweb and offline cases. It's tricky in both.

For dweb.archive.org I think I have to:

Pull the metadata (via dweb as usual)
Pull the JSON (via dweb)
Have the gateway server push the images into IPFS etc., and modify the JSON returned to point at those locations. (Non-trivial)
Find the place in the book reader where it fetches files and have it go to DwebTransports with those dweb URLs

For dweb-mirror (offline), where there is a local server:

Pull metadata and cache it
Pull JSON (unmodified) from Archive but modify URLs just to strip the hostname before caching and passing to browser.
Pull the Zip file and cache it on local server
Either unzip the file, or find an npm module that can unzip one file at a time.
Book reader will then access local server with URL it can interpret and return each file
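The "strip the hostname" step could look roughly like this. It is a sketch only: the function name and the hostname pattern (any host ending in archive.org) are my assumptions, and it only rewrites strings that start with such a URL.

```javascript
// Sketch: recursively rewrite every string in a parsed JSON value,
// turning absolute *.archive.org URLs into host-relative ones.
// ARCHIVE_HOST and stripArchiveHost are hypothetical names for illustration.
const ARCHIVE_HOST = /^https?:\/\/(?:[a-z0-9-]+\.)*archive\.org(?=\/)/i;

function stripArchiveHost(value) {
  if (typeof value === 'string') return value.replace(ARCHIVE_HOST, '');
  if (Array.isArray(value)) return value.map(stripArchiveHost);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, stripArchiveHost(inner)])
    );
  }
  return value; // numbers, booleans and null pass through unchanged
}
```

The browser then resolves the relative URLs against whatever server delivered the page, which in the dweb-mirror case is the local server holding the cached zip.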

mitra42 commented 5 years ago

Done: ./crawl.js --level all zandvoort.newspapers.1992.zandvoorts.nieuwsblad, but it missed the big files (>700 MB for the zip)

mitra42 commented 5 years ago

(Note to self - see EN/Dweb - Archive - Text)

mitra42 commented 5 years ago

For an example of a text item with multiple "books", try https://archive.org/details/ialerequestsummary. The books are one page each.

mitra42 commented 5 years ago

EDITED: Background info: Multipage books thetaleofpeterra14838gut or alicesadventures19033gut are reasonably small but are displaying as a slide carousel. [https://archive.org/search.php?query=mediatype:texts%20AND%20imagecount:8] shows small ones, and unitednov65unit is an example.

mitra42 commented 5 years ago

[ ] Figure out what switches slide carousel or bookreader

mitra42 commented 5 years ago

From Jeff Kaplan: typically if an item is mediatype=texts and there is an abbyy and pdf file then it will result in a bookreader presentation. Loose images would not result in a pdf or bookreader presentation. And an item with abbyy and pdf that is not mediatype=texts would have no bookreader presentation; it would need to be mediatype=texts.
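Written as a check, the rule as Jeff describes it would be something like the sketch below. This is not the actual archive.org logic. It assumes the item looks like { metadata: { mediatype }, files: [{ name }] } and that abbyy and pdf files can be recognised by filename suffix; both are my assumptions, as is the function name.

```javascript
// Sketch of the rule described above; not the real archive.org decision code.
// Assumes item = { metadata: { mediatype }, files: [{ name }] } and that
// ABBYY and PDF files can be recognised by their filename suffix.
function wouldUseBookReader(item) {
  const names = (item.files || []).map((file) => file.name.toLowerCase());
  const hasAbbyy = names.some((name) => /_abbyy\.(gz|xml)$/.test(name));
  const hasPdf = names.some((name) => name.endsWith('.pdf'));
  return item.metadata.mediatype === 'texts' && hasAbbyy && hasPdf;
}
```

Anything that fails this check (loose images, or the wrong mediatype) would fall through to some other presentation such as the slide carousel.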

mitra42 commented 5 years ago

See #109 for a failure case (Peter Rabbit) that should use the slide carousel

internetarchive / dweb-archive

Mediatype: texts - should use dweb instead of direct calls #85