internetarchive / dweb-archive

GNU Affero General Public License v3.0
Mediatype: texts - should use dweb instead of direct calls #85

Open mitra42 opened 6 years ago

mitra42 commented 6 years ago

The system currently supports text through Richard's player. At some point it needs to work with that player so that it can be decentralized.

mitra42 commented 6 years ago

Some previous notes on this: see Richard Caceres's recent Slack chat about the new version (~/git/ia_bookreader): https://github.com/internetarchive/bookreader

Brewster mentioned you were using bookreader, and I should let you know there's a new version that's much easier to use outside of IA.

---- OLD notes ----

Richard sent some links in Skype. We need to either:

a) use Mek's IIIF reader, or
b) use the bookreader, get the JSON file (which could be decentralized), and then use the way inside JSIA to get the page.

view source here: https://archive.org/stream/10_PRINT_121114#page/n0/mode/2up

datafile: https://ia902603.us.archive.org/BookReader/BookReaderJSIA.php?id=10_PRINT_121114&itemPath=/4/items/10_PRINT_121114&server=ia902603.us.archive.org&format=jsonp&subPrefix=10_PRINT_121114&version=aHe9koCh&callback=jQuery11020015266682030218304_1515798833273&_=1515798833274

bookreader initialization library: https://archive.org/bookreader/BookReaderJSIA.js?v=aHe9koCh

image example (90 kB): https://ia902603.us.archive.org/BookReader/BookReaderImages.php?zip=/4/items/10_PRINT_121114/10_PRINT_121114_jp2.zip&file=10_PRINT_121114_jp2/10_PRINT_121114_0000.jp2&scale=4&rotate=0
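To make the parts of the image URL above easier to see, here is a minimal Python sketch that splits it into host, path, and query parameters. The function name `parse_image_url` is made up for illustration; the parameter names (`zip`, `file`, `scale`, `rotate`) come from the example URL itself, and nothing here confirms that every item follows the same pattern.

```python
from urllib.parse import urlsplit, parse_qs


def parse_image_url(url):
    """Split a BookReaderImages.php URL into host, path and query parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; each key appears once in the example,
    # so keep only the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"host": parts.netloc, "path": parts.path, **params}


example = (
    "https://ia902603.us.archive.org/BookReader/BookReaderImages.php"
    "?zip=/4/items/10_PRINT_121114/10_PRINT_121114_jp2.zip"
    "&file=10_PRINT_121114_jp2/10_PRINT_121114_0000.jp2&scale=4&rotate=0"
)
info = parse_image_url(example)
print(info["host"])  # the datanode serving the page image
print(info["zip"])   # path of the zip of JP2 pages inside the item
print(info["file"])  # the single page within that zip
```

Seen this way, the zip path and the page file are the two pieces a decentralized or offline fetch would need to resolve some other way.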

mitra42 commented 5 years ago

Notes from revisiting this:

Open questions:
[ ] How to view PDFs, and/or how to make the .jpg's

Research steps:
[ ] Look at https://github.com/internetarchive/bookreader BookReaderDemo/demo-simple.html and BookReaderJSSimple.js

What I've found: the main JSON control file is [https://openlibrary.org/query.json?type=/type/edition&*=&ocaid=zandvoort.newspapers.1992.zandvoorts.nieuwsblad&callback=jQuery110207786013323137531_1545886524531&_=1545886524532], which says it's application/javascript but is actually application/json. [Question posed to Richard] It's not clear to me how to pass this to bookreader.

It contains URLs like [https://ia802605.us.archive.org/BookReader/BookReaderImages.php?zip=/9/items/zandvoort.newspapers.1992.zandvoorts.nieuwsblad/1992.Zandvoorts.Nieuwsblad_jp2.zip&file=1992.Zandvoorts.Nieuwsblad_jp2/1992.Zandvoorts.Nieuwsblad_0000.jp2] for page 0. It's not clear to me whether these are formulaic, but that probably doesn't matter. dweb-mirror should be able to pull the zip and then edit the URLs in the control file before passing it to bookreader; dweb-archive would also have to intercept where BookReader fetches these files.
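The "edit the URLs in the control file" step might look roughly like the Python sketch below. It is only an illustration: the shape of the control data (`pages`, `uri`) and the local address `localhost:4244` are assumptions for the example, not something confirmed by bookreader or dweb-mirror. The sketch walks any nested JSON structure and repoints BookReaderImages.php URLs at a local server while keeping the path and query unchanged.

```python
from urllib.parse import urlsplit, urlunsplit

LOCAL_SCHEME = "http"
LOCAL_HOST = "localhost:4244"  # assumed address of the local mirror server
IMAGE_ENDPOINT = "/BookReader/BookReaderImages.php"


def localize(value):
    """Recursively copy JSON-like data, repointing image URLs at the local server."""
    if isinstance(value, dict):
        return {key: localize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [localize(item) for item in value]
    if isinstance(value, str) and IMAGE_ENDPOINT in value:
        parts = urlsplit(value)
        # Keep path and query (zip=..., file=...) so the local server can
        # find the same page inside its own copy of the zip.
        return urlunsplit(
            (LOCAL_SCHEME, LOCAL_HOST, parts.path, parts.query, parts.fragment)
        )
    return value


# Hypothetical control data, shaped only for this example.
control = {
    "title": "x",
    "pages": [
        {
            "uri": "https://ia802605.us.archive.org/BookReader/BookReaderImages.php"
            "?zip=/9/items/x/x_jp2.zip&file=x_jp2/x_0000.jp2"
        }
    ],
}
localized = localize(control)
print(localized["pages"][0]["uri"])
```

This only covers the dweb-mirror case, where a local server can answer the rewritten URLs; it does not address intercepting fetches inside BookReader for dweb-archive.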

There is a strange URL [https://openlibrary.org/query.json?type=/type/edition&*=&ocaid=WillieLynchLetter1712&callback=jQuery11020018995238347655485_1545885427175&_=1545885427176] which says it's application/json but actually returns application/javascript.
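Since the declared content type does not reliably match the body for these query.json responses, one defensive approach is to ignore the header and accept either plain JSON or a JSONP wrapper. The Python sketch below is an assumption about how that could be handled, not how dweb-archive does it; `parse_json_or_jsonp` is a hypothetical helper name.

```python
import json
import re

# Matches callbackName( ...payload... ) with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_json_or_jsonp(text):
    """Parse a response body that may be plain JSON or a JSONP callback call."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = _JSONP.match(text)
        if match is None:
            raise  # neither JSON nor a recognisable JSONP wrapper
        return json.loads(match.group(1))


plain = parse_json_or_jsonp('[{"ocaid": "x"}]')
wrapped = parse_json_or_jsonp(
    'jQuery110207786013323137531_1545886524531([{"ocaid": "x"}]);'
)
print(plain == wrapped)
```

Dropping the `callback` parameter from the request may also avoid the wrapper entirely, but that is untested here.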

Options

  1. fetch PDF etc. and view in an iframe - need to figure out supported formats
  2. get images as files - need to figure out how to find image URLs like the ones above, how to sync those, and then pass them to bookreader
  3. get the zipfile and the JSON, edit the JSON to use local URLs and pass it to bookreader, and/or intercept where bookreader pulls the files (the latter would be hard or impossible, as it runs in an iframe)

mitra42 commented 5 years ago

I’m trying to figure out a strategy to do this in both the Dweb and offline cases; it's tricky in both.

For dweb.archive.org I think I have to ….

For dweb-mirror (offline) where there is a local server.

mitra42 commented 5 years ago

Done: ./crawl.js --level all zandvoort.newspapers.1992.zandvoorts.nieuwsblad, but it missed the big files (>700 MB for the zip).

mitra42 commented 5 years ago

(Note to self - see EN/Dweb - Archive - Text)

mitra42 commented 5 years ago

For an example of a text item with multiple "books", try https://archive.org/details/ialerequestsummary. Each book is one page.

mitra42 commented 5 years ago

EDITED: Background info: multipage books such as thetaleofpeterra14838gut or alicesadventures19033gut are reasonably small but are displaying as a slide carousel. [https://archive.org/search.php?query=mediatype:texts%20AND%20imagecount:8] shows small ones, and unitednov65unit is an example.

mitra42 commented 5 years ago

[ ] Figure out what switches between slide carousel and bookreader

mitra42 commented 5 years ago

From Jeff Kaplan: Typically, if an item is mediatype=texts and there is an ABBYY file and a PDF file, then it will result in a bookreader presentation. Loose images would not result in a PDF or bookreader presentation. And an item with ABBYY and PDF files that is not mediatype=texts would have no bookreader presentation; it would need to be mediatype=texts.
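Jeff's rule of thumb can be written down as a small predicate, sketched here in Python. This is only a paraphrase of his description ("typically"), not the actual archive.org logic, and it rests on two assumptions: that the OCR file can be recognised by "abbyy" in its filename and the PDF by a `.pdf` extension. The filenames in the example are invented.

```python
def uses_bookreader(mediatype, filenames):
    """Rule of thumb: texts items with both an ABBYY file and a PDF get bookreader."""
    names = [name.lower() for name in filenames]
    has_abbyy = any("abbyy" in name for name in names)      # assumed naming
    has_pdf = any(name.endswith(".pdf") for name in names)  # assumed naming
    return mediatype == "texts" and has_abbyy and has_pdf


# Invented file lists for illustration.
scanned_book = ["book_abbyy.gz", "book.pdf", "book_jp2.zip"]
loose_images = ["page1.jpg", "page2.jpg"]

print(uses_bookreader("texts", scanned_book))  # True
print(uses_bookreader("texts", loose_images))  # False
print(uses_bookreader("image", scanned_book))  # False
```

Items that fail this test would presumably fall back to the slide carousel, which is the case the Peter Rabbit example is about.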

mitra42 commented 5 years ago

See #109 for a failure case (Peter Rabbit) that should use the slide carousel.