internetarchive / dweb-mirror

Offline Internet Archive project
https://www-dweb-mirror.dev.archive.org/
GNU Affero General Public License v3.0
256 stars 27 forks source link

No content shown when searching using dweb-mirror from latest IIAB image #364

Closed vlnn closed 1 year ago

vlnn commented 1 year ago

Testing IIAB's module of internetarchive I can't see the real content. I guess it may be some misconfiguration (either on IIAB's installation or my installation process), but I would appreciate any hints on what to look to.

http://box.local:4244/details/home shows frontpage of Internet Archive. Then I search for e.g. Compute! magazine using search form. Browser returns "Loading search" page and then there's no activity on the page. In logs gathered by journalctl -u internetarchive -f I have suspicious errors on JS-code (see below). I've tried to find similar issues but haven't got any luck. I can reproduce this every time I'm searching via this interface.

Nov 13 07:51:16 box internetarchive[1020]: 2022-11-13T12:51:16.817Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:16 box internetarchive[1020]: GET /info - 304 - 2.583 ms
Nov 13 07:51:18 box internetarchive[1020]: 2022-11-13T12:51:18.611Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:18 box internetarchive[1020]: GET /info - 304 - 2.515 ms
Nov 13 07:51:24 box internetarchive[1020]: 2022-11-13T12:51:24.805Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:24 box internetarchive[1020]: GET /info - 304 - 2.460 ms
Nov 13 07:51:27 box internetarchive[1020]: 2022-11-13T12:51:27.379Z dweb-transports:httptools p_httpfetch: https://dweb.me/info ''
Nov 13 07:51:28 box internetarchive[1020]: 2022-11-13T12:51:28.129Z dweb-transports:httptools Fetch of https://dweb.me/info opened
Nov 13 07:51:31 box internetarchive[1020]: 2022-11-13T12:51:31.084Z dweb-mirror:mirrorHttp STARTING: /advancedsearch.php?output=json&q=compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status
Nov 13 07:51:31 box internetarchive[1020]: 2022-11-13T12:51:31.088Z dweb-transports Fetching https://www-dweb-cors.dev.archive.org/advancedsearch.php?output=json&q=compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status via HTTP
Nov 13 07:51:31 box internetarchive[1020]: 2022-11-13T12:51:31.089Z dweb-transports:httptools p_httpfetch: https://www-dweb-cors.dev.archive.org/advancedsearch.php?output=json&q=compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status ''
Nov 13 07:51:31 box internetarchive[1020]: 2022-11-13T12:51:31.875Z dweb-transports:httptools Fetch of https://www-dweb-cors.dev.archive.org/advancedsearch.php?output=json&q=compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status opened
Nov 13 07:51:31 box internetarchive[1020]: 2022-11-13T12:51:31.879Z dweb-transports Fetching https://www-dweb-cors.dev.archive.org/advancedsearch.php?output=json&q=compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status via HTTP succeeded NaN bytes
Nov 13 07:51:31 box internetarchive[1020]: 2022-11-13T12:51:31.882Z dweb-transports Uncaught error in fetch TypeError: Cannot read properties of undefined (reading 'docs')
Nov 13 07:51:31 box internetarchive[1020]:     at /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archivecontroller/ArchiveItem.js:453:49
Nov 13 07:51:31 box internetarchive[1020]:     at /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-transports/Transports.js:232:44
Nov 13 07:51:31 box internetarchive[1020]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Nov 13 07:51:38 box internetarchive[1020]: 2022-11-13T12:51:38.343Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:38 box internetarchive[1020]: GET /info - 304 - 2.515 ms
Nov 13 07:51:38 box internetarchive[1020]: 2022-11-13T12:51:38.354Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:38 box internetarchive[1020]: GET /info - 304 - 2.695 ms
Nov 13 07:51:46 box internetarchive[1020]: 2022-11-13T12:51:46.658Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:46 box internetarchive[1020]: GET /info - 304 - 3.297 ms
Nov 13 07:51:51 box internetarchive[1020]: 2022-11-13T12:51:51.245Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:51:51 box internetarchive[1020]: GET /info - 304 - 2.749 ms
Nov 13 07:51:57 box internetarchive[1020]: 2022-11-13T12:51:57.381Z dweb-transports:httptools p_httpfetch: https://dweb.me/info ''
Nov 13 07:51:58 box internetarchive[1020]: 2022-11-13T12:51:58.128Z dweb-transports:httptools Fetch of https://dweb.me/info opened
Nov 13 07:52:19 box internetarchive[1020]: 2022-11-13T12:52:19.083Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:52:19 box internetarchive[1020]: GET /info - 304 - 2.704 ms
Nov 13 07:52:19 box internetarchive[1020]: 2022-11-13T12:52:19.989Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:52:19 box internetarchive[1020]: GET /info - 304 - 2.677 ms
Nov 13 07:52:23 box internetarchive[1020]: 2022-11-13T12:52:23.323Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 07:52:23 box internetarchive[1020]: GET /info - 304 - 2.565 ms
Nov 13 07:52:27 box internetarchive[1020]: 2022-11-13T12:52:27.385Z dweb-transports:httptools p_httpfetch: https://dweb.me/info ''
Nov 13 07:52:28 box internetarchive[1020]: 2022-11-13T12:52:28.122Z dweb-transports:httptools Fetch of https://dweb.me/info opened
vlnn commented 1 year ago

(I'm guessing if it should go to dweb-transports repo issues, but would be grateful for confirmation here)

vlnn commented 1 year ago
CleanShot 2022-11-13 at 14 59 04@2x

This is how the module internetarchive was installed to the IIAB system

mitra42 commented 1 year ago

Looks like you've caught a bug, but its obscure. Thanks for the detailed log report it helps track it down. If you look at the "Fetching" line at 7:51 that caused the error it shows a URL https://www-dweb-cors.dev.archive.org/advancedsearch.php?output=json&q=compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status If I try this in a browser link it also fails. Bug #1: Error handling in the callback at ArchiveItem.js:453:49 isn't catching that error (its presuming there is always a "docs" in the response, - so its being reported at dweb-transports/Transports.js:232:44

This dweb-cors is just to work around a problem with accessing across domains, so I tried direct to archive.org. This also fails. Bug #2: Internet archive advanced search isn't handling ! in a search - reported on internal Archive slack.

Note that if you try without the ! e.g. a search for Compute it works though returns a much too wide set to be useful.

holta commented 1 year ago

@vlnn & @mitra42 profound thanks.

Just FYI: Internet-in-a-Box will be publishing refined images in coming weeks, tightening up some core infra after 3 months.

If Internet Archive searching might be improved alongside these new images, or prior to end of year, that would be incredible (as IIAB 8.0 is expected in final form then!)

mitra42 commented 1 year ago

Great - I think the bug has to be fixed at Internet Archive's end, i.e. not in the image, though error handling could be improved.

Note - this project is not actively maintained at this time, though I'm happy to jump in and look at things occasionally.

vlnn commented 1 year ago

I confirm that search for Compute (without !) works as expected, thanks (so, perhaps Bug#2 could go a bit down in priorities now as workaround is documented here). But do I understand correctly that there's a mismatch between what dweb and archive.org provides? Because I can for sure open https://archive.org/details/1984-06-computegazette/ , but not http://box.local:4244/details/1984-06-computegazette/ , as latter emits this log:

Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.374Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReader/plugins/plugin.resume.js?v=891b5f7
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.375Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReader/plugins/plugin.archive_analytics.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.376Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReaderHelpers.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.378Z dweb-mirror:mirrorHttp STARTING: /bookreader/LendingFlow.js
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReader/plugins/plugin.resume.js?v=891b5f7 - 304 - 5.824 ms
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReader/plugins/plugin.archive_analytics.js - 304 - 5.448 ms
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReaderHelpers.js - 304 - 5.196 ms
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/LendingFlow.js - 304 - 4.828 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.384Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReaderJSIA.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/bookreader/BookReader/plugins/plugin.resume.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/bookreader/BookReader/plugins/plugin.archive_analytics.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/bookreader/BookReaderHelpers.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/bookreader/LendingFlow.js
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReaderJSIA.js - 304 - 2.014 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.388Z dweb-mirror:mirrorHttp STARTING: /dweb-archive-bundle.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.388Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/bookreader/BookReaderJSIA.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.398Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 20:59:53 box internetarchive[1138]: GET /info - 304 - 1.238 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.445Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 20:59:53 box internetarchive[1138]: GET /info - 304 - 1.351 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.454Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 20:59:53 box internetarchive[1138]: GET /info - 304 - 1.520 ms
Nov 13 21:00:00 box internetarchive[1138]: 2022-11-14T02:00:00.164Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/dweb-archive-bundle.js
Nov 13 21:00:00 box internetarchive[1138]: GET /dweb-archive-bundle.js - 200 4340209 2.307 ms
Nov 13 21:00:01 box CRON[3711]: pam_unix(cron:session): session opened for user www-data(uid=33) by (uid=0)
Nov 13 21:00:01 box CRON[3712]: (www-data) CMD ([ -x /usr/share/awstats/tools/update.sh ] && /usr/share/awstats/tools/update.sh)
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.230Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 21:00:02 box internetarchive[1138]: GET /info - 304 - 1.282 ms
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.252Z dweb-mirror:mirrorHttp STARTING: /components/manage/manage.css
Nov 13 21:00:02 box internetarchive[1138]: GET /components/manage/manage.css - 304 - 2.025 ms
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.255Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/components/manage/manage.css
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.257Z dweb-mirror:mirrorHttp STARTING: /languages/english.json
Nov 13 21:00:02 box internetarchive[1138]: GET /languages/english.json - 304 - 1.704 ms
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.260Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/languages/english.json
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.305Z dweb-mirror:mirrorHttp STARTING: /metadata/1984-06-computegazette
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.311Z dweb-archivecontroller:ArchiveItem getting metadata for 1984-06-computegazette
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.312Z dweb-transports Fetching https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette via HTTP
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.313Z dweb-transports:httptools p_httpfetch: https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette ''
Nov 13 21:00:02 box CRON[3711]: pam_unix(cron:session): session closed for user www-data
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.201Z dweb-transports:httptools Fetch of https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette opened
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.209Z dweb-transports Fetching https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette via HTTP succeeded NaN bytes
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.209Z dweb-archivecontroller:ArchiveItem metadata for 1984-06-computegazette fetched successfully
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.225Z dweb-transports:httptools p_httpfetch: https://www-dweb-cors.dev.archive.org/BookReader/BookReaderJSIA.php?subPrefix=Compute_Gazette_Issue_12_1984_Jun&server=www-dweb-cors.dev.archive.org&audioLinerNotes=0&id=1984-06-computegazette&itemPath=%2F15%2Fitems%2F1984-06-computegazette&format=json&requestUri=%2Fdetails%2F1984-06-computegazette ''
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.665Z dweb-transports:httptools Fetch of https://www-dweb-cors.dev.archive.org/BookReader/BookReaderJSIA.php?subPrefix=Compute_Gazette_Issue_12_1984_Jun&server=www-dweb-cors.dev.archive.org&audioLinerNotes=0&id=1984-06-computegazette&itemPath=%2F15%2Fitems%2F1984-06-computegazette&format=json&requestUri=%2Fdetails%2F1984-06-computegazette opened
Nov 13 21:00:04 box internetarchive[1138]: 2022-11-14T02:00:04.123Z dweb-transports:httptools GET Uncaught error in callback TypeError: Cannot convert undefined or null to object
Nov 13 21:00:04 box internetarchive[1138]:     at /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archivecontroller/ArchiveItem.js:279:20
Nov 13 21:00:04 box internetarchive[1138]:     at /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-transports/httptools.js:185:9
Nov 13 21:00:04 box internetarchive[1138]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Nov 13 21:00:09 box internetarchive[1138]: 2022-11-14T02:00:09.404Z dweb-transports:httptools p_httpfetch: https://dweb.me/info ''
Nov 13 21:00:10 box internetarchive[1138]: 2022-11-14T02:00:10.122Z dweb-transports:httptools Fetch of https://dweb.me/info opened
Nov 13 21:00:14 box internetarchive[1138]: 2022-11-14T02:00:14.494Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 21:00:14 box internetarchive[1138]: GET /info - 304 - 5.116 ms

and shows page with only Loading 1984-06-computegazette

Is this the same problem or I better open new issue?

mitra42 commented 1 year ago

Could you open that as a new issue - I can confirm its real and I'll take a look at it - looks like bit-rot at the Archive end.

mitra42 commented 1 year ago

Diving into Bug#1: Search for compute (localhost)[http://localhost:4244/advancedsearch.php?output=json&q=Compute!&rows=30&page=1&sort[]=-downloads&and[]=&save=yes&fl=identifier%2Ctitle%2Ccollection%2Cmediatype%2Cdownloads%2Ccreator%2Cnum_reviews%2Cpublicdate%2Citem_count%2Cloans__status__status] Shows error from dweb-transports in log but the error is actually in dweb-archivecontroller/ArchiveItem.js:454:49 which supplies a callback that doesnt handle an error.

Turns out elasticsearch returns an error in the "error" field of the response rather than an HTTP error, handle that in ArchiveItem.js will fix there in archivecontroller:0.2.16

mitra42 commented 1 year ago

FIxed in 0.2.94 which uses dweb-archivecontroller 0.2.16