internetarchive / dweb-mirror

Offline Internet Archive project
https://www-dweb-mirror.dev.archive.org/
GNU Affero General Public License v3.0

internetarchive module of IIAB doesn't show some of the files available on archive.org #365

Closed by vlnn 1 year ago

vlnn commented 1 year ago

I can open https://archive.org/details/1984-06-computegazette/ , but not http://box.local:4244/details/1984-06-computegazette/ , as the latter emits this log:

Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.374Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReader/plugins/plugin.resume.js?v=891b5f7
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.375Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReader/plugins/plugin.archive_analytics.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.376Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReaderHelpers.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.378Z dweb-mirror:mirrorHttp STARTING: /bookreader/LendingFlow.js
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReader/plugins/plugin.resume.js?v=891b5f7 - 304 - 5.824 ms
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReader/plugins/plugin.archive_analytics.js - 304 - 5.448 ms
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReaderHelpers.js - 304 - 5.196 ms
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/LendingFlow.js - 304 - 4.828 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.384Z dweb-mirror:mirrorHttp STARTING: /bookreader/BookReaderJSIA.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/bookreader/BookReader/plugins/plugin.resume.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/bookreader/BookReader/plugins/plugin.archive_analytics.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/bookreader/BookReaderHelpers.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.385Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/bookreader/LendingFlow.js
Nov 13 20:59:53 box internetarchive[1138]: GET /bookreader/BookReaderJSIA.js - 304 - 2.014 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.388Z dweb-mirror:mirrorHttp STARTING: /dweb-archive-bundle.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.388Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/bookreader/BookReaderJSIA.js
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.398Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 20:59:53 box internetarchive[1138]: GET /info - 304 - 1.238 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.445Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 20:59:53 box internetarchive[1138]: GET /info - 304 - 1.351 ms
Nov 13 20:59:53 box internetarchive[1138]: 2022-11-14T01:59:53.454Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 20:59:53 box internetarchive[1138]: GET /info - 304 - 1.520 ms
Nov 13 21:00:00 box internetarchive[1138]: 2022-11-14T02:00:00.164Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/dweb-archive-bundle.js
Nov 13 21:00:00 box internetarchive[1138]: GET /dweb-archive-bundle.js - 200 4340209 2.307 ms
Nov 13 21:00:01 box CRON[3711]: pam_unix(cron:session): session opened for user www-data(uid=33) by (uid=0)
Nov 13 21:00:01 box CRON[3712]: (www-data) CMD ([ -x /usr/share/awstats/tools/update.sh ] && /usr/share/awstats/tools/update.sh)
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.230Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 21:00:02 box internetarchive[1138]: GET /info - 304 - 1.282 ms
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.252Z dweb-mirror:mirrorHttp STARTING: /components/manage/manage.css
Nov 13 21:00:02 box internetarchive[1138]: GET /components/manage/manage.css - 304 - 2.025 ms
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.255Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/components/manage/manage.css
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.257Z dweb-mirror:mirrorHttp STARTING: /languages/english.json
Nov 13 21:00:02 box internetarchive[1138]: GET /languages/english.json - 304 - 1.704 ms
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.260Z dweb-mirror:mirrorHttp sent file /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archive-dist/languages/english.json
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.305Z dweb-mirror:mirrorHttp STARTING: /metadata/1984-06-computegazette
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.311Z dweb-archivecontroller:ArchiveItem getting metadata for 1984-06-computegazette
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.312Z dweb-transports Fetching https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette via HTTP
Nov 13 21:00:02 box internetarchive[1138]: 2022-11-14T02:00:02.313Z dweb-transports:httptools p_httpfetch: https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette ''
Nov 13 21:00:02 box CRON[3711]: pam_unix(cron:session): session closed for user www-data
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.201Z dweb-transports:httptools Fetch of https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette opened
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.209Z dweb-transports Fetching https://www-dweb-cors.dev.archive.org/metadata/1984-06-computegazette via HTTP succeeded NaN bytes
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.209Z dweb-archivecontroller:ArchiveItem metadata for 1984-06-computegazette fetched successfully
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.225Z dweb-transports:httptools p_httpfetch: https://www-dweb-cors.dev.archive.org/BookReader/BookReaderJSIA.php?subPrefix=Compute_Gazette_Issue_12_1984_Jun&server=www-dweb-cors.dev.archive.org&audioLinerNotes=0&id=1984-06-computegazette&itemPath=%2F15%2Fitems%2F1984-06-computegazette&format=json&requestUri=%2Fdetails%2F1984-06-computegazette ''
Nov 13 21:00:03 box internetarchive[1138]: 2022-11-14T02:00:03.665Z dweb-transports:httptools Fetch of https://www-dweb-cors.dev.archive.org/BookReader/BookReaderJSIA.php?subPrefix=Compute_Gazette_Issue_12_1984_Jun&server=www-dweb-cors.dev.archive.org&audioLinerNotes=0&id=1984-06-computegazette&itemPath=%2F15%2Fitems%2F1984-06-computegazette&format=json&requestUri=%2Fdetails%2F1984-06-computegazette opened
Nov 13 21:00:04 box internetarchive[1138]: 2022-11-14T02:00:04.123Z dweb-transports:httptools GET Uncaught error in callback TypeError: Cannot convert undefined or null to object
Nov 13 21:00:04 box internetarchive[1138]:     at /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-archivecontroller/ArchiveItem.js:279:20
Nov 13 21:00:04 box internetarchive[1138]:     at /opt/iiab/internetarchive/node_modules/@internetarchive/dweb-transports/httptools.js:185:9
Nov 13 21:00:04 box internetarchive[1138]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Nov 13 21:00:09 box internetarchive[1138]: 2022-11-14T02:00:09.404Z dweb-transports:httptools p_httpfetch: https://dweb.me/info ''
Nov 13 21:00:10 box internetarchive[1138]: 2022-11-14T02:00:10.122Z dweb-transports:httptools Fetch of https://dweb.me/info opened
Nov 13 21:00:14 box internetarchive[1138]: 2022-11-14T02:00:14.494Z dweb-mirror:mirrorHttp STARTING: /info
Nov 13 21:00:14 box internetarchive[1138]: GET /info - 304 - 5.116 ms

and shows a page with only "Loading 1984-06-computegazette".

mitra42 commented 1 year ago

The link above that failed goes to a box-specific link; that link returns data, but with a 501 status. Interestingly, it has Content-Type application/json, but Firefox is not treating it as JSON, maybe because of the 501 error.

Reported on the Internet Archive internal Slack. Not sure whether this needs fixing in dweb-cors or will get fixed internally at IA.

vlnn commented 1 year ago

Thanks a ton!

mitra42 commented 1 year ago

wget -qO- --server-response "https://ia600501.us.archive.org/BookReader/BookReaderJSIA.php?subPrefix=Compute_Gazette_Issue_12_1984_Jun&server=www-dweb-cors.dev.archive.org&audioLinerNotes=0&id=1984-06-computegazette&itemPath=%2F15%2Fitems%2F1984-06-computegazette&format=json&requestUri=%2Fdetails%2F1984-06-computegazette&itemPath=/15/items/1984-06-computegazette"

returns

  HTTP/1.1 200 OK
  Server: nginx/1.18.0 (Ubuntu)
  Date: Thu, 01 Dec 2022 23:23:13 GMT
  Content-Type: application/x-javascript
  Transfer-Encoding: chunked
  Connection: keep-alive
  Access-Control-Allow-Origin: 
  Access-Control-Allow-Credentials: true
  Strict-Transport-Security: max-age=15724800
  Referrer-Policy: no-referrer-when-downgrade

which is wrong because the body is JSON, yet the Content-Type is application/x-javascript.

However curl -I "https://ia600501.us.archive.org/BookReader/BookReaderJSIA.php?subPrefix=Compute_Gazette_Issue_12_1984_Jun&server=www-dweb-cors.dev.archive.org&audioLinerNotes=0&id=1984-06-computegazette&itemPath=%2F15%2Fitems%2F1984-06-computegazette&format=json&requestUri=%2Fdetails%2F1984-06-computegazette&itemPath=/15/items/1984-06-computegazette"

returns

HTTP/2 501 
server: nginx/1.18.0 (Ubuntu)
date: Thu, 01 Dec 2022 21:26:28 GMT
content-type: application/json
strict-transport-security: max-age=15724800

which is wrong because it has a status of 501.

Both appear to return valid JSON.

No joy getting the bug dealt with at Internet Archive, so I'll see if I can patch a workaround.

mitra42 commented 1 year ago

dweb-archivecontroller:ArchiveItem#278 is seeing a "res" which is a buffer rather than an object. I could fix it there by turning the buffer into JSON, but let's go upstream. This comes from dweb-transports/httptools#p_GET#198 -> GET -> _GET -> p_httpfetch, which at line 117 checks for Content-Type text/json and so of course doesn't convert the erroneous result.
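The failure mode described above can be illustrated with a minimal sketch. This is not the actual httptools.js or ArchiveItem.js code: `decodeBody` and `toObject` are hypothetical names, the list of content types treated as JSON is an assumption, and the sample payload is made up. It shows how a JSON body served with the wrong Content-Type could reach the caller as a Buffer, and how a caller might defensively convert it.

```javascript
// Hypothetical sketch, not the real dweb-transports code.
// Assumption: the fetch layer only parses bodies whose Content-Type says JSON.
const JSON_TYPES = ["application/json", "text/json"];

function decodeBody(contentType, buffer) {
  // Strip parameters such as "; charset=utf-8" before comparing.
  const type = (contentType || "").split(";")[0].trim().toLowerCase();
  if (JSON_TYPES.includes(type)) {
    return JSON.parse(buffer.toString("utf8"));
  }
  // Any other type (e.g. application/x-javascript) is returned unparsed.
  return buffer;
}

// Defensive conversion a caller could apply when it expects an object.
function toObject(res) {
  if (Buffer.isBuffer(res)) {
    // Throws SyntaxError if the bytes are not valid JSON.
    return JSON.parse(res.toString("utf8"));
  }
  return res;
}

// Made-up sample payload, labelled the way the wget response above was.
const body = Buffer.from('{"example": 1}');
const mislabelled = decodeBody("application/x-javascript", body);
console.log(Buffer.isBuffer(mislabelled)); // true: the caller gets a Buffer
console.log(Object.keys(toObject(mislabelled))); // [ 'example' ]
```

Converting in the caller would hide the symptom in one place only, which may be why fixing the header upstream was preferred.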

Best to fix as close to the problem as possible, and since this code uses dweb-cors to get a cleaned-up IA API, I'll fix it there.

dweb-cors#cors.js: added an option to switch the Content-Type header. That fixed this bug, but there is at least one more bug to fix to get this page to work. I will look at it later.
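The actual cors.js change is not shown in this thread, so the following is only a rough sketch of what a proxy option that switches the Content-Type header might look like. `rewriteHeaders` and `forceContentType` are hypothetical names, not the ones used in dweb-cors.

```javascript
// Hypothetical sketch of a header-rewriting option in a CORS proxy.
// `forceContentType` is an illustrative option name, not the one in cors.js.
function rewriteHeaders(upstreamHeaders, opts = {}) {
  const out = {};
  for (const [name, value] of Object.entries(upstreamHeaders)) {
    out[name.toLowerCase()] = value; // normalise header names to lower case
  }
  if (opts.forceContentType) {
    // Replace whatever type the upstream server claimed.
    out["content-type"] = opts.forceContentType;
  }
  return out;
}

const fixed = rewriteHeaders(
  { "Content-Type": "application/x-javascript", "Connection": "keep-alive" },
  { forceContentType: "application/json" }
);
console.log(fixed["content-type"]); // application/json
```

With the header corrected at the proxy, a client that parses only JSON-typed responses would need no change.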