observablehq / feedback

Customer submitted bugs and feature requests

Observable pages saved by the Wayback machine are blank #22

Open jrus opened 3 years ago

jrus commented 3 years ago

Description: Because Observable relies entirely on client-side JavaScript, every notebook page on observablehq.com fails to render when it is cached by a third-party site, including the Wayback Machine.

Steps to Reproduce: Navigate to any Wayback Machine cache of an Observable notebook, e.g. http://web.archive.org/web/20201110101204/https://observablehq.com/@jashkenas/inputs

Expected behavior: Some version of the page should appear, at least containing the basic text output of HTML and Markdown cells, but ideally a version that can function fully standalone, with its JavaScript and other resources included.

Actual results: Blank gray page with no content. (For beta.observablehq.com links, the result was instead an error page.)

Further discussion: Notebooks saved by the Wayback Machine don't need to be editable, or support all of the features of the platform, but it would be great to make them legible in some form. The Wayback Machine is a completely indispensable tool for the web, preserving web history and keeping links between pages meaningful into the future, despite the changing fortunes of web businesses and individual site managers. It is a shame that several years of Observable notebook history have already been excluded from that archive. If (heaven forfend) Observable the company and platform ever disappears, a saved copy in the Wayback Machine would be invaluable.

I'm sure it would be a nontrivial amount of work to serve some meaningful self-contained static version of every notebook to the Wayback Machine's crawlers, but it would be much appreciated by future readers.

j-f1 commented 3 years ago

The main issues seem to be that:

mootari commented 3 years ago

I tried saving embedded notebooks to the Wayback Machine (both the iframe and vanilla JavaScript variants) via a combination of a GitHub gist and raw.githack.com, but both variants failed to either save or display any content apart from the attribution. The vanilla variant seemed to have problems with static imports.

mootari commented 3 years ago

The way I see it, there are three options (each requires detection of the archiver bot):

  1. Serve the notebook's JavaScript source:
    • Will at least allow indexing of raw text.
    • File attachments won't get indexed (they would have to be referenced in the HTML somehow to be detected as sources ...).
    • Might need special rendering to make sense for the viewer?
    • Reference the notebook source explicitly as well, to preserve the unaltered compiled script?
  2. Render static HTML:
    • The thumbnailer already shows that this approach is unreliable. There's no way to tell when a notebook is fully loaded, and timeouts would have to favor Observable's limited resources.
  3. Compile a standalone version:
    • Resolve all transitive notebook imports and embed them into the page.
    • Other scripts should already be referenced explicitly so that the archiver bot can detect and scrape them.
    • This is roughly equivalent to Observable's "Download code" option.
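
To make the two mechanical pieces above concrete, here is a rough Python sketch, not anything Observable actually runs. It shows (a) detecting the archiver bot from the request's User-Agent header, which all three options need, and (b) collecting the transitive imports that option 3 would have to embed. The User-Agent tokens, the function names, and the shape of the `direct_imports` mapping are all assumptions made for illustration; the real crawler strings should be confirmed with the Internet Archive.

```python
from collections import deque

# Assumed User-Agent substrings for the Internet Archive's crawlers.
# These are illustrative; confirm the current values with the Internet Archive.
ARCHIVER_TOKENS = ("archive.org_bot", "ia_archiver")


def is_archiver_bot(user_agent):
    """Return True if the User-Agent header looks like an archive crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in ARCHIVER_TOKENS)


def resolve_transitive_imports(root, direct_imports):
    """Return every notebook reachable from `root` through imports.

    `direct_imports` is a hypothetical mapping from a notebook id to the
    list of notebook ids it imports directly. The result is in
    breadth-first order, excludes `root` itself, and tolerates cycles.
    """
    seen = {root}
    order = []
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in direct_imports.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order
```

With a made-up import graph such as `{"@a/main": ["@b/lib", "@c/util"], "@b/lib": ["@c/util", "@a/main"]}`, calling `resolve_transitive_imports("@a/main", graph)` lists `@b/lib` and `@c/util` once each, even though the graph contains a cycle. A standalone page would then need to inline exactly those notebooks.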

Notes:

j-f1 commented 3 years ago

It might be worth reaching out to the Internet Archive to see if they can help with this.

jrus commented 3 years ago

The IA people are generally quite interested in saving as much content as they can, so I am sure they would be willing to help (answering questions about what their crawler's capabilities are etc.).