qri-io / walk

Webcrawler/sitemapper

Spec out Ideal interface between Scanner & Walk #16

Open b5 opened 5 years ago

b5 commented 5 years ago

based on our Wednesday coordinating call! cc: @Mr0grog, @lightandluck, @Frijol, @jsnshrmn, @danielballan

Let's use this issue to conceive of Walk as a service with an API that Scanner uses to grab snapshots.

Mr0grog commented 5 years ago

First, a quick recap on how Scanner handles and imports data.

The web-monitoring-db project manages all the indexable metadata we have through a REST API. It doesn’t have any logic for interacting with external sources (e.g. Wayback, Versionista) itself — instead, new metadata for versions of pages comes in via a POST to /api/v0/imports. Separate scripts run as cron jobs and use APIs or scrape other services to load data, reformat it, and POST it to the imports endpoint. That metadata has to include a URL for where to retrieve the raw response body of the version and a SHA-256 hash of that data. If the URL is in an acceptable, publicly readable location, -db just stores the URL (so other users can access it later for diffing, etc). If not, -db downloads the content from the URL, verifies it against the hash, and stores it (there are multiple storage providers, but in production we use a publicly readable S3 bucket). (What URLs are acceptable is managed by configuration.)

The imports endpoint accepts data as a JSON array or a newline-delimited JSON stream (using the application/x-json-stream content type). The stream is preferred because it’s more efficient to work with. Each POST can have any number of versions, but to keep things manageable, we typically chunk them into no more than 1,000 entries per POST. You can also specify some options via a querystring (different sources have different shortcomings and need slightly different treatment when importing). More docs at https://api.monitoring.envirodatagov.org/#/imports, though we haven’t documented it super well. Improvements to those docs are tracked at edgi-govdata-archiving/web-monitoring-db#429.

The metadata for each import record generally needs to look like:

{
  // URL of the page that was captured
  "page_url": "https://www.hhs.gov/climate/",
  // the time at which this capture was made (ISO 8601, W3C version)
  "capture_time": "2018-10-08T14:56:36",
  // URL of the raw capture this record represents
  "uri": "http://web.archive.org/web/20181008145636id_/https://www.hhs.gov/climate/",
  // hex-encoded SHA-256 hash of the capture's response body
  "version_hash": "3fe44a54f9b77a12a12550921207d97b832309d6d840ebc5e62ff5f532cd3f12",
  // The title of the page
  "title": "Climate Change and Human Health | HHS.gov",
  // optional list of individuals/orgs who maintain the page
  "page_maintainers": ["NASA"],
  // optional list of tags to add to the page (not the version)
  "page_tags": ["site:NASA - science.nasa.gov"],
  // identifies the source; it's an opaque string
  "source_type": "internet_archive",
  // any kind of useful additional metadata the source can provide
  "source_metadata": {
    "status_code": 200,
    "mime_type": "text/html",
    "encoding": "utf-8",
    "headers": {
      "content-length": "49139",
      "x-varnish": "322830471 321326768",
      "x-content-type-options": "nosniff",
      "content-language": "en",
      "server-timing": "cdn-cache; desc=HIT, edge; dur=44",
      "strict-transport-security": "max-age=31536000;includeSubDomains;preload",
      "x-varnish-cache": "HIT",
      "server": "nginx/1.4.6 (Ubuntu)",
      "content-security-policy": "frame-ancestors 'self' hhs.gov *.hhs.gov",
      "connection": "close",
      "link": "<https://www.hhs.gov/climate/index.html>; rel=\"canonical\",<https://www.hhs.gov/node/4916>; rel=\"shortlink\",<https://plus.google.com/+HHS>; rel=\"publisher\"",
      "date": "Mon, 08 Oct 2018 16:39:13 GMT",
      "x-frame-options": "SAMEORIGIN",
      "x-content-security-policy": "frame-ancestors 'self' hhs.gov *.hhs.gov",
      "x-akamai-transformed": "9 12254 0 pmb=mRUM,3",
      "x-generator": "Drupal 7 (http://drupal.org)"
    },
    "view_url": "http://web.archive.org/web/20181008145636/https://www.hhs.gov/climate/",
    "redirected_url": "https://www.hhs.gov/climate/index.html",
    "redirects": [
      "https://www.hhs.gov/climate/",
      "https://www.hhs.gov/climate/index.html",
      "https://www.hhs.gov/climate/index.html"
    ]
  }
}

The scripts we run as scheduled cron jobs are:

Each of these extracts all the content it can find from a given timeframe in the specified source and formats it for importing as above, then POSTs it to the imports endpoint.
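For concreteness, here's a minimal Go sketch of the POST side of that pattern: read records that have already been formatted as one JSON object per line, chunk them into groups of no more than 1,000, and send each chunk to the imports endpoint as application/x-json-stream. The file name and base URL are placeholders, and authentication, retries, and the querystring options are omitted, so treat this as a shape, not a real importer.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/http"
	"os"
)

// postChunk sends one batch of newline-delimited JSON import records to the
// web-monitoring-db imports endpoint.
func postChunk(baseURL string, lines []string) error {
	var body bytes.Buffer
	for _, l := range lines {
		body.WriteString(l)
		body.WriteString("\n")
	}
	resp, err := http.Post(baseURL+"/api/v0/imports", "application/x-json-stream", &body)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("import POST failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// records.ndjson: one import record (shaped like the example above) per line.
	f, err := os.Open("records.ndjson")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const chunkSize = 1000 // keep each POST to no more than ~1,000 entries
	baseURL := "https://api.monitoring.envirodatagov.org"

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines
	var chunk []string
	for scanner.Scan() {
		chunk = append(chunk, scanner.Text())
		if len(chunk) == chunkSize {
			if err := postChunk(baseURL, chunk); err != nil {
				panic(err)
			}
			chunk = chunk[:0]
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	if len(chunk) > 0 {
		if err := postChunk(baseURL, chunk); err != nil {
			panic(err)
		}
	}
}
```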

What would we need for Walk?

At the very least, Walk needs to provide a URL at which -db or an ETL script can retrieve raw response bodies, e.g.:

GET /captures/{timestamp}/{url}

Since -db really wants the body that would have ultimately been displayed in a browser from a given URL (i.e. after following redirects), I think Walk might want at least two URLs for getting bodies:

# The exact response body for a URL
GET /captures/raw/{timestamp}/{url}

# The *resolved* response body for a URL (that is, the body of the URL that was
# ultimately redirected to) -- this is the URL scanner cares more about.
GET /captures/resolved/{timestamp}/{url}
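To make the raw-vs-resolved split concrete, here's a rough Go sketch of what those two routes could look like inside Walk. Everything in it is hypothetical (the captureStore interface, the route prefixes, the error handling); it's meant to show the shape of the proposal, not an actual piece of Walk's API.

```go
package walkapi

import (
	"net/http"
	"strings"
)

// captureStore is a stand-in for however Walk persists captures; the method
// names and the raw/resolved split are assumptions for this sketch.
type captureStore interface {
	Raw(timestamp, url string) ([]byte, error)      // exact body recorded for url
	Resolved(timestamp, url string) ([]byte, error) // body after following redirects
}

// captureHandler serves /captures/raw/{timestamp}/{url} or
// /captures/resolved/{timestamp}/{url}. Embedding a full URL in the path has
// wrinkles (Go's default mux normalizes repeated slashes, for one), so a real
// implementation would probably escape the URL or use a custom router.
func captureHandler(store captureStore, resolved bool) http.HandlerFunc {
	prefix := "/captures/raw/"
	if resolved {
		prefix = "/captures/resolved/"
	}
	return func(w http.ResponseWriter, r *http.Request) {
		rest := strings.TrimPrefix(r.URL.Path, prefix)
		parts := strings.SplitN(rest, "/", 2)
		if len(parts) != 2 {
			http.Error(w, "expected {timestamp}/{url}", http.StatusBadRequest)
			return
		}
		timestamp, target := parts[0], parts[1]

		lookup := store.Raw
		if resolved {
			lookup = store.Resolved
		}
		body, err := lookup(timestamp, target)
		if err != nil {
			// A missing capture is a plain 404; no redirecting to the
			// closest capture in time the way Wayback does.
			http.NotFound(w, r)
			return
		}
		w.Write(body)
	}
}

// registerCaptureRoutes wires both routes onto a mux.
func registerCaptureRoutes(mux *http.ServeMux, store captureStore) {
	mux.HandleFunc("/captures/raw/", captureHandler(store, false))
	mux.HandleFunc("/captures/resolved/", captureHandler(store, true))
}
```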

In the most ideal case, a special handler/finalizer in Walk could do all the metadata formatting and POSTing to -db’s imports endpoint itself. If that were the case, we’d still need the above URLs, but no special ETL scripts or cron jobs would have to sit in the middle between Walk and Scanner.
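As a sketch of what such a finalizer might produce, here's a hypothetical mapping from a finished Walk capture onto the import record format above. The JSON field names come from the example record earlier in this thread; the Walk-side inputs, the base URL, the "walk" source_type value, and the timestamp layout are all invented for illustration, and only a minimal subset of the fields is shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ImportRecord mirrors a minimal subset of the metadata fields Scanner's
// imports endpoint expects (field names follow the example record above).
type ImportRecord struct {
	PageURL        string                 `json:"page_url"`
	CaptureTime    string                 `json:"capture_time"`
	URI            string                 `json:"uri"`
	VersionHash    string                 `json:"version_hash"`
	Title          string                 `json:"title,omitempty"`
	SourceType     string                 `json:"source_type"`
	SourceMetadata map[string]interface{} `json:"source_metadata,omitempty"`
}

// formatCapture is a hypothetical finalizer step: it maps a finished Walk
// capture onto an import record. walkBaseURL, the timestamp layout, and the
// "walk" source_type are assumptions for illustration.
func formatCapture(walkBaseURL, pageURL, hash, title string, capturedAt time.Time, statusCode int) ImportRecord {
	ts := capturedAt.UTC().Format("20060102150405")
	return ImportRecord{
		PageURL:     pageURL,
		CaptureTime: capturedAt.UTC().Format(time.RFC3339),
		URI:         fmt.Sprintf("%s/captures/raw/%s/%s", walkBaseURL, ts, pageURL),
		VersionHash: hash,
		Title:       title,
		SourceType:  "walk",
		SourceMetadata: map[string]interface{}{
			"status_code": statusCode,
		},
	}
}

func main() {
	rec := formatCapture(
		"http://walk.example.com",
		"https://www.hhs.gov/climate/",
		"3fe44a54f9b77a12a12550921207d97b832309d6d840ebc5e62ff5f532cd3f12",
		"Climate Change and Human Health | HHS.gov",
		time.Date(2018, 10, 8, 14, 56, 36, 0, time.UTC),
		200,
	)
	line, err := json.Marshal(rec)
	if err != nil {
		panic(err)
	}
	// One record per line, ready to be streamed to /api/v0/imports.
	fmt.Println(string(line))
}
```

Each marshaled line could then be streamed to /api/v0/imports in chunks, as in the earlier sketch.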

Otherwise, those ETL scripts will need a way to get metadata for all the captures for a given set of domains or URLs over a given timeframe. I can think of a few ways to do that:

Additionally, we need a way to filter to only the URLs we care about:

One other bit worth noting here is that Scanner uses SHA-256 hashes throughout. Ideally, Walk would just make those available (Wayback is a pain because it uses base 32-encoded SHA-1 hashes, so we actually have to download the raw bodies and hash them ourselves).
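For reference, this is the fallback work the ETL side has to do today for Wayback, and what Walk providing SHA-256 hashes directly would let us skip: fetch the raw body and hash it the way Scanner expects (hex-encoded SHA-256). The capture URL below is hypothetical; any raw-body URL works the same way.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
)

// hashBody fetches a capture URL and returns the hex-encoded SHA-256 of its
// response body, the same form Scanner expects in the version_hash field.
func hashBody(captureURL string) (string, error) {
	resp, err := http.Get(captureURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	h := sha256.New()
	if _, err := io.Copy(h, resp.Body); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// Hypothetical Walk capture URL.
	hash, err := hashBody("http://localhost:8080/captures/raw/20181008145636/https://www.hhs.gov/climate/")
	if err != nil {
		panic(err)
	}
	fmt.Println(hash)
}
```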

Does that help?

Mr0grog commented 5 years ago

Note about raw response body URLs: it would be great if these had headers a lot like Memento (which is how Wayback serves them), but did not replicate Wayback’s behavior of redirecting to the closest-in-time capture when the requested one doesn’t exist or can’t be retrieved/played back.
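To illustrate, here's roughly the kind of header handling that note implies, sketched in Go. The Memento header names follow RFC 7089 and the X-Archive-Orig-* prefix mimics Wayback's convention for echoing originally captured headers, but which headers Walk would actually expose is an open question, so treat this as an assumption.

```go
package walkapi

import (
	"fmt"
	"net/http"
	"time"
)

// writeCaptureHeaders sets Memento-style headers on a raw-capture response.
func writeCaptureHeaders(w http.ResponseWriter, originalURL string, capturedAt time.Time, origHeaders http.Header) {
	// When this capture was taken.
	w.Header().Set("Memento-Datetime", capturedAt.UTC().Format(http.TimeFormat))
	// Link back to the live resource this capture is a memento of.
	w.Header().Set("Link", fmt.Sprintf("<%s>; rel=\"original\"", originalURL))
	// Echo the originally captured response headers under a prefix, similar
	// to Wayback's X-Archive-Orig-* convention.
	for name, values := range origHeaders {
		for _, v := range values {
			w.Header().Add("X-Archive-Orig-"+name, v)
		}
	}
	// Deliberately no redirect logic: if a requested capture doesn't exist,
	// the handler should just 404 rather than picking a nearby one.
}
```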

Mr0grog commented 5 years ago

On the flip side here, if we really needed/wanted Walk not to have to worry about running a persistent TCP or HTTP server, we could work out a way for Scanner to accept the response bodies as part of the POST alongside the metadata. (That's tricky because the bodies can be quite large.)

Frijol commented 5 years ago

Thanks @Mr0grog!

Based on that, and ignoring specific decisions of how anything is done, is this a fair (v zoomed out) listing of requirements to define what Walk is?

REQUIREMENTS (updated per @Mr0grog's comments below)

Core features

Bonus features
Super-extra-someday bonus features

Mr0grog commented 5 years ago

Hmmm, I think there are two main routes to go here:

  1. A walk binary that can run against a list of URLs once and output its metadata and raw captures to disk. Then web-monitoring would write a cron job to run it, followed by a script that uploads the raw bodies to our S3/GCS storage and POSTs the metadata to web-monitoring-db.

    This is exactly how web-monitoring-versionista-scraper is structured — it runs the scrape-versionista script to extract metadata and response bodies from Versionista and then upload-to-s3, upload-to-google, and import-to-db (scrape-versionista-and-upload is a meta-script that just runs those four in a coordinated way). (See the bin dir there.)

  2. Have Walk running as a service reachable by some API (HTTP is nice and easy) and that re-scrapes a list of URLs on a schedule (ideally) or on demand (less ideal). We’d regularly query it for results to upload to web-monitoring-db (or in the ideal case, it would send them to -db itself). This is exactly how web-monitoring-processing works with the Wayback Machine.

What you’re describing above fits with (2), which I definitely agree is better. It’s a lot less work on the web-monitoring team’s side, at least :)

Can take in a CSV of URLs

Is a CSV the right format? (Maybe.) I'm also not sure, in a requirements sense, what it means to take a list of URLs. That partially depends on the “runs on a schedule” bonus feature. Is taking a list of URLs starting a scrape of all those URLs, or is it storing them to be scraped on a repeating schedule? (Does it also need to receive a schedule with that list? How is the schedule/list edited? Lots of related requirements here.)

Is documented for CLI use

If it’s a service, you wouldn’t really use it via CLI (but we should still document its CLI anyway). What would be core in that case would be documenting its remote (HTTP?) API.

Missing requirement: can output metadata from a scrape (not just the raw data).

Missing requirement: can notify or be polled to determine when a requested scrape is complete (how this works is also going to depend on how taking a list of URLs works, see above).

Performs metadata processing needed to input to Scanner (replacing this?)

Maybe? That depends on what you’re getting at, since that repo is really more a grab-bag of several different tools than anything with a single, clear purpose. (It houses ETL scripts for the Wayback Machine, health check scripts for the Wayback Machine, generic diffing routines, a diffing server, generic modules for interacting with the Internet Archive, generic modules for interacting with Page Freezer [defunct], and some tools for analyzing the significance of changes.) It is a confusing thing to reference as a whole, and yes, that is a big problem.

Anyway! If you mean replacing the need for a separate set of ETL scripts that translate between Walk and web-monitoring-db, yes. (That repo has scripts to do this for the Wayback Machine and is probably where we’d put similar scripts for Walk right now.)

Mr0grog commented 5 years ago

Ugh, sorry it’s been so long since I posted this I forgot we explicitly scoped this issue to the idea of a service. So forget what I wrote about (1) above :P

Frijol commented 5 years ago

Just to close the loop on the conversation @Mr0grog and I had re Walk reaching feature parity with the Versionista scrape:

The metadata in https://github.com/qri-io/walk/issues/16#issuecomment-437785099 that @Mr0grog already listed is canonical for what Walk should have, but our notes have a paste of metadata from a Versionista scrape plus extra comments in case they're useful.

Based on that, I think it's going to be more useful if I wait until that PR lands before making a "it needs x, y, z to reach parity" type of issue.