ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Improve and (somewhat) standardise the annotations system? #179

Open anjackson opened 6 years ago

anjackson commented 6 years ago

The Storm Crawler uses a nice syntax to annotate during indexing, whereas ours (example here) seems clumsy by comparison. Perhaps ours can be simplified and brought more into line with this approach?

The main problem appears to be that we specify the top-level collection in a similar way, but also allow membership of multiple collections.

tokee commented 6 years ago

Flattening the Mona Lisa example to be more in line with the Storm Crawler approach:

{
  "collections" : [
    {    
      "field_collection": "Wikipedia",
      "field_collections": [ "Wikipedia" ],
      "field_subject": [ "Crowdsourcing" ],

      "subdomains": "en.wikipedia.org"
    },
    {
      "field_collection": "Wikipedia",
      "field_collections": [ "Wikipedia", "Wikipedia|Main Site", "Wikipedia|Main Site|Mona Lisa" ],
      "field_subject": [ "Crowdsourcing" ],

      "resource": "http://en.wikipedia.org/wiki/Mona_Lisa",
      "dateRange": {
        "start": "1970-01-01T00:00:00.000+0000",
        "end": "9999-12-30T23:59:59.999+0000"
      }
    },
...

By stating the output fields explicitly with the field_ prefix and treating all the filters (subdomains, resource, source_file_matches, dateRange) the same way, the code should be simple.
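A rough sketch of how such flattened rules could be applied, just to illustrate the idea. It is written in Python for brevity and is not the project's code: the document keys `url` and `crawl_date`, the helper names, and the choice to store every `field_` value as a multi-valued list are all assumptions, and `source_file_matches` is left out because its semantics are not spelled out here.

```python
from datetime import datetime, timezone
from urllib.parse import urlsplit

DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"  # matches e.g. 1970-01-01T00:00:00.000+0000


def matches_subdomains(value, doc):
    # Assumption: the host itself or any host below it counts as a match.
    host = urlsplit(doc["url"]).hostname or ""
    return host == value or host.endswith("." + value)


def matches_resource(value, doc):
    # Assumption: exact URL comparison, no normalisation.
    return doc["url"] == value


def matches_date_range(value, doc):
    start = datetime.strptime(value["start"], DATE_FORMAT)
    end = datetime.strptime(value["end"], DATE_FORMAT)
    return start <= doc["crawl_date"] <= end


# Every filter is looked up the same way, so adding one is a single entry here.
FILTERS = {
    "subdomains": matches_subdomains,
    "resource": matches_resource,
    "dateRange": matches_date_range,
}


def annotate(doc, rules):
    """Copy the field_* values of every rule whose filters all match."""
    for rule in rules:
        checks = [(FILTERS[k], v) for k, v in rule.items() if k in FILTERS]
        # A rule without any filters matches every document.
        if all(check(value, doc) for check, value in checks):
            for key, value in rule.items():
                if not key.startswith("field_"):
                    continue
                target = doc.setdefault(key[len("field_"):], [])
                for item in value if isinstance(value, list) else [value]:
                    if item not in target:
                        target.append(item)
    return doc


# The two rules from the example above.
RULES = [
    {
        "field_collection": "Wikipedia",
        "field_collections": ["Wikipedia"],
        "field_subject": ["Crowdsourcing"],
        "subdomains": "en.wikipedia.org",
    },
    {
        "field_collection": "Wikipedia",
        "field_collections": ["Wikipedia", "Wikipedia|Main Site",
                              "Wikipedia|Main Site|Mona Lisa"],
        "field_subject": ["Crowdsourcing"],
        "resource": "http://en.wikipedia.org/wiki/Mona_Lisa",
        "dateRange": {"start": "1970-01-01T00:00:00.000+0000",
                      "end": "9999-12-30T23:59:59.999+0000"},
    },
]

example = annotate(
    {"url": "http://en.wikipedia.org/wiki/Mona_Lisa",
     "crawl_date": datetime(2015, 1, 1, tzinfo=timezone.utc)},
    RULES,
)
```

Both rules match the Mona Lisa page, so `example` ends up with the merged, de-duplicated values from both. A real implementation would still have to decide how single-valued fields such as `collection` should behave when several rules match.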

Extending with regexps is a bit tricky if we want to be flexible. While URL matching is probably the prime candidate, why not allow matching on any field (assuming the annotations get applied after the usual extraction)? Maybe use url_norm or url as the default for brevity, with an optional explicit field?

{
  "collections" : [
    {
      "field_collection": "MyMatcher",

      "includePatterns": [ ".*kittens.*" ],
      "customPatterns": [
        {
          "source_field": "content_type_norm",
          "pattern": "html"
        }
      ]
    },
...

The problem here is that there is an implicit AND between the outer filters and an implicit OR between the elements in the pattern lists. That's confusing. It is also not obvious when arrays are used and when they are not.
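To make the implicit logic concrete, here is a small sketch of how the MyMatcher rule would be evaluated under that reading: OR inside each pattern list, AND across the lists. It is only an illustration in Python. The function names, the use of `url` as the default source field, and unanchored matching via `re.search` are assumptions, not decided behaviour.

```python
import re

DEFAULT_SOURCE_FIELD = "url"  # assumption: could just as well be url_norm


def rule_matches(rule, doc):
    """Evaluate the pattern filters of one rule against one document."""
    checks = []

    if "includePatterns" in rule:
        text = doc.get(DEFAULT_SOURCE_FIELD, "")
        # Implicit OR: one matching pattern in the list is enough.
        checks.append(any(re.search(p, text) for p in rule["includePatterns"]))

    if "customPatterns" in rule:
        # Implicit OR again, but each entry names its own source field.
        checks.append(any(
            re.search(c["pattern"], doc.get(c["source_field"], ""))
            for c in rule["customPatterns"]
        ))

    # Implicit AND: every filter that is present has to pass.
    return all(checks)


my_matcher = {
    "field_collection": "MyMatcher",
    "includePatterns": [".*kittens.*"],
    "customPatterns": [
        {"source_field": "content_type_norm", "pattern": "html"},
    ],
}

html_page = {"url": "http://example.org/kittens.html", "content_type_norm": "html"}
image = {"url": "http://example.org/kittens.png", "content_type_norm": "image"}
```

With these inputs `rule_matches(my_matcher, html_page)` is true, while `image` fails because the URL pattern passes but the content type pattern does not. Nothing in the JSON itself signals that difference between the two levels, which is the source of the confusion.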