Improve and (somewhat) standardise the annotations system?

Flattening the Mona Lisa example to be more in line with the Storm Crawler approach:

{
  "collections" : [
    {    
      "field_collection": "Wikipedia",
      "field_collections": [ "Wikipedia" ],
      "field_subject": [ "Crowdsourcing" ],

      "subdomains": "en.wikipedia.org"
    },
    {
      "field_collection": "Wikipedia",
      "field_collections": [ "Wikipedia", "Wikipedia|Main Site", "Wikipedia|Main Site|Mona Lisa" ],
      "field_subject": [ "Crowdsourcing" ],

      "resource": "http://en.wikipedia.org/wiki/Mona_Lisa",
      "dateRange": {
        "start": "1970-01-01T00:00:00.000+0000",
        "end": "9999-12-30T23:59:59.999+0000"
      }
    },
...

By stating the fields explicitly with the field_ prefix and treating all the filters (subdomains, resource, source_file_matches, dateRange) the same way, the code should be simple.
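A rough sketch of how simple that could be, in Python for brevity rather than as a proposal for the actual code. Everything here is an assumption made for illustration: the helper names (split_entry, annotate, FILTERS), the document key "url", and the filter semantics (subdomains as a host-suffix match, resource as an exact URL match). The dateRange and source_file_matches filters are left out of the sketch.

```python
from urllib.parse import urlparse

FIELD_PREFIX = "field_"


def split_entry(entry):
    """Split one collections entry into (fields to add, filters to test)."""
    fields = {k[len(FIELD_PREFIX):]: v
              for k, v in entry.items() if k.startswith(FIELD_PREFIX)}
    filters = {k: v for k, v in entry.items()
               if not k.startswith(FIELD_PREFIX)}
    return fields, filters


# Hypothetical filter semantics; the real ones may differ.
def _subdomains(value, doc):
    host = urlparse(doc["url"]).hostname or ""
    return host == value or host.endswith("." + value)


def _resource(value, doc):
    return doc["url"] == value


# Every filter has the same shape: (configured value, document) -> bool.
FILTERS = {"subdomains": _subdomains, "resource": _resource}


def annotate(doc, entries):
    """Add the field_ values of every entry whose filters all match."""
    for entry in entries:
        fields, filters = split_entry(entry)
        # Filters missing from FILTERS (e.g. dateRange) raise KeyError here.
        if all(FILTERS[name](value, doc) for name, value in filters.items()):
            for name, value in fields.items():
                values = value if isinstance(value, list) else [value]
                target = doc.setdefault(name, [])
                target.extend(v for v in values if v not in target)
    return doc
```

Because every key without the field_ prefix is treated as a filter, adding a new filter type only means registering one more function in FILTERS.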

Extending with regexps is a bit tricky, if we want to be flexible. While URL-matching is probably the prime candidate, why not allow matching on any field (assuming the annotations get applied after the usual extraction)? Maybe url_norm or url as default for brevity and an optional explicit one?

{
  "collections" : [
    {
      "field_collection": "MyMatcher",

      "includePatterns": [ ".*kittens.*" ],
      "customPatterns": [
        {
          "source_field": "content_type_norm",
          "pattern": "html"
        }
      ]
    },
...

The problem here is that there is an implicit AND between the outer filters and an implicit OR between the elements in the pattern lists. That's confusing. It is also not obvious when arrays are used and when they are not.
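To make the implied semantics concrete, here is a small Python sketch of how the MyMatcher entry would be evaluated under that reading. It is an illustration only: the function name matches, the choice of url_norm as the default field, and whole-value matching (re.fullmatch, similar to Java's String.matches) are all assumptions.

```python
import re

DEFAULT_FIELD = "url_norm"  # assumed default when no source_field is given


def matches(entry, doc):
    """OR within each pattern list, AND between the lists."""
    checks = []
    include = entry.get("includePatterns")
    if include is not None:
        value = doc.get(DEFAULT_FIELD, "")
        # Any one of the include patterns is enough (implicit OR).
        checks.append(any(re.fullmatch(p, value) for p in include))
    custom = entry.get("customPatterns")
    if custom is not None:
        # Any one of the custom patterns is enough (implicit OR).
        checks.append(any(
            re.fullmatch(c["pattern"],
                         doc.get(c.get("source_field", DEFAULT_FIELD), ""))
            for c in custom
        ))
    # Every filter that is present has to hold (implicit AND).
    return all(checks)
```

Note that an entry without any filters matches every document in this sketch, which may or may not be the wanted behaviour.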

ukwa / webarchive-discovery #179