anjackson opened 6 years ago
Flattening the Mona Lisa example to be more in line with the Storm Crawler approach:
```json
{
  "collections" : [
    {
      "field_collection": "Wikipedia",
      "field_collections": [ "Wikipedia" ],
      "field_subject": [ "Crowdsourcing" ],
      "subdomains": "en.wikipedia.org"
    },
    {
      "field_collection": "Wikipedia",
      "field_collections": [ "Wikipedia", "Wikipedia|Main Site", "Wikipedia|Main Site|Mona Lisa" ],
      "field_subject": [ "Crowdsourcing" ],
      "resource": "http://en.wikipedia.org/wiki/Mona_Lisa",
      "dateRange": {
        "start": "1970-01-01T00:00:00.000+0000",
        "end": "9999-12-30T23:59:59.999+0000"
      }
    },
    ...
```
By stating the fields explicitly with the `field_` prefix and making the filters (`subdomains`, `resource`, `source_file_matches`, `dateRange`) all equal, the code should be simple.
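To illustrate the point about uniform filters, here is a minimal sketch (names and types are hypothetical, not the project's actual API): if every filter is just a predicate over the flattened record, applying a collection entry reduces to one loop, with no special cases per filter type.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: treat each filter (subdomains, resource, dateRange, ...)
// as a plain predicate over the record's fields, so they can all be applied
// uniformly.
public class CollectionMatcher {

    // Implicit AND: every filter must accept the record.
    public static boolean matches(Map<String, String> record,
                                  List<Predicate<Map<String, String>>> filters) {
        return filters.stream().allMatch(f -> f.test(record));
    }

    public static void main(String[] args) {
        Map<String, String> record = Map.of(
            "host", "en.wikipedia.org",
            "url", "http://en.wikipedia.org/wiki/Mona_Lisa");
        List<Predicate<Map<String, String>>> filters = List.of(
            r -> r.get("host").endsWith("wikipedia.org"),    // subdomains filter
            r -> r.get("url").contains("/wiki/Mona_Lisa"));  // resource filter
        System.out.println(matches(record, filters)); // prints "true"
    }
}
```

Adding a new filter type then means adding one predicate, rather than extending a per-filter dispatch.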
Extending with regexps is a bit tricky, if we want to be flexible. While URL-matching is probably the prime candidate, why not allow matching on any field (assuming the annotations get applied after the usual extraction)? Maybe `url_norm` or `url` as the default for brevity, plus an optional explicit one?
```json
{
  "collections" : [
    {
      "field_collection": "MyMatcher",
      "includePatterns": [ ".*kittens.*" ],
      "customPatterns": [
        {
          "source_field": "content_type_norm",
          "pattern": "html"
        }
      ]
    },
    ...
```
The problem here is that there is an implicit `AND` between the outer filters and an implicit `OR` between the elements in the pattern lists. That's confusing. It is also not obvious when arrays are used and when they are not.
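To make the implied semantics concrete, here is a hedged sketch (field names and values are illustrative, not from the actual codebase): the outer filters are `AND`ed together, while the regexps inside each pattern list are `OR`ed against a single field.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the implied evaluation: OR within a pattern list,
// AND across the outer filters.
public class PatternSemantics {

    // One filter: a field value matched against a list of regexps (implicit OR).
    static boolean anyPatternMatches(String value, List<String> patterns) {
        return patterns.stream().anyMatch(p -> Pattern.matches(p, value));
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of(
            "url_norm", "http://example.org/kittens/page",
            "content_type_norm", "html");

        // Two filters, ANDed together; each filter ORs its own pattern list.
        boolean inCollection =
            anyPatternMatches(doc.get("url_norm"), List.of(".*kittens.*"))
            && anyPatternMatches(doc.get("content_type_norm"), List.of("html"));

        System.out.println(inCollection); // prints "true"
    }
}
```

Whatever syntax is chosen, spelling out this `AND`/`OR` nesting in the documentation would remove the ambiguity.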
The Storm Crawler uses a nice syntax to annotate during indexing, whereas ours (example here) seems clumsy by comparison. Perhaps ours can be simplified and brought more into line with this approach?
The main problem appears to be that we specify the top-level `collection` in a similar way, but also allow membership of multiple `collections`.