[Closed] danizen closed this issue 7 years ago
Elasticsearch unfortunately handles dots in field names very differently depending on which version is in use. To the best of my knowledge:
Versions up until 2.0 will accept field names with dots verbatim. "foo.bar" and "foo.another.bar" are two fields of exactly those names in a flat hierarchy.
Versions 2.0–2.3 do not allow dots in field names at all. Trying to add such fields will result in an error.
Version 2.4 has a config flag that allows dots to be used again. When set, the behaviour will be as in versions below 2.0.
Versions 5.0 and above accept dots in field names. However, this results in an entirely different data structure than before. Fields will not be stored with their dotted names in a flat hierarchy. Instead, dotted field names will be expanded as you suggested. While addressing fields using dot notation works, they are actually stored as objects. The above example would thus be stored as { "foo": { "bar": …, "another": { "bar": … } } }. Dotted fields from previous versions need to be re-indexed and will then end up being stored as objects as well.
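To make the 5.x behaviour concrete, here is a minimal sketch (not the committer's actual code) of how dotted field names get expanded into nested objects:

```python
def expand_dotted(doc):
    """Expand dotted field names into nested objects,
    mimicking how Elasticsearch 5+ interprets them."""
    result = {}
    for name, value in doc.items():
        parts = name.split(".")
        node = result
        for part in parts[:-1]:
            # descend, creating intermediate objects as needed
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result

# {"foo.bar": 1, "foo.another.bar": 2}
# expands to {"foo": {"bar": 1, "another": {"bar": 2}}}
```

Note that both dotted names share the same "foo" object after expansion, which is exactly why they no longer live in a flat hierarchy.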
The main issue here is with versions 2.0–2.3 that actively throw errors when encountering fields with dots in their names. I would like to avoid having different versions of this committer target different Elasticsearch versions.
An additional problem is that in Elasticsearch versions 5.0 and above, dotted field names may result in errors as well. Specifically, a document containing both a foo field and a foo.X field cannot be saved to Elasticsearch, as there would be a conflict between the plain field foo and the "rich" field foo: { } (to which Elasticsearch would expand foo.X).
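Such conflicts can be detected before a document is sent. A small sketch (a hypothetical helper, not part of the committer) that flags a plain field coexisting with a dotted field sharing its prefix:

```python
def find_dot_conflicts(doc):
    """Return field names that clash: a plain field 'foo'
    alongside a dotted field 'foo.X' (or deeper)."""
    names = set(doc)
    conflicts = set()
    for name in names:
        prefix = name
        # walk up the dotted hierarchy: foo.a.b -> foo.a -> foo
        while "." in prefix:
            prefix = prefix.rsplit(".", 1)[0]
            if prefix in names:
                conflicts.update({name, prefix})
    return conflicts

# {"foo": 1, "foo.X": 2} -> both names reported as conflicting
```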
Thus we cannot blindly pass along dots in field names. Consequently, the safest and simplest implementation is to always replace them with some never-offending character.
It should also be possible to query the Elasticsearch server for its version and then modify the committer's behaviour accordingly. Besides the additional round-trip this would necessitate, I would also like to avoid the added complexity of such a solution and—crucially—this still leaves us vulnerable to the second problem above.
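For reference, the server does report its version via the root endpoint (GET /), whose JSON response includes version.number. A sketch of parsing that response (the HTTP round-trip itself is omitted; the helper name is mine):

```python
import json

def parse_es_version(body):
    """Extract (major, minor, patch) from the JSON body
    returned by Elasticsearch's root endpoint (GET /)."""
    number = json.loads(body)["version"]["number"]
    return tuple(int(p) for p in number.split(".")[:3])

# Example fragment of a GET / response:
sample = '{"name": "node-1", "version": {"number": "5.6.3"}}'
# parse_es_version(sample) -> (5, 6, 3)
```

This illustrates the extra round-trip cost mentioned above, and why version sniffing alone would still not protect against the foo / foo.X conflict.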
This brings us to a potential third option. We could expose a config flag that needs to be explicitly set to indicate the desired behaviour. Such a flag could be replace_dots_in_field_names or elasticsearch_version. The latter might seem more future proof (perhaps at some point we will want to add additional version-specific behaviours), but at the same time it is also less explicit. For example, users might conceivably not want dots in their field names even if they were supported by their Elasticsearch version, such as for reasons of interoperability or the aforementioned second problem.
Overall, and assuming someone has time to implement this, I think that a replace_dots_in_field_names configuration option would be the best approach. This would accept a boolean value (false results in field names being sent to ES as-is; true would result in replacement with a default character, probably an underscore) or a sequence of characters with which to replace the dots. This should be a per-index setting that can be overridden per-crawler.
OK - then such a feature would be actively harmful. I will instead attempt to either:
- use a ScriptTagger to rename all fields that start with "DC." to "DC_", or
- write a RegexRenameTagger that will handle multiple fields matching a name, which may be easier for me to do since I will be able to test it rigorously.

See https://github.com/Norconex/committer-elasticsearch/issues/4. Dot replacement is now optional and you can decide what to replace them with (if anything).
Thank you - I noticed this in the code ;)
You can easily see that this page contains Dublin Core metadata: https://www.nlm.nih.gov/news/nlm_announces_departure_of_ncbi_director_david_lipman.html
So, Norconex will pass these along as fields such as DC.Date.Modified. We want these fields. The regular Elasticsearch committer will do a substitution automatically (I think via @essiembre's code below) so that it comes out as DC_Date_Modified. But is that so good? They can appear in any case, and whether the spec says so or not, someone will use them. So, having lots of fields in a rename tagger is also OK.

There's also the intriguing possibility of using this to create structure on the Elasticsearch side, converting this to "DC": { "Date": { "LastModified": value } } and so on. This is of course almost too magical - something will go wrong.
The simplest and easiest thing to do is to automatically rename the fields to have underscores. Is this worth doing, or is it better to do nothing?
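As a sketch, the automatic rename amounts to a simple prefix-guarded substitution (the actual Norconex taggers are configured in XML; this Python equivalent is for illustration only):

```python
import re

def rename_dc_field(name):
    """Rewrite dotted Dublin Core field names,
    e.g. 'DC.Date.Modified' -> 'DC_Date_Modified'.
    Names not starting with 'DC.' pass through unchanged."""
    if re.match(r"DC\.", name):
        return name.replace(".", "_")
    return name
```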