qualifiedName issues with graph repository?

cmgrote commented 4 years ago

In testing some changes for lineage, I've noticed some differences in the graph repository vs the in-memory repository when it comes to certain qualifiedNames.

Specifically, qualifiedNames that start with (or elsewhere contain) the following characters are problematic when running with a graph repository configured. This shows up as a knock-on effect higher up the stack: I don't see an exception from the graph repository itself, but the entities that have such qualifiedNames do not seem to be stored in the graph repository, because the exception I am able to see from the OMAS says that these entities do not exist:

The hash symbol #
The ampersand &

Running precisely the same code, OMASes, and data set with the in-memory repository configured in place of the graph repository shows no issues (all stored and retrievable as expected).

Similarly, if I change the qualifiedNames not to start with (or contain) these characters (for example, to use an underscore (_) instead) then the code, OMASes, and data set all work fine with both the graph repository and the in-memory repository.

So for now we can work around this; however, I suspect this could cause broader issues if there are specific characters that cannot be included at the start or elsewhere within a qualifiedName string in the graph repository (?)

grahamwallis commented 4 years ago

I have done some testing using an entity with qualifiedName containing '#' - either in the first character position or elsewhere in the string, with the same result in both cases, so I think position is not significant, it is the fact that there is a '#' character somewhere in the string.

In all cases the entity is stored in the graph repo and can be retrieved by GUID, for example. What is not working is searches. The findEntitiesxxx API is working fine and accepting the searchCriteria (subject to a couple of caveats noted below) but internally the indexer does not find it when we search using a string that contains a '#' character.

I have not (yet) tested with '&' but I believe that the same result will apply and that the cause (and solution) should be the same.

What's causing it?

Whether we configure the graph to use Lucene or Elastic - or indeed if we used Solr - we are always using the Lucene regex grammar. Lucene's regex grammar accepts all the 'standard' (i.e. conventional) regex modifiers and also has syntax options that provide a number of extensions. These extensions are supposed to be optional and as far as I can tell (the documentation is a bit thin) are disabled by default and therefore need to be explicitly enabled. Egeria does not do anything to enable the options, as the normal conventional regex grammar is adequate for Egeria. So it appears that either Lucene's defaults are not as stated in the documentation and the options are already enabled, or JanusGraph could be enabling the options. Either way, from Egeria code we do not appear to have the ability to turn the options off - there are 'syntax flags' on the (Java) RegExp constructor, but that is too low level for us. I cannot find a way to set the syntax flags in the expression itself, as you can, for example, for case-insensitive matches. On the plus side, the behaviour is (probably) consistent between the different indexers used behind JanusGraph, so at least the problem is consistent and I think it is possible to fix it.

The (not-so-)optional extensions that we are hitting with '#' and '&' are:

# -> enables the 'empty' part of the grammar, which I do not fully understand but I think it means that the '#' and anything that follows it is ignored, like a Python comment.
& -> enables 'intersection', which is rather like a logical AND condition between two expressions.

Possible fixes

1. We could try to find a way to disable the optional parts of the Lucene grammar. I am not sure if this is possible.
2. Assuming we cannot disable the optional extensions, we can embrace them as follows:

The graph repository connector assumes that the searchCriteria or string value used in a find request is a fully flexible regex - this provides maximum flexibility for a user who is familiar with and wants to exploit the functions provided by a general purpose regex. But it also comes with a cost in that the user must handle literalization of any special characters. So this style of interface is not very user-friendly but is very flexible. The user would need to explicitly literalize any '#' and '&' characters. I have tested this and it works as expected, without modification of the code, e.g. by setting searchCriteria to "name-containing-\#-character".

For a user who does not want to (or cannot) curate a carefully literalized regex we offer the RepositoryHelper methods that escape the whole expression (by framing it with \Q \E characters). When the graph repository 'sees' one of these framed, escaped expressions, it has to convert it to something that JanusGraph (and Lucene) will accept. The database and indexer do not support the \Q \E escaping syntax, so the graph repository connector parses the expression and specifically literalizes any characters that are (or should be) significant to the regex processor (in this case Lucene). The repository connector already literalizes the 'standard' special characters, but I can add literalization of the '#' and '&' characters too. In fact we should probably add literalization of all the enforced optional extensions if that is possible.
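To illustrate the conversion described above, here is a minimal sketch in Python. It is not the connector's actual code (which is Java); the function name `literalize_framed`, the set of 'standard' special characters, and the `extension_chars` parameter are all assumptions made for illustration.

```python
# Characters treated as special by conventional regex syntax.
# (Assumed set for this sketch; the connector's real list may differ.)
STANDARD_SPECIALS = set(r'\.[]{}()*+?^$|')


def literalize_framed(expr: str, extension_chars: str = "#&") -> str:
    r"""Convert a \Q...\E framed expression to a backslash-escaped literal.

    Hypothetical helper: expressions that are not framed are assumed to be
    'raw' regexes and are returned unchanged.
    """
    if not (expr.startswith(r"\Q") and expr.endswith(r"\E")):
        return expr
    inner = expr[2:-2]  # strip the \Q prefix and \E suffix
    specials = STANDARD_SPECIALS | set(extension_chars)
    # Prefix every significant character with a backslash.
    return "".join("\\" + ch if ch in specials else ch for ch in inner)


print(literalize_framed(r"\Qname-containing-#-character\E"))
```

Keeping the extension characters in a separate parameter means further characters from the Lucene optional syntax can be added without touching the standard set.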

I am assuming that your tests used \Q \E escaped expressions generated by the repo helper, but please let me know if that is an incorrect assumption.

Edit: I forgot to include the caveats alluded to above. These are that if you issue the request through a REST call, it is important to encode the characters - for example, when issuing a find from a Python notebook using REST, the # character needs to be encoded (as %23). Special characters can be specifically escaped (under the flexible regex scheme in the first para under section 2 above) using a simple '\' character. So, for example, a working query can be issued as follows:

..../instances/entities/by-property-value?searchCriteria="process-name-containing-\%23-char"

Note the escaping and encoding of the # character.
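For anyone building such a query from a Python notebook, the following is a minimal sketch of the escape-then-encode order using only the standard library. The helper name `encode_search_criteria` is hypothetical. Note that `urllib.parse.quote` also percent-encodes the backslash (as %5C), which decodes back to the same escaped string shown in the query above.

```python
from urllib.parse import quote, unquote


def encode_search_criteria(criteria: str) -> str:
    """Percent-encode a search string for use as a URL query value.

    Hypothetical helper; safe="" means nothing beyond unreserved
    characters is left unencoded.
    """
    return quote(criteria, safe="")


# Step 1: escape the '#' for the regex processor with a backslash.
escaped = "process-name-containing-\\#-char"

# Step 2: percent-encode the escaped string for the REST call.
encoded = encode_search_criteria(escaped)
print(encoded)  # process-name-containing-%5C%23-char

# Decoding recovers the escaped regex that the repository should receive.
decoded = unquote(encoded)
```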

grahamwallis commented 4 years ago

I just realised I never mentioned that this issue would have affected any string properties, not just qualifiedName.

I've made a change #3676 to add escaping of #, & and < characters. These were the only characters from the Lucene optional extensions that required escaping. Others (~, @, >) work correctly without being escaped. I have tested the fix with repo helper style \Q \E framed regular expressions and with 'raw' regular expressions and get the correct results in all cases.

Hopefully this should address the problem you saw - can you please retest?

odpi / egeria

qualifiedName issues with graph repository? #3563