quangis / geo-question-parser

Extract core concept transformations from geo-analytical questions.

Eliminate grammar parser/blockly interface overlap #1

Open nsbgn opened 1 year ago

nsbgn commented 1 year ago

The pressing issues with this part of the pipeline concern robustness, scalability, and testing. For the final product, we need a lot of simplifications. To organize and document the development, I will track it in the issue tracker here.

Currently, if I understand correctly, the procedure can be roughly sketched as follows. I will edit as I go along; please comment if I am mistaken.

  1. The question is cleaned. nltk is used to detect and clean adjectives like 'nearest', so that the important nouns can be isolated and recognized in subsequent steps.
  2. Important words in the questions are annotated.
    1. Recognize concepts, amounts and proportions via a pre-defined concept dictionary.
    2. Recognize place names via ELMo-based named entity recognition (NER) from allennlp.
    3. Recognize times, quantities and dates via NER from spaCy.
  3. Extract functional roles based on syntactic structures and connective words, via a grammar implemented in ANTLR. This yields a parse tree.
  4. Convert parse trees into transformations between concept types.
    1. Find input and output concept types by matching questions to categories that are associated with concept transformation models.
    2. The order of concept types is derived from the function of each phrase in which they occur: a subcondition is calculated before its condition, et cetera. A table with the order for each functional part is generated; these partial orders are then combined in a rule-based way (see Algorithm 1 in the paper).
  5. Transform concept types into cct types via manually constructed rules based on the concepts/extents/transformations that were found in previous steps.

The issue is that this is rather fragile; it depends (among other things) on:

We have chosen blockly to constrain the natural language at the user end, in such a way that the questions that may be presented to the parser are questions that the parser can handle. However, this only formats the question to reduce the problems of an otherwise unchanged natural language processing pipeline. As discussed in the meeting and elsewhere:

  1. Given that we already know the type of entity when constructing a query via blockly instead of freeform text, we will no longer need named entity recognition or question cleaning. This would strip out the nltk, spaCy, and allennlp packages, tremendously simplifying the process.
  2. To guarantee robustness, the visual blocks need to be in perfect accordance with the parser. For this, they should be automatically constructed from one common source of truth.
  3. In fact, given that the blockly-constructed query can output something different from what's written on the blocks, we might even forgo the natural language parser completely, in favour of JSON output at the blockly level (or another format that is easily parsed). This would eliminate even the ANTLR parser, further reducing complexity. The downside is that we would no longer be able to parse freeform text (though that would be impacted by the removal of named entity recognition anyway). We could describe this output with JSON Schema to really pin it down.
  4. To make sure that no regressions are introduced, we should have expected output for every step (that is, not just expected output from the whole pipeline).
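As a concrete (purely hypothetical) illustration of point 3, the Blockly interface could emit a structured payload like the one below; every field name here is invented, since the actual schema is exactly what would need to be designed:

```python
import json

# Hypothetical structured output from Blockly in place of natural language.
# The schema and field names are invented for illustration only.
payload = """
{
  "question": "amount",
  "concept": "population",
  "condition": {
    "relation": "inside",
    "extent": {"type": "place", "name": "Utrecht"}
  }
}
"""
question = json.loads(payload)
```

A JSON Schema document could then pin down exactly which fields and nestings are permitted, making "anything Blockly outputs is parseable" a checkable property.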

This would make this repository not so much a geo-question-parser as a geo-question-formulator. This is good, because the current code is very complex and closely fitted to the specific questions in the original corpus, which isn't acceptable in a situation where users can pose their own questions.

Note: If we simplify to this extent, it might be nice to use rdflib.js to output a transformation graph directly, but that is for later.

The process would thus become:

  1. In blockly, construct JSON that represents a question.
  2. Convert that question into transformations between concept types.
    1. Find input and output concept types by matching questions to transformation categories.
    2. Find concept type ordering.
  3. Transform concept types into cct types via rules.
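A minimal sketch of what step 2 might look like over structured input, with hypothetical field names ("concept", "parts"):

```python
# Hypothetical sketch of step 2: walk the structured question to derive an
# ordered chain of concept types. Inner parts (subconditions) are resolved
# before the node that uses them, echoing the ordering rule of Algorithm 1.

def concept_chain(node: dict) -> list[str]:
    """Post-order traversal: children's concepts before the parent's."""
    chain: list[str] = []
    for part in node.get("parts", []):
        chain.extend(concept_chain(part))
    chain.append(node["concept"])
    return chain

question = {
    "concept": "amount",
    "parts": [
        {"concept": "region"},
        {"concept": "population"},
    ],
}
```

Because the input is already structured, the ordering table and rule-based combination could plausibly shrink to a traversal like this, though the real rules are richer.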

I'm not sure to what extent we can still simplify step 2. Depending on how much code would be left, it would be nice to port/rewrite it in JavaScript, alongside blockly, so that we can visualize most things client-side and with minimal moving parts.

nsbgn commented 1 year ago

Thanks @HaiqiXu for the explanations in our meeting :) I have a better understanding now (I made some edits to the procedure sketch in my original post).

It turns out that our assumption (Simon's, Enkhbold's, mine) that the grammar parser and the Blockly interface do more or less the same thing (that is, implement the grammar) was not correct. This explains Haiqi's hesitation and means that unifying the grammar parser and the Blockly interface into a single module isn't as straightforward as we might have at first supposed.

That's because the recognition of the core concepts influences the way that grammatical components can combine.[^1] Using the concept dictionary, some words are converted into the concepts they represent. It is only then that the phrase is passed to the parser, which identifies the functional roles.

[^1]: This is not true for steps that only "clean" the questions, and it also does not seem to be true for the named entity recognition: wherever a named entity can occur, we should already know what type that entity has because of the block it occurs in.

This doesn't mean that we should keep the architecture as it is right now, because problems[^2] remain:

[^2]: Disregarding issues associated with the conversion of the parse tree into a graph or the conversion of concept types into cct types, since they're not due to the interface/parser.

I see two ways in which we can address these problems:

  1. We might be able to coax blockly into producing dynamic blocks, where the content of inner blocks (i.e. recognized concept types) can influence the output of outer blocks. We could then generate dynamic blocks from a straightforward representation of the grammar. This is elegant and would limit duplication, but could be difficult.

  2. Failing that, we should at least disentangle the functionalities of both systems:

    1. The Blockly interface has the exclusive responsibility of constraining natural language. It should produce a structured representation of important information, and anything it outputs should always be parseable (since the limited places where freeform text occurs can be labelled as such).

    2. The grammar has the exclusive responsibility of identifying functional roles. It would likely look similar to what it is now, but its input would be structured, so we don't have to worry about it being fragile.

I'm not yet sure about either of these approaches, so I'm going to play with Blockly and the grammar and see what makes sense.

nsbgn commented 1 year ago

Thinking about it more, I think the best approach is to create a declarative representation of the grammar that carries enough information to:

  1. Automatically generate a Blockly interface
  2. Automatically generate the procedures for determining the functional roles from Blockly's output

This way, the grammar can look much like Haiqi's original ANTLR implementation, with the associated ease of editing - but we can still get rid of the parsing step. Also, this would avoid spaghetti coding the two steps in an overzealous effort to unify them (despite the current overlap, there really are conceptually separate procedures at the cores of the two systems).
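To sketch what such a declarative representation could look like (every name and structure below is invented for illustration, not a design proposal):

```python
# Hypothetical declarative grammar rule from which both a Blockly block
# definition and a role-extraction procedure could be generated.

GRAMMAR = {
    "serviceObj": [
        {"role": "measure", "accepts": ["time", "quantity"]},
        {"role": "network", "accepts": ["networkQ"]},
        {"role": "origin", "accepts": ["place"], "optional": True},
    ],
}

def to_blockly_block(rule_name: str) -> dict:
    """Derive a (much simplified) Blockly block definition from one rule."""
    return {
        "type": rule_name,
        "args0": [{"type": "input_value", "name": slot["role"]}
                  for slot in GRAMMAR[rule_name]],
    }

def roles(rule_name: str, filled: dict) -> dict:
    """Read the functional roles straight off the same rule."""
    return {slot["role"]: filled.get(slot["role"])
            for slot in GRAMMAR[rule_name]}
```

The key property is that both artefacts come from the one rule table, so the interface and the role-extraction logic cannot drift apart.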

We will also need a tool to check whether all questions can still be represented with Blockly, because it would be a chore to test otherwise. This will be discussed in another issue when we get to it.

nsbgn commented 1 year ago

I have created a merge-with-interface branch in this repository, to track the process of unifying the two subprojects. This branch is relevant for all subsequent issues (#2, #3, #4, #5, #6, #7, #8, #9).

While I expect the main and haiqi branches to become legacy, they may still receive updates while work is done on the interface. They probably won't be removed either, since the way things are done there corresponds to how things are done in the papers. In other words, it still makes sense to keep working on these branches for now.

However, changes to the Blockly interface should probably not be made to https://github.com/quangis/quangis-web or https://github.com/HaiqiXu/haiqixu.github.io, but here, instead. (And, in time, removed from quangis-web.)

nsbgn commented 1 year ago

For reference: I can verify that the recognition of core concepts indeed influences the parsing process, as mentioned in this comment. Consider:

serviceObj
    :   ((time|quantity) 'and'?)+
        'of'? networkQ (('from'|'for'|'of') origin)?
        ('to' destination)?;

networkQ is a recognized entity that is explicitly kept separate from the other core concepts (coreC).
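A rough illustration (not the actual grammar) of why recognition must precede parsing: whether a simplified stand-in for the serviceObj rule applies depends entirely on the token class that the concept dictionary assigned to the phrase.

```python
# Hypothetical, heavily simplified stand-in for the serviceObj rule above:
# it needs a time or quantity token plus a networkQ token. The same surface
# words fail to match if "road network" was recognized as a plain coreC.

def matches_service_obj(tokens: list[str]) -> bool:
    """Does a token sequence satisfy the (simplified) serviceObj pattern?"""
    return ("time" in tokens or "quantity" in tokens) and "networkQ" in tokens
```

So the parser's input alphabet is partly made of concept classes, not raw words, which is exactly the coupling that makes the two systems hard to merge naively.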

nsbgn commented 1 year ago

Following up on the previous comment, we need a separate module for recognizing concepts. This module would have only one responsibility: converting a string into a concept.[^1] This is a difficult task on its own, so it's imperative that it is understandable and replaceable (!) in isolation, not muddled by other concerns like parsing a whole sentence. It can be wrapped in a service, depending on where it is needed.

[^1]: Ideally, the concept would be described by a URI.

Now, suppose that we accept, for example, a network in a block where the grammar would require a field. Is that a problem?

  1. If the concept inside the block fundamentally influences the context in which it can be applied --- which functional roles are ascribed to it by the grammar --- then things get a bit complicated, because we must limit the blocks to which it can be attached.

    1. We could give the user two otherwise identical blocks with different example text, and have them intuitively infer which one they can use. This places some of the burden of understanding GIS concepts on the user.
    2. We can have a dynamic block that communicates with the concept recognition module to dynamically change the block's shape depending on the recognized concept. This involves an uncomfortable amount of magic, both in implementation and in user experience.

    We can do a combination of these.

  2. If, instead, the block can still be sensibly used in all contexts, with the functional roles unaffected, then the concept recognition only serves to pin down the types. This can be done entirely on the server end after the user has finished constructing their queries, as a part of the conversion to the transformation graph.

  3. If the block can be used sensibly in most contexts, but not all, then we can still do everything on the server end and just throw an error in those few cases that the blockly construction does not accord with the grammar. In this case, we probably want to have the blocks output some extra information for verification, so that we don't have to maintain two grammar implementations just for those edge cases.
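For case 3, the server-side check could be as small as the following sketch (all names hypothetical): each block reports its claimed functional role, and constructions that the grammar would not permit for the recognized concept are rejected.

```python
# Hypothetical server-side verification for case 3: reject a block whose
# claimed functional role the grammar does not permit for its concept type.

ALLOWED_ROLES = {
    "networkQ": {"network"},
    "coreC": {"measure", "condition"},
}

def verify(block: dict) -> None:
    """Raise ValueError when a blockly construction violates the grammar."""
    allowed = ALLOWED_ROLES.get(block["concept_type"], set())
    if block["role"] not in allowed:
        raise ValueError(
            f"a {block['concept_type']} cannot fill the role {block['role']!r}")
```

This is the "extra information for verification" mentioned above: the blocks output their claimed role, so the server can check it against one table instead of a second grammar implementation.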

In any case, the concept recognizer should be isolated.