richardpaulhudson / holmes-extractor

Information extraction from English and German texts based on predicate logic

Question on the framework design #12

Closed riccardopinosio closed 1 year ago

riccardopinosio commented 1 year ago

Hi,

I have a few questions about Holmes as I'm reading the source code and trying to understand better how it works. It seems that the whole system is based on meta-rules that take the output of the spaCy neural models and map the parsed syntactic structure (I assume the one coming out of the dependency parser) to a graph representing the "semantic structure" of the sentence, which is then matched against the search phrases. The blog post https://explosion.ai/blog/introduction-to-holmes says that "linguistic phenomena like relative clauses [...] can give rise to semantic graph structures that are not trees. The fact that mainstream parsing algorithms are designed to generate trees is one reason why we rely on the combination of standard spaCy models and meta-rules to generate Holmes semantic structures, as opposed to attempting to train a model to produce them directly from raw text."
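
For context, my current mental model of the basic usage pattern, based on the README (the model name, search phrase and document text below are just placeholders), is roughly:

```python
import holmes_extractor as holmes

# Placeholder model and labels, purely for illustration.
manager = holmes.Manager(model='en_core_web_lg')

# A search phrase is parsed into a semantic structure once...
manager.register_search_phrase('A dog chases a cat')

# ...and each document is parsed into a semantic structure that is then
# matched structurally against the registered search phrases.
manager.parse_and_register_document(
    document_text='The dog that lived next door chased the cat.',
    label='example')

for match in manager.match():
    print(match)
```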

So if I understand correctly, the meta-rules are supposed to "bridge the gap" between the output of the dependency parser and the DAGs that Holmes uses for matching. I do wonder, however, how this compares with AMR parsing approaches like https://aclanthology.org/2021.emnlp-main.507.pdf (see also https://github.com/IBM/transition-amr-parser), which produce DAGs capturing the meaning of a sentence directly from text. Would you expect Holmes to be potentially less accurate but faster than these approaches on large corpora? Would you expect Holmes' approach to be more flexible?
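
To make the comparison a bit more concrete, here is a small sketch of the kind of reentrant structure an AMR parser outputs directly, using the third-party `penman` library (nothing to do with Holmes); the variable `b` is reused, so the graph is a DAG rather than a tree:

```python
import penman  # pip install penman

# AMR for "The boy wants to go": the variable b (boy) is the ARG0 of both
# want-01 and go-02, so the graph is reentrant, i.e. a DAG, not a tree.
graph = penman.decode("""
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
            :ARG0 b))
""")

# Each triple is (source, role, target); b occurs as a target twice,
# which is exactly what a single dependency tree cannot express.
for triple in graph.triples:
    print(triple)
```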

Lastly, Holmes seems to be written in pure Python. Would you expect significant speed improvements if parts of it were ported to e.g. C or Rust, or is the main computational bottleneck in the spaCy models themselves (or in coreferee)?
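
In case it helps to frame that last question, this is roughly how I was planning to measure where the time goes on a larger corpus (model, search phrase and corpus below are placeholders, and the split is only indicative because the parsing phase also includes Holmes' own semantic analysis):

```python
import time
import holmes_extractor as holmes

manager = holmes.Manager(model='en_core_web_lg')       # placeholder model
manager.register_search_phrase('A dog chases a cat')   # placeholder phrase

corpus = {f'doc{i}': 'The dog that lived next door chased the cat.'
          for i in range(1000)}                         # stand-in for a real corpus

start = time.perf_counter()
for label, text in corpus.items():
    # spaCy pipeline + coreferee + Holmes' own semantic analysis
    manager.parse_and_register_document(document_text=text, label=label)
parsing_seconds = time.perf_counter() - start

start = time.perf_counter()
matches = manager.match()                               # Holmes' pure-Python matching
matching_seconds = time.perf_counter() - start

print(f'parsing/registration: {parsing_seconds:.1f}s, matching: {matching_seconds:.1f}s')
```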

richardpaulhudson commented 1 year ago

Hi @riccardopinosio, thanks for your interest, comments and questions.

I think if I were starting out with Holmes now rather than in 2017 (although the library was open-sourced in 2019, it's actually around two years older than that), AMRs would have been an obvious candidate to investigate as a source for the semantic structures. However, they're not strictly equivalent to the semantic structures Holmes produces, so you couldn't just swap one out for the other; and I've no idea whether they would actually perform better or worse in terms of accuracy or flexibility in an equivalent library written specifically around them. One important point is that spaCy, and by extension Holmes, are designed to be usable by people without specialist hardware, and I've no idea whether AMR generation is even feasible on a CPU.

One point I should make is that Holmes semantic structures are not necessarily DAGs. Obviously a correct semantic structure should be one, but because the meta-rules use features from spaCy dependency parses, morphological analyses, etc., and each spaCy component is trained independently of the others, there are some rare cases where semantic structures end up containing cycles. Rather than trying to prevent cycles from forming when the semantic structures are being generated, Holmes detects and handles them at matching time.
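
Just to illustrate the general idea with a simplified sketch (this is not the actual Holmes code): if you think of a semantic structure as a mapping from token indexes to child indexes, keeping a visited set during traversal at matching time is enough to stop a rare cycle from causing an infinite loop:

```python
# Simplified sketch of cycle-tolerant traversal; not the actual Holmes code.
# The semantic structure is represented as token index -> child indexes.
semantic_children = {
    0: [1],
    1: [2],
    2: [0],  # a rare mis-parse has produced the cycle 0 -> 1 -> 2 -> 0
    3: [],
}

def descendants(index, children, visited=None):
    """Yield all nodes reachable from index, breaking out of any cycle."""
    if visited is None:
        visited = {index}
    for child in children.get(index, []):
        if child in visited:
            continue  # cycle detected: skip the edge instead of recursing forever
        visited.add(child)
        yield child
        yield from descendants(child, children, visited)

print(list(descendants(0, semantic_children)))  # [1, 2]: the edge back to 0 is ignored
```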

Speed was not a central requirement of the project in which Holmes came into being, and I'm sure that it could be made much faster by writing parts of it in Cython with native methods.