microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

How to cluster words into semantic entities, when performing information extraction? #923

Open FrancoisNoyez opened 1 year ago

FrancoisNoyez commented 1 year ago

Hi everyone,

I've got a question regarding information extraction from forms. I'm not sure if this is the right place to ask it; don't hesitate to tell me if you think this post should be made elsewhere instead.

I've been searching the web about this for some time now, but I have not found a satisfactory answer yet. My objective is to automatically extract the content of a form as a list of (QUESTION; ANSWER) pairs.

There are tutorials and notebooks dedicated to showing how to perform information extraction using LayoutLMForTokenClassification from the "transformers" Python library (ex: here; or here).

However, with this, all we achieve is the labeling of words. To actually achieve my desired result, we need to do two more things:

A] First, we need to group words together into semantic entities; i.e. to group together the words which make up the name of a field / which make up a QUESTION, or the words which constitute an ANSWER (for instance: first the word with the value and then the word for the unit, e.g. "3.5 kg").

B] Then, we need to determine / extract the relations between those entities. For instance, we need to find the relation linking a semantic entity which is a QUESTION to the semantic entity which is its associated ANSWER (which may also not exist).
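To make this concrete, the end result I'm after would be something like the following (the field names and values here are just made up for illustration):

    # Made-up example of the target output: one (QUESTION; ANSWER) pair per form field.
    extracted_pairs = [
        {"question": "Weight:", "answer": "3.5 kg"},
        {"question": "Date of birth:", "answer": "12/04/1987"},
        {"question": "Comments:", "answer": None},  # a QUESTION may have no associated ANSWER
    ]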

There already exist some resources dedicated to performing the relation-extraction part, e.g. by you, the people at Microsoft working on unilm, who implemented the LayoutLMv2ForRelationExtraction model (one tutorial notebook is available here). In fact, one PR had been opened on the "transformers" repo in order to add this model class. The latest version of the PR is here.

However, in order to work, this model assumes that we already know the semantic entities, hence the need for step A]. But then, how do we achieve this step? That's what I'm struggling with. The output of LayoutLMForTokenClassification will look something like this: display_words_boxes

However, to use the RelationExtraction model, we need something like this instead. display_entities_boxes

Sure, we have this info for labelled data, but not at inference time, for a whole new document.

So, how do we do that for new documents? The order in which the words are output by the OCR may not be consistent with the order in which we actually need to consider them, if we just rely on the label values to perform the decoding / the building of the semantic entities.

We can model the problem in various ways: notably, it can be thought of as finding the edges between the nodes of a directed graph, where the indegree and outdegree of each vertex are at most 1, and there is no cycle.

I have crafted a simple algorithm based notably on computing the distance between the bounding boxes of the words, to account for the information contained in the spatial locations of the words relative to one another, but it's far from perfect, and there are cases where it will fail to produce the correct result.
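For reference, here is a rough sketch of the kind of heuristic I mean (the helper names and the word/label structure are made up for the example); each resulting chain of words corresponds to a path in the graph view above, with every word having at most one predecessor and one successor:

    # Illustrative sketch only: group labelled words into candidate semantic entities
    # by chaining words that share a label and are spatially close.
    # `words` is assumed to be a list of dicts: {"text": str, "label": str, "box": [x0, y0, x1, y1]}.

    def box_distance(a, b):
        # Gap between two boxes (0 along an axis on which they overlap).
        dx = max(b[0] - a[2], a[0] - b[2], 0)
        dy = max(b[1] - a[3], a[1] - b[3], 0)
        return (dx ** 2 + dy ** 2) ** 0.5

    def group_words(words, max_gap=50):
        # Sort roughly in reading order: top-to-bottom, then left-to-right.
        ordered = sorted(range(len(words)), key=lambda i: (words[i]["box"][1], words[i]["box"][0]))
        entities, current = [], []
        for i in ordered:
            if current:
                prev = words[current[-1]]
                same_label = prev["label"] == words[i]["label"]
                close_enough = box_distance(prev["box"], words[i]["box"]) <= max_gap
                if not (same_label and close_enough):
                    entities.append(current)  # close the previous chain, start a new entity
                    current = []
            current.append(i)
        if current:
            entities.append(current)
        return entities  # each entity is a list of word indices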

It seems to me like it's the job of the LayoutLM-based model in the first place to consider both the semantic information and the spatial information of the words, in order to perform the proper labeling within the LayoutLMForTokenClassification model (notably, to distinguish the "B-" labels from the "I-" labels, if we refer to the FUNSD dataset), so it's a shame that this info ("to which semantic entity does each token / each word belong") is not output by it.
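To be clear about what I mean by relying on the label values: if the words were already in a reliable reading order, building the entities would simply be standard BIO decoding of the predicted labels. A rough sketch, assuming FUNSD-style tags and words already sorted in reading order:

    # `predictions` is assumed to be a list of (word_text, label) pairs in reading order,
    # with labels such as "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER", "O".
    def decode_entities(predictions):
        entities = []  # list of {"label": ..., "words": [...]}
        current = None
        for word, label in predictions:
            if label == "O":
                current = None
                continue
            prefix, _, tag = label.partition("-")
            if prefix == "B" or current is None or current["label"] != tag:
                current = {"label": tag, "words": []}
                entities.append(current)
            current["words"].append(word)
        return entities

    # decode_entities([("DATE", "B-QUESTION"), ("OF", "I-QUESTION"), ("SERVICE:", "I-QUESTION"),
    #                  ("7/25/2022", "B-ANSWER")])
    # -> [{"label": "QUESTION", "words": ["DATE", "OF", "SERVICE:"]},
    #     {"label": "ANSWER", "words": ["7/25/2022"]}]

The problem is precisely that the reading order assumed by such a decoding is not guaranteed by the OCR output.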

Does anyone have any idea regarding how to carry out step A]? Notably, do we need to craft a dedicated model, as was done for LayoutLMForTokenClassification and LayoutLMv2ForRelationExtraction, in order to achieve this? Or is it possible to somehow re-use / upgrade the LayoutLMForTokenClassification model, in order to produce outputs which allow carrying out the semantic-entity construction task?

gregbugaj commented 1 year ago

I have implemented this in my project marie-ai (work in progress). The code is modular, so you can extract it and use each piece independently.

Here is a reference to the code where you can check how it was implemented: https://github.com/gregbugaj/marie-ai/blob/main/marie/executor/ner/ner_extraction_executor.py

In the documentation /docs/models/named-entity-recognition you can see how the config and data have been structured.

In a nutshell, I annotated a dataset using CVAT, converted it from the COCO format supplied by CVAT into a FUNSD-like format, and then fine-tuned using unilm.

In the model we have specific tags rather than the generic Question/Answer tags that FUNSD has; there is a tool there that will do that conversion.
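Roughly, the conversion amounts to remapping each COCO annotation to a FUNSD-style entry. The snippet below is only a simplified sketch of the shape of that mapping, not the actual tool from the repo (it assumes a single-image COCO file and that each annotation corresponds to one semantic entity):

    import json

    # Simplified sketch of a COCO -> FUNSD-like conversion (not the actual marie-ai tool).
    def coco_to_funsd(coco_path, out_path):
        with open(coco_path) as f:
            coco = json.load(f)
        categories = {c["id"]: c["name"] for c in coco["categories"]}
        form = []
        for i, ann in enumerate(coco["annotations"]):
            x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
            form.append({
                "id": i,
                "label": categories[ann["category_id"]],  # e.g. "member_name" / "member_name_answer"
                "box": [int(x), int(y), int(x + w), int(y + h)],  # FUNSD uses [x0, y0, x1, y1]
                "text": "",      # filled in later from OCR
                "words": [],     # word-level text/boxes, filled in later from OCR
                "linking": [],   # question/answer links, e.g. derived from question_answer_map
            })
        with open(out_path, "w") as f:
            json.dump({"form": form}, f, indent=2)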

Sample Usage

  from marie.executor import NerExtractionExecutor
  from marie.utils.image_utils import hash_file

  # setup executor
  models_dir = "/mnt/data/models/"
  executor = NerExtractionExecutor(models_dir)

  img_path = "/tmp/sample.png"
  checksum = hash_file(img_path)

  # invoke executor
  docs = None
  kwa = {"checksum": checksum, "img_path": img_path}
  results = executor.extract(docs, **kwa)

  print(results)

Config snippet:

{
    "question_answer_map" : {
        "member_name": "member_name_answer",
        "member_number": "member_number_answer",
        "pan": "pan_answer",
        "dos": "dos_answer",
        "patient_name": "patient_name_answer"
    }
}

Results

 {
      "page": 0,
      "category": "DOS",
      "value": {
        "question": {
          "line": 13,
          "key": "DOS",
          "bbox": [
            97.774,
            1975.2,
            355.074,
            55.964
          ],
          "score": 0.999998,
          "text": {
            "text": "DATE OF SERVICE:",
            "confidence": 0.9999
          }
        },
        "answer": {
          "line": 13,
          "key": "DOS_ANSWER",
          "bbox": [
            532.611,
            1975.2,
            432.264,
            52.672
          ],
          "score": 0.999642,
          "text": {
            "text": "7/25/2022 - 7/25/2022",
            "confidence": 0.9997
          }
        }
      }
 }

"So, how do we do that for new documents? The order in which the words is output by the OCR may not be consistent with the order in which we actually need to consider the words, if we just rely on the label values to perform the decoding / the build of the semantic entities."

For this I have a custom TextExtractionExecutor that performs bounding-box detection, line aggregation and ICR. Again, you can look at the source.

Example :

        # docs and kwa are the same kind of inputs as in the NerExtractionExecutor example above
        executor = TextExtractionExecutor()
        results = executor.extract(docs, **kwa)

        print(results)
        store_json_object(results, os.path.join("/tmp/fragments", "results.json"))

Results

      {
        "id": 188,
        "text": "2699.00",
        "confidence": 0.9999,
        "box": [
          1897,
          1998,
          145,
          29
        ],
        "line": 60,
        "word_index": 240
      },

 {
        "line": 3,
        "wordids": [
          223,
          225,
          229
        ],
        "text": "PO BOX 39034",
        "bbox": [
          2402,
          164,
          107,
          30
        ],
        "confidence": 0.9989
      },

This gives you the text of each line in the right order, along with their word ids.
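Once you have both the word-level results and the line aggregation shown above, restoring the reading order is essentially a sort on (line, word_index). An illustrative snippet (word_results stands in for the word-level result list shown above):

    # Illustrative: restore reading order from the word-level results shown above.
    words_in_order = sorted(word_results, key=lambda w: (w["line"], w["word_index"]))
    for w in words_in_order:
        print(w["line"], w["word_index"], w["text"])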

FrancoisNoyez commented 1 year ago

Hi gregbugaj,

Thank you for your answer.

I've looked at and investigated the code of marie-ai that you referred to. If I got this right:

What you said about the TextExtractionExecutor is interesting: basically, you tackled the issue by training a model specifically to output the data such that the assumptions made by the process I've described above are met. I tried to run the text extractor on a document of my own, to see what the output would look like and to make sure that I had properly understood the concept of 'line' used by the above process. But of course I lack the proper model weights ("Loading from ./model_zoo/unilm/dit/text_detection/td-syn_dit-l_mrcnn.pth"), so in the end this failed and I was not able to do that.

But anyway, the gist of it is that it seems like we indeed need a machine-learning model to process the data after performing token classification, in order to prepare it for the next step of the pipeline, and notably for (QUESTION; ANSWER) extraction. In that case, I think I will investigate working from the LayoutLMForTokenClassification model, in order to build another model whose role would be to directly output the semantic entities, or at least to associate to each token an id identifying its semantic entity, as well as its position within it. Indeed, assuming that the LayoutLMForTokenClassification model does its job properly, it already needs to implicitly know this info in order to distinguish between tokens whose label should begin with 'B-' and those whose label should begin with 'I-'; so in theory it should just be a matter of making this info available in the output of the model.

Anyway, thanks again for sharing your work regarding how you tackled this problem!