swerik-project / riksdagen-records


The segmentation error problem in the records #11

Open MansMeg opened 6 months ago

MansMeg commented 6 months ago

We have run into segmentation errors in several contexts. They originate in the OCR process, which sometimes fails to identify the boundaries between paragraphs. As a result, part of the corpus is not sectioned correctly: different sections (headers, margin notes, introductions, etc.) are merged, or segments are incorrectly split.

To fix this we need to do two things:

  1. Annotate a segmentation dataset to estimate the general segmentation quality, i.e. the number of incorrectly split and merged text segments. I think this is already done.
  2. Train a BERT model that both identifies incorrectly split segments so they can be merged, and identifies incorrectly merged segments and predicts where they should be split.

This is probably a somewhat larger endeavour.
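
For the quality estimate in step 1, here is a minimal sketch of how the incorrectly split and merged segments could be counted. It assumes the annotated dataset can be read as two lists of segment strings for the same page, one from the OCR output and one from the annotators; the names `boundaries`, `compare` and `SegmentationErrors` and the sample text are made up for illustration and are not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentationErrors:
    wrong_splits: int     # OCR boundaries that the annotators did not mark
    wrong_merges: int     # annotated boundaries that the OCR output missed
    ocr_boundaries: int   # total number of boundaries in the OCR output
    gold_boundaries: int  # total number of boundaries in the annotation

    @property
    def split_error_rate(self) -> float:
        return self.wrong_splits / self.ocr_boundaries if self.ocr_boundaries else 0.0

    @property
    def merge_error_rate(self) -> float:
        return self.wrong_merges / self.gold_boundaries if self.gold_boundaries else 0.0


def boundaries(segments: list[str]) -> set[int]:
    """Offsets (counted in non-whitespace characters) where one segment ends
    and the next begins. Ignoring whitespace keeps the offsets comparable
    even if the two segmentations differ in spacing or line breaks."""
    offsets: set[int] = set()
    position = 0
    for segment in segments[:-1]:  # no boundary after the last segment
        position += len("".join(segment.split()))
        offsets.add(position)
    return offsets


def compare(ocr_segments: list[str], gold_segments: list[str]) -> SegmentationErrors:
    """Compare an OCR segmentation with an annotated one for the same text."""
    ocr, gold = boundaries(ocr_segments), boundaries(gold_segments)
    return SegmentationErrors(
        wrong_splits=len(ocr - gold),
        wrong_merges=len(gold - ocr),
        ocr_boundaries=len(ocr),
        gold_boundaries=len(gold),
    )


# Made-up example: the OCR output breaks the second sentence in the wrong
# place, which yields one spurious boundary and one missed boundary.
gold = ["Herr talman!", "Jag yrkar bifall till motionen.", "Anf. 12"]
ocr = ["Herr talman! Jag yrkar", "bifall till motionen.", "Anf. 12"]
result = compare(ocr, gold)
print(result.wrong_splits, result.wrong_merges)  # 1 1
```

Comparing boundary offsets rather than whole segments means that one misplaced break counts as one wrong split plus one wrong merge, which matches the two error types named in step 1. This only works if both versions contain the same characters, so it would need adjusting if the annotators also corrected OCR character errors.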