proycon / analiticcl

an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction
GNU General Public License v3.0

Implement support for outputting unicode point offsets instead of UTF-8 byte offsets (was: find_all_matches shifting offsets?) #15

Closed HoekR closed 2 years ago

HoekR commented 2 years ago

In Python I have a VariantModel called xmodel, built from the lexicon (fn) schutte_1672_names.txt:


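import os
from collections import defaultdict
from analiticcl import VariantModel, Weights, SearchParameters

# bdir (analiticcl checkout), fn (lexicon path) and resolutions_1672 are defined elsewhere
result = defaultdict(list)
resoluties = resolutions_1672
xmodel = VariantModel(os.path.join(bdir, "examples/simple.alphabet.tsv"), Weights(), debug=False)
xmodel.read_lexicon(fn)
xmodel.build()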

text: Ontfangen een Missive van het Collegie ter Admiraliteijt opde Maze, geschreven tot Rotterdam, den 28en. deses, houdende ingevolge ende tot voldoeninge van haer Ho:Mo: resolutie vanden 17en. daer te vooren der selver consideratien ende advis op de reqte. van Lijsbeth Andries Huijsvrouw ende Maertie Jans, moeder van Jan Jansz van Delff, versoeckende dat de voorn Jan Jansz, uijt het spin„ huijs der Stadt Rotterdam souden mogen werden ontslagen, mits hem in dienst van desen Staet te water off te Lande begevende, en is bij die occasie mede gelesen de nadere requeste vande voorsr. Lijsbeth Andries en Maertie Jansz, noghmaels de voorsr ontslaginge versoeckende: Waerop gedelibereert sijnde, Is goetgevonden ende ver„ staen, dat int voorsr versoeck niet en can werden getreden.

xmodel.find_all_matches(text, SearchParameters(max_edit_distance=3))

result:


[{'input': 'Ontfangen', 'offset': {'begin': 0, 'end': 9}, 'variants': []},
 {'input': 'een', 'offset': {'begin': 10, 'end': 13}, 'variants': []},
 {'input': 'Missive', 'offset': {'begin': 14, 'end': 21}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 22, 'end': 25}, 'variants': []},
 {'input': 'het', 'offset': {'begin': 26, 'end': 29}, 'variants': []},
 {'input': 'Collegie', 'offset': {'begin': 30, 'end': 38}, 'variants': []},
 {'input': 'ter', 'offset': {'begin': 39, 'end': 42}, 'variants': []},
 {'input': 'Admiraliteijt',
  'offset': {'begin': 43, 'end': 56},
  'variants': []},
 {'input': 'opde', 'offset': {'begin': 57, 'end': 61}, 'variants': []},
 {'input': 'Maze', 'offset': {'begin': 62, 'end': 66}, 'variants': []},
 {'input': 'geschreven', 'offset': {'begin': 68, 'end': 78}, 'variants': []},
 {'input': 'tot', 'offset': {'begin': 79, 'end': 82}, 'variants': []},
 {'input': 'Rotterdam', 'offset': {'begin': 83, 'end': 92}, 'variants': []},
 {'input': 'den', 'offset': {'begin': 94, 'end': 97}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 100, 'end': 102}, 'variants': []},
 {'input': 'deses', 'offset': {'begin': 104, 'end': 109}, 'variants': []},
 {'input': 'houdende', 'offset': {'begin': 111, 'end': 119}, 'variants': []},
 {'input': 'ingevolge', 'offset': {'begin': 120, 'end': 129}, 'variants': []},
 {'input': 'ende', 'offset': {'begin': 130, 'end': 134}, 'variants': []},
 {'input': 'tot', 'offset': {'begin': 135, 'end': 138}, 'variants': []},
 {'input': 'voldoeninge',
  'offset': {'begin': 139, 'end': 150},
  'variants': []},
 {'input': 'van', 'offset': {'begin': 151, 'end': 154}, 'variants': []},
 {'input': 'haer', 'offset': {'begin': 155, 'end': 159}, 'variants': []},
 {'input': 'Ho', 'offset': {'begin': 160, 'end': 162}, 'variants': []},
 {'input': 'Mo', 'offset': {'begin': 163, 'end': 165}, 'variants': []},
 {'input': 'resolutie', 'offset': {'begin': 167, 'end': 176}, 'variants': []},
 {'input': 'vanden', 'offset': {'begin': 177, 'end': 183}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 186, 'end': 188}, 'variants': []},
 {'input': 'daer', 'offset': {'begin': 190, 'end': 194}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 195, 'end': 197}, 'variants': []},
 {'input': 'vooren', 'offset': {'begin': 198, 'end': 204}, 'variants': []},
 {'input': 'der', 'offset': {'begin': 205, 'end': 208}, 'variants': []},
 {'input': 'selver', 'offset': {'begin': 209, 'end': 215}, 'variants': []},
 {'input': 'consideratien',
  'offset': {'begin': 216, 'end': 229},
  'variants': []},
 {'input': 'ende', 'offset': {'begin': 230, 'end': 234}, 'variants': []},
 {'input': 'advis', 'offset': {'begin': 235, 'end': 240}, 'variants': []},
 {'input': 'op', 'offset': {'begin': 241, 'end': 243}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 244, 'end': 246}, 'variants': []},
 {'input': 'reqte', 'offset': {'begin': 247, 'end': 252}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 254, 'end': 257}, 'variants': []},
 {'input': 'Lijsbeth', 'offset': {'begin': 258, 'end': 266}, 'variants': []},
 {'input': 'Andries', 'offset': {'begin': 267, 'end': 274}, 'variants': []},
 {'input': 'Huijsvrouw', 'offset': {'begin': 275, 'end': 285}, 'variants': []},
 {'input': 'ende', 'offset': {'begin': 286, 'end': 290}, 'variants': []},
 {'input': 'Maertie', 'offset': {'begin': 291, 'end': 298}, 'variants': []},
 {'input': 'Jans', 'offset': {'begin': 299, 'end': 303}, 'variants': []},
 {'input': 'moeder', 'offset': {'begin': 305, 'end': 311}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 312, 'end': 315}, 'variants': []},
 {'input': 'Jan', 'offset': {'begin': 316, 'end': 319}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 320, 'end': 325}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 326, 'end': 329}, 'variants': []},
 {'input': 'Delff', 'offset': {'begin': 330, 'end': 335}, 'variants': []},
 {'input': 'versoeckende',
  'offset': {'begin': 337, 'end': 349},
  'variants': []},
 {'input': 'dat', 'offset': {'begin': 350, 'end': 353}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 354, 'end': 356}, 'variants': []},
 {'input': 'voorn', 'offset': {'begin': 357, 'end': 362}, 'variants': []},
 {'input': 'Jan', 'offset': {'begin': 363, 'end': 366}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 367, 'end': 372}, 'variants': []},
 {'input': 'uijt', 'offset': {'begin': 374, 'end': 378}, 'variants': []},
 {'input': 'het', 'offset': {'begin': 379, 'end': 382}, 'variants': []},
 {'input': 'spin', 'offset': {'begin': 383, 'end': 387}, 'variants': []},
 {'input': 'huijs', 'offset': {'begin': 391, 'end': 396}, 'variants': []},
 {'input': 'der', 'offset': {'begin': 397, 'end': 400}, 'variants': []},
 {'input': 'Stadt', 'offset': {'begin': 401, 'end': 406}, 'variants': []},
 {'input': 'Rotterdam', 'offset': {'begin': 407, 'end': 416}, 'variants': []},
 {'input': 'souden', 'offset': {'begin': 417, 'end': 423}, 'variants': []},
 {'input': 'mogen', 'offset': {'begin': 424, 'end': 429}, 'variants': []},
 {'input': 'werden', 'offset': {'begin': 430, 'end': 436}, 'variants': []},
 {'input': 'ontslagen', 'offset': {'begin': 437, 'end': 446}, 'variants': []},
 {'input': 'mits', 'offset': {'begin': 448, 'end': 452}, 'variants': []},
 {'input': 'hem', 'offset': {'begin': 453, 'end': 456}, 'variants': []},
 {'input': 'in', 'offset': {'begin': 457, 'end': 459}, 'variants': []},
 {'input': 'dienst', 'offset': {'begin': 460, 'end': 466}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 467, 'end': 470}, 'variants': []},
 {'input': 'desen', 'offset': {'begin': 471, 'end': 476}, 'variants': []},
 {'input': 'Staet', 'offset': {'begin': 477, 'end': 482}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 483, 'end': 485}, 'variants': []},
 {'input': 'water', 'offset': {'begin': 486, 'end': 491}, 'variants': []},
 {'input': 'off', 'offset': {'begin': 492, 'end': 495}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 496, 'end': 498}, 'variants': []},
 {'input': 'Lande', 'offset': {'begin': 499, 'end': 504}, 'variants': []},
 {'input': 'begevende', 'offset': {'begin': 505, 'end': 514}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 516, 'end': 518}, 'variants': []},
 {'input': 'is', 'offset': {'begin': 519, 'end': 521}, 'variants': []},
 {'input': 'bij', 'offset': {'begin': 522, 'end': 525}, 'variants': []},
 {'input': 'die', 'offset': {'begin': 526, 'end': 529}, 'variants': []},
 {'input': 'occasie', 'offset': {'begin': 530, 'end': 537}, 'variants': []},
 {'input': 'mede', 'offset': {'begin': 538, 'end': 542}, 'variants': []},
 {'input': 'gelesen', 'offset': {'begin': 543, 'end': 550}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 551, 'end': 553}, 'variants': []},
 {'input': 'nadere', 'offset': {'begin': 554, 'end': 560}, 'variants': []},
 {'input': 'requeste', 'offset': {'begin': 561, 'end': 569}, 'variants': []},
 {'input': 'vande', 'offset': {'begin': 570, 'end': 575}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 576, 'end': 582}, 'variants': []},
 {'input': 'Lijsbeth', 'offset': {'begin': 584, 'end': 592}, 'variants': []},
 {'input': 'Andries', 'offset': {'begin': 593, 'end': 600}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 601, 'end': 603}, 'variants': []},
 {'input': 'Maertie', 'offset': {'begin': 604, 'end': 611}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 612, 'end': 617}, 'variants': []},
 {'input': 'noghmaels', 'offset': {'begin': 619, 'end': 628}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 629, 'end': 631}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 632, 'end': 638}, 'variants': []},
 {'input': 'ontslaginge',
  'offset': {'begin': 639, 'end': 650},
  'variants': []},
 {'input': 'versoeckende',
  'offset': {'begin': 651, 'end': 663},
  'variants': []},
 {'input': 'Waerop', 'offset': {'begin': 665, 'end': 671}, 'variants': []},
 {'input': 'gedelibereert',
  'offset': {'begin': 672, 'end': 685},
  'variants': []},
 {'input': 'sijnde', 'offset': {'begin': 686, 'end': 692}, 'variants': []},
 {'input': 'Is', 'offset': {'begin': 694, 'end': 696}, 'variants': []},
 {'input': 'goetgevonden',
  'offset': {'begin': 697, 'end': 709},
  'variants': []},
 {'input': 'ende', 'offset': {'begin': 710, 'end': 714}, 'variants': []},
 {'input': 'ver', 'offset': {'begin': 715, 'end': 718}, 'variants': []},
 {'input': 'staen', 'offset': {'begin': 722, 'end': 727}, 'variants': []},
 {'input': 'dat', 'offset': {'begin': 729, 'end': 732}, 'variants': []},
 {'input': 'int', 'offset': {'begin': 733, 'end': 736}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 737, 'end': 743}, 'variants': []},
 {'input': 'versoeck', 'offset': {'begin': 744, 'end': 752}, 'variants': []},
 {'input': 'niet', 'offset': {'begin': 753, 'end': 757}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 758, 'end': 760}, 'variants': []},
 {'input': 'can', 'offset': {'begin': 761, 'end': 764}, 'variants': []},
 {'input': 'werden', 'offset': {'begin': 765, 'end': 771}, 'variants': []},
 {'input': 'getreden', 'offset': {'begin': 772, 'end': 780}, 'variants': []}]

which has a shifting offset, presumably because of the „ character (Unicode '\u201e'). For example, the last reported input ('getreden') has 'offset': {'begin': 772, 'end': 780}, but Python's text.find('getreden') reports 768.

text[772:780] is 'eden.'

Should this be considered a mismatch between the Python and the analiticcl text representation, or are the offsets shifting?
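
The size of the shift can be checked in plain Python, without analiticcl. The following is a minimal sketch on a shortened fragment of the text above: it compares the code point offset that str.find reports with the offset of the same word in the UTF-8 encoded bytes. The variable names are only illustrative.

```python
# Fragment of the text above, containing a single U+201E („) character.
snippet = "ende ver„ staen, dat int voorsr versoeck niet en can werden getreden."

# Offset counted in Unicode code points (what Python string indexing uses).
char_begin = snippet.find("getreden")

# Offset counted in UTF-8 bytes (what the reported offsets appear to use).
byte_begin = snippet.encode("utf-8").find(b"getreden")

# U+201E takes three bytes in UTF-8 but counts as one code point,
# so every occurrence before a word shifts its byte offset by two.
quote_width = len("„".encode("utf-8"))
shift = byte_begin - char_begin

print(quote_width, shift)
```

The full text contains two „ characters before 'getreden', which would account for the difference of 4 between 772 and 768.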

proycon commented 2 years ago

Should this be considered a mismatch between the Python and the analiticcl text representation, or are the offsets shifting?

Yes, very good point. Analiticcl returns UTF-8 byte offsets (mentioned explicitly in the README too), whereas Python string slices use Unicode code points, so there is indeed a mismatch. I should probably implement an option that makes analiticcl return Unicode code point offsets, which would probably make more sense as the default in at least the Python binding. From a data-representation perspective, Unicode code points would be the most elegant option too. It does come at a slight performance penalty, which is why I'm not using them internally.

@brambg: This is relevant for our Golden Agents to Web Annotation export too, since, if I'm not mistaken, Web Annotations represent offsets with Unicode code points as well (and rightly so).
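
Until such an option exists, byte offsets can be converted on the Python side. The following is a minimal sketch, not part of analiticcl: build_byte_to_char_map and to_codepoint_offsets are hypothetical helper names, and the byte offsets in byte_matches are hand-computed for a short fragment rather than real output. It assumes every offset falls on a character boundary.

```python
from typing import Dict, List


def build_byte_to_char_map(text: str) -> Dict[int, int]:
    """Map each UTF-8 byte offset at a character boundary to its code point offset."""
    mapping = {}
    byte_pos = 0
    for char_pos, ch in enumerate(text):
        mapping[byte_pos] = char_pos
        byte_pos += len(ch.encode("utf-8"))
    mapping[byte_pos] = len(text)  # offset just past the last character
    return mapping


def to_codepoint_offsets(text: str, matches: List[dict]) -> List[dict]:
    """Return copies of the matches with byte offsets replaced by code point offsets."""
    table = build_byte_to_char_map(text)
    converted = []
    for match in matches:
        offset = match["offset"]
        new_offset = {"begin": table[offset["begin"]], "end": table[offset["end"]]}
        converted.append({**match, "offset": new_offset})
    return converted


# Hypothetical input: same shape as the result above, offsets counted in bytes.
text = "ende ver„ staen"
byte_matches = [
    {"input": "ver", "offset": {"begin": 5, "end": 8}, "variants": []},
    {"input": "staen", "offset": {"begin": 12, "end": 17}, "variants": []},
]

fixed = to_codepoint_offsets(text, byte_matches)
for match in fixed:
    begin, end = match["offset"]["begin"], match["offset"]["end"]
    print(match["input"], begin, end, repr(text[begin:end]))
```

The lookup table is built once per text, so converting many matches costs a single pass over the string rather than one re-encoding per offset.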

proycon commented 2 years ago

I have implemented support for this now, to be released in the upcoming 0.4.0 release. You'll need to enable it explicitly though, using the --unicode-offsets parameter (or unicodeoffsets=True from Python, as a keyword argument to SearchParameters).

proycon commented 2 years ago

Implemented and released