rbawden / ModFr-Norm


alignment produced by the HF pipeline broken #6

Closed: andbue closed this issue 8 months ago

andbue commented 9 months ago

Hi, thanks for providing this pipeline! I've run into a weird problem: the "alignment" in the output is totally different from the one in README.md when I run the example with transformers 4.35.2, e.g. on Colab:

from transformers import pipeline
normaliser = pipeline(model="rbawden/modern_french_normalisation", batch_size=32, beam_size=5, cache_file="./cache.pickle", trust_remote_code=True)

list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = normaliser(list_inputs)
print(list_outputs)

>> [{'text': 'Elle haïssait particulièrement le Cardinal de Lorraine;', 
'alignment': [([0, 4], [0, 4]), ([4, 5], [4, 4]), ([5, 13], [4, 12]), ([13, 14], [12, 12]), ([14, 30], [12, 28]), ([30, 31], [28, 28]), ([31, 33], [28, 30]), ([33, 34], [30, 30]), ([34, 42], [30, 38]), ([42, 43], [38, 38]), ([43, 45], [38, 40]), ([45, 46], [40, 40]), ([46, 54], [40, 48]), ([54, 55], [48, 49])]}, 
{'text': "Adieu, j'irai chez vous tantôt vous rendre grâce.", 
'alignment': [([0, 5], [0, 5]), ([5, 6], [5, 6]), ([6, 7], [6, 6]), ([7, 9], [6, 8]), ([9, 13], [8, 12]), ([13, 14], [12, 12]), ([14, 18], [12, 16]), ([18, 19], [16, 16]), ([19, 23], [16, 20]), ([23, 24], [20, 20]), ([24, 31], [20, 26]), ([31, 32], [26, 26]), ([32, 36], [26, 30]), ([36, 37], [30, 30]), ([37, 43], [30, 36]), ([43, 44], [36, 36]), ([44, 49], [36, 41]), ([49, 50], [41, 42])]}]

Here, for comparison, is the output shown in README.md:

>> [{'text': 'Elle haïssait particulièrement le Cardinal de Lorraine; ', 
'alignment': [([0, 3], [0, 3]), ([5, 12], [5, 12]), ([14, 29], [14, 29]), ([31, 32], [31, 32]), ([34, 41], [34, 41]), ([43, 44], [43, 44]), ([46, 53], [46, 53]), ([54, 54], [54, 54])]}, 
{'text': "Adieu, j'irai chez vous tantôt vous rendre grâce. ", 
'alignment': [([0, 4], [0, 4]), ([5, 5], [5, 5]), ([7, 8], [7, 8]), ([9, 12], [9, 12]), ([14, 17], [14, 17]), ([19, 22], [19, 22]), ([24, 30], [24, 29]), ([32, 35], [31, 34]), ([37, 42], [36, 41]), ([44, 48], [43, 47]), ([49, 49], [48, 48])]}]

As you can see, all the indices are off by one, and it seems that whitespace in the prediction is ignored. Do you have any idea what the issue could be? In pipeline.py, the fragment "# if character not in [' ']]" is commented out; could that be related?

andbue commented 9 months ago

Maybe I don't understand the purpose of the alignment, but in get_char_idx_align, all whitespace in the aligned strings is eliminated via .strip() and thereby counted as zero length. In addition to that, the function post_cleaning can change the length of words in the prediction, which is not taken into account in get_char_idx_align, since the latter only counts the characters in the alignment.
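To make the point above concrete, here is a minimal sketch of how stripping the predicted pieces would give whitespace a zero-length span. This is not the actual get_char_idx_align code: the function char_spans, its parameters, and the hand-written list of aligned pieces are assumptions made for illustration only.

```python
import unicodedata


def char_spans(aligned_pairs, strip_prediction):
    """Turn aligned (source, prediction) pieces into character spans.

    Hypothetical helper: it only mimics the counting behaviour described
    in the comment above, not the pipeline's real implementation.
    """
    spans = []
    src_pos = pred_pos = 0
    for src_piece, pred_piece in aligned_pairs:
        if strip_prediction:
            # A whitespace-only piece becomes "" and so has length 0.
            pred_piece = pred_piece.strip()
        src_end = src_pos + len(src_piece)
        pred_end = pred_pos + len(pred_piece)
        spans.append(([src_pos, src_end], [pred_pos, pred_end]))
        src_pos, pred_pos = src_end, pred_end
    return spans


# Hand-written aligned pieces for the start of the first example sentence.
# NFC normalisation keeps accented letters as single code points.
pairs = [
    ("Elle", "Elle"),
    (" ", " "),
    (unicodedata.normalize("NFC", "haïſſoit"), unicodedata.normalize("NFC", "haïssait")),
]

with_strip = char_spans(pairs, strip_prediction=True)
without_strip = char_spans(pairs, strip_prediction=False)
print(with_strip)
print(without_strip)
```

With stripping enabled, the first three spans match the start of the output reported above: the space in the prediction gets the empty span [4, 4], and every later prediction index is shifted.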

rbawden commented 9 months ago

Hi there! Thanks for your message and for bringing this up! I'll try to have a look at this and get back to you ASAP (with corrections), although with the holidays, this could be in the new year.

rbawden commented 8 months ago

Hi @andbue and happy new year!

I have had a look over the code and made a minor correction in the get_char_idx_align function: I commented out line 777 of the HuggingFace pipeline, which had the effect of changing whitespace around the prediction (https://huggingface.co/rbawden/modern_french_normalisation/blob/3208620b35cd86157337e983e74bb1aee076478f/pipeline.py#L777). This seems to produce a correct alignment for the examples I tested.

A second change I made was actually to the README, as it was not up to date with respect to the code. The two differences are:

  1. The alignment produced by the code also takes into account alignment of whitespace, which was necessary in some cases (e.g. when whitespace is aligned with non-whitespace).
  2. The indices indicate the inter-character points. This means that in the example "Elle haïſſoit particulierement", the span "Elle" is associated with the character indices [0, 4]. This is also handier for indexing the span directly in the original string (e.g. input_sent[span[0]:span[1]]).

This appears to work for the examples I tested, but please let me know if you see any more problems!
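As a small illustration of the inter-character convention described in point 2, the sketch below slices both strings with the span indices. The alignment list is written by hand for this example and is not actual pipeline output; span_text and to_inter_character are hypothetical helper names.

```python
import unicodedata


def span_text(sentence, span):
    """Return the substring covered by an inter-character span [start, end]."""
    start, end = span
    return sentence[start:end]


def to_inter_character(inclusive_span):
    """Convert an inclusive [first, last] span (old README style) to [start, end]."""
    first, last = inclusive_span
    return [first, last + 1]


# NFC normalisation keeps accented letters as single code points,
# so the indices below line up with Python string positions.
input_sent = unicodedata.normalize("NFC", "Elle haïſſoit particulierement")
output_sent = unicodedata.normalize("NFC", "Elle haïssait particulièrement")

# Hand-built alignment for illustration; whitespace is aligned too.
alignment = [
    ([0, 4], [0, 4]),
    ([4, 5], [4, 5]),
    ([5, 13], [5, 13]),
    ([13, 14], [13, 14]),
    ([14, 30], [14, 30]),
]

aligned_text = [
    (span_text(input_sent, src), span_text(output_sent, tgt))
    for src, tgt in alignment
]
for src_piece, tgt_piece in aligned_text:
    print(repr(src_piece), "->", repr(tgt_piece))
```

Because each span ends where the next one begins, concatenating the source pieces gives back the input sentence, and an inclusive span such as [0, 3] from the older README corresponds to [0, 4] here.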

andbue commented 8 months ago

A happy new year to you as well! Thank you for the quick fix, the alignments in my tests are looking correct now.