Closed andbue closed 8 months ago
Maybe I don't understand the purpose of the alignment, but in get_char_idx_align, all whitespace in the aligned strings is eliminated via .strip() and thereby counted as zero length. In addition to that, the function post_cleaning can change the length of words in the prediction, which is not taken into account in get_char_idx_align, since the latter only counts the characters in the alignment.
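To illustrate the effect described here, below is a minimal sketch of how character spans drift once whitespace is counted as zero length. It is not the pipeline's code: the helper char_spans, its count_spaces flag and the example sentence are all hypothetical and only mimic the symptom.

```python
def char_spans(tokens, count_spaces=True):
    """Return end-exclusive (start, end) character spans for tokens
    that were separated by single spaces in the original string.

    Hypothetical helper, not part of pipeline.py.
    """
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        # Advance past the token, and past the separating space
        # only if whitespace is counted.
        pos += len(tok) + (1 if count_spaces else 0)
    return spans


sent = "Oui, il est vrai"  # made-up example sentence
tokens = sent.split(" ")

with_spaces = char_spans(tokens, count_spaces=True)
without_spaces = char_spans(tokens, count_spaces=False)

# Counting whitespace: every span slices back to its token.
recovered = [sent[s:e] for s, e in with_spaces]

# Ignoring whitespace: each span after the first starts too early,
# and the error grows by one per preceding space.
drifted = [sent[s:e] for s, e in without_spaces]
```

In this sketch, recovered equals the token list, while drifted no longer lines up with the tokens from the second word onwards.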
Hi there! Thanks for your message and for bringing this up! I'll try to have a look at this and get back to you ASAP (with corrections), although with the holidays, this could be in the new year.
Hi @andbue and happy new year!
I have had a look over the code and made a minor correction in the get_char_idx_align function (commenting out line 777 in the HuggingFace pipeline, which had the effect of changing whitespace around the prediction) (https://huggingface.co/rbawden/modern_french_normalisation/blob/3208620b35cd86157337e983e74bb1aee076478f/pipeline.py#L777). This seems to produce a correct alignment for the examples I tested.
A second change I made was actually to the README, as it was not up-to-date with respect to the code. The two differences:
[0, 4]. This can also be handier for indexing the span directly in the original string (e.g. input_sent[span[0]:span[1]]).
This appears to work for the examples I tested, but please let me know if you see any more problems!
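As a small sketch of the indexing mentioned above, assuming spans are [start, end] pairs with an exclusive end (which is what the slicing expression suggests): the sentence below is made up for illustration, and only the span [0, 4] and the slicing expression come from this thread.

```python
input_sent = "Oui, il est vrai"  # made-up example sentence
span = [0, 4]                    # assumed end-exclusive: characters 0, 1, 2, 3

# With an exclusive end index, the span can be used as a slice directly,
# and its length is simply end - start.
word = input_sent[span[0]:span[1]]
length = span[1] - span[0]
```

Here word is "Oui," and length is 4, so no +1 adjustment is needed when slicing.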
A happy new year to you as well! Thank you for the quick fix, the alignments in my tests are looking correct now.
Hi, thanks for providing this pipeline! I've run into a weird problem: the "alignment" in the output is totally different from the one in README.md when I run the example with transformers 4.35.2, e.g. on Colab:
For comparison, here is your output from README.md:
As you can see, all the indices are off by one and it seems that whitespace in the prediction is ignored. Do you have any idea what could be the issue here? In pipeline.py there is a bit, # if character not in [' ']], commented out, could that be somehow related?