Open caryknoop opened 9 months ago
Might be something to do with the word tokeniser. Needs investigation.
A kludgy solution is to write a magic string (i.e. something that does not naturally exist in a text - for instance 'xx2yyjjkkll4324') after a paragraph and filter it after redlining.
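For what it's worth, the magic-string workaround could look roughly like the sketch below. The helper names `mark_paragraphs` and `strip_sentinel` are made up for illustration and are not part of redlines. Whether the sentinel actually stops the tokeniser from gluing the paragraph marker onto the previous word is an assumption that has not been checked here.

```python
import re

# Hypothetical sentinel: any string that cannot occur in the real text would do.
SENTINEL = "xx2yyjjkkll4324"


def mark_paragraphs(text: str, sentinel: str = SENTINEL) -> str:
    """Append ' <sentinel>' to every paragraph, including the last one."""
    # Put the sentinel in front of each blank-line paragraph break...
    marked = re.sub(r"\n{2,}", lambda m: f" {sentinel}{m.group(0)}", text)
    # ...and after the final paragraph, which has no break following it.
    return f"{marked} {sentinel}"


def strip_sentinel(text: str, sentinel: str = SENTINEL) -> str:
    """Filter the sentinel back out of redlined output."""
    # Remove the padded form first, then any sentinel left bare by inserted markup.
    return text.replace(f" {sentinel}", "").replace(sentinel, "")


original = "This is a sentence\n\nexamples"
marked = mark_paragraphs(original)
restored = strip_sentinel(marked)
```

The idea would be to run both inputs through `mark_paragraphs` before handing them to the redliner, then pass the redlined output through `strip_sentinel`.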
@caryknoop To check my bearings... could you comment on what you expect the answer to be for your reproduction?
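from redlines import Redlines

# Plain strings are enough here; nothing is interpolated, so no f-prefix is needed.
text1 = "This is a sentence\n\nexamples"
text2 = "This is a sentence"
redlined = Redlines(text1, text2, markdown_style=None)
result = redlined.output_markdown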
I am thinking it could be:
This is a sentence¶¶examples
Investigators gonna be investigating:
Table shows various tokenisers and their outputs and expected redlines.
Method | Test sequence 1 | Test sequence 2 | Expected redline |
---|---|---|---|
Original redlines | ['This ', 'is ', 'a ', 'sentence¶examples'] | ['This ', 'is ', 'a ', 'sentence'] | This is a |
tiktoken | [2028, 374, 264, 11914, 271, 52768] | [2028, 374, 264, 11914] | This is a sentence |
spacy (en_core_web_sm) | [This, is, a, sentence, \n\n , examples] | [This, is, a, sentence] | This is a sentence |
GPT2 (through HF) | [1212, 318, 257, 6827, 198, 198, 1069, 12629] | [1212, 318, 257, 6827] | This is a sentence |
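To make the difference in the table concrete, here is a small sketch that feeds two of the token sequences above into `difflib.SequenceMatcher` from the standard library. It assumes the redline is derived from a sequence diff over tokens; the helper `token_opcodes` is made up for illustration and is not part of redlines.

```python
from difflib import SequenceMatcher


def token_opcodes(a, b):
    """Return (tag, tokens_from_a, tokens_from_b) for each diff operation."""
    matcher = SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2]) for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


# "Original redlines" row: the paragraph marker is fused onto the last word.
fused_1 = ["This ", "is ", "a ", "sentence¶examples"]
fused_2 = ["This ", "is ", "a ", "sentence"]

# spacy row: the paragraph break is a token of its own.
split_1 = ["This", "is", "a", "sentence", "\n\n", "examples"]
split_2 = ["This", "is", "a", "sentence"]

fused_ops = token_opcodes(fused_1, fused_2)
split_ops = token_opcodes(split_1, split_2)

# Fused tokens: 'sentence¶examples' is replaced wholesale by 'sentence'.
print(fused_ops[-1])  # ('replace', ['sentence¶examples'], ['sentence'])
# Separate tokens: 'sentence' stays equal, only the break and 'examples' are deleted.
print(split_ops[-1])  # ('delete', ['\n\n', 'examples'], [])
```

With the fused token, "sentence" never matches on its own, which would explain why it shows up as both deleted and re-inserted.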
Seems that BPE tokenisers give more consistent output (although they require you to download a model).
Should we change the tokenizer? Or give others an option to choose between the regex solution or the BPE one? :thinking:
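One way the "let users choose" option could look is a tokeniser passed in as a plain callable. This is only a sketch using the standard library: `word_tokenize`, `paragraph_aware_tokenize` and `diff_ops` are hypothetical names, the two regex tokenisers are simplified stand-ins rather than the current redlines tokeniser, and a BPE tokeniser could be plugged in the same way if it returns a list of string tokens.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def word_tokenize(text: str) -> List[str]:
    """Each word keeps its trailing whitespace, including paragraph breaks."""
    return re.findall(r"\S+\s*", text)


def paragraph_aware_tokenize(text: str) -> List[str]:
    """Like word_tokenize, but a blank-line break becomes its own token.

    Leading whitespace inside a paragraph is dropped in this simplified version.
    """
    tokens: List[str] = []
    for part in re.split(r"(\n{2,})", text):
        if re.fullmatch(r"\n{2,}", part):
            tokens.append(part)
        else:
            tokens.extend(re.findall(r"\S+\s*", part))
    return tokens


def diff_ops(
    source: str, test: str, tokenizer: Tokenizer = word_tokenize
) -> List[Tuple[str, str, str]]:
    """Diff two texts token by token using whichever tokeniser is supplied."""
    a, b = tokenizer(source), tokenizer(test)
    matcher = SequenceMatcher(None, a, b)
    return [
        (tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


text1 = "This is a sentence\n\nexamples"
text2 = "This is a sentence"

default_ops = diff_ops(text1, text2)
aware_ops = diff_ops(text1, text2, tokenizer=paragraph_aware_tokenize)
```

Keeping the regex tokeniser as the default would avoid forcing a model download on everyone, while still leaving room for a BPE-based callable.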
To reproduce:
Result:
This is a <span style='color:red;font-weight:700;text-decoration:line-through;'>sentence <br><br> examples</span><span style='color:green;font-weight:700;'>sentence</span>