houfu / redlines

Show the differences between two strings/text as a compact text, in markdown/HTML, in the terminal and more.
https://houfu.github.io/redlines/
MIT License

Should we implement a new tokenizer? #29

Open caryknoop opened 9 months ago

caryknoop commented 9 months ago

To reproduce:

```python
from redlines import Redlines

def main():
    text1 = "This is a sentence\n\nexamples"
    text2 = "This is a sentence"

    redlined = Redlines(text1, text2)
    result = redlined.output_markdown
    result = result.replace('¶', '<br><br>')
    with open('text.html', 'w', encoding='utf-8') as file:
        file.write(result)

if __name__ == '__main__':
    main()
```

Result:

This is a sentence

examplessentence

```html
This is a <span style='color:red;font-weight:700;text-decoration:line-through;'>sentence <br><br> examples</span><span style='color:green;font-weight:700;'>sentence</span>
```

houfu commented 9 months ago

Might be something to do with the word tokeniser. Needs investigation.
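The fused `sentence¶examples` run is consistent with a whitespace-only split. Here is a minimal sketch of that behaviour (not the library's actual code; `naive_tokenize` is a made-up name for illustration):

```python
import re

def naive_tokenize(text):
    # Rewrite paragraph breaks as a pilcrow, then split on whitespace,
    # keeping each word's trailing space. Because the pilcrow is not
    # whitespace, "sentence" and "examples" fuse into a single token.
    text = text.replace("\n\n", "¶")
    return re.findall(r"\S+\s*", text)

print(naive_tokenize("This is a sentence\n\nexamples"))
```

A fused token can never match the standalone `sentence` token from the second text, so the differ reports a replacement of the whole run.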

caryknoop commented 9 months ago

A kludgy workaround is to write a magic string (e.g. something that does not naturally occur in the text, such as 'xx2yyjjkkll4324') after each paragraph break and filter it out after redlining.
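A minimal sketch of that workaround, using only string manipulation (the helper names `protect`/`unprotect` are made up for illustration, and `MAGIC` is assumed never to occur in the input texts):

```python
MAGIC = "xx2yyjjkkll4324"  # assumed never to occur in the input texts

def protect(text):
    # Mark each paragraph break with the magic string so a word
    # tokeniser sees the text after the break as a separate token.
    return text.replace("\n\n", "\n\n" + MAGIC + " ")

def unprotect(markdown):
    # Strip the magic string from the redlined output.
    return markdown.replace(MAGIC + " ", "").replace(MAGIC, "")
```

The texts would be passed through `protect` before constructing `Redlines`, and the output through `unprotect` afterwards.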

houfu commented 9 months ago

@caryknoop To check my bearings... could you confirm what you expect the output to be for your reproduction?

```python
text1 = "This is a sentence\n\nexamples"
text2 = "This is a sentence"

redlined = Redlines(text1, text2, markdown_style=None)
result = redlined.output_markdown
```

I am thinking it could be:

This is a sentence¶¶examples

houfu commented 9 months ago

Investigators gonna be investigating:

The table below shows various tokenisers, their outputs, and the expected redlines.

| Method | Test sequence 1 | Test sequence 2 | Expected redline |
| --- | --- | --- | --- |
| Original redlines | `['This ', 'is ', 'a ', 'sentence¶examples']` | `['This ', 'is ', 'a ', 'sentence']` | This is a sentence¶examplessentence |
| tiktoken | `[2028, 374, 264, 11914, 271, 52768]` | `[2028, 374, 264, 11914]` | This is a sentence \n\nexamples |
| spaCy (`en_core_web_sm`) | `[This, is, a, sentence, \n\n, examples]` | `[This, is, a, sentence]` | This is a sentence \n\nexamples |
| GPT-2 (through HF) | `[1212, 318, 257, 6827, 198, 198, 1069, 12629]` | `[1212, 318, 257, 6827]` | This is a sentence \n\nexamples |
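The difference between the first row and the rest can be reproduced with the stdlib `difflib.SequenceMatcher` (used here as a stand-in comparison engine; the token lists are taken from the table above):

```python
from difflib import SequenceMatcher

fused = ["This ", "is ", "a ", "sentence¶examples"]          # original redlines
split = ["This ", "is ", "a ", "sentence", "¶", "examples"]  # paragraph-aware
target = ["This ", "is ", "a ", "sentence"]

# The fused token forces a replace of the whole word...
print(SequenceMatcher(None, fused, target).get_opcodes()[-1])
# ...while the split stream yields a clean deletion of the paragraph tail.
print(SequenceMatcher(None, split, target).get_opcodes()[-1])
```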

It seems that the BPE tokenisers give more consistent output (although they require you to download a model).

Should we change the tokenizer, or give users the option to choose between the regex solution and a BPE one? :thinking:
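If the choice is exposed as an option, one possible shape (purely hypothetical; Redlines has no such parameter today) is a callable tokenizer argument. The `regex_tokenize` default below treats paragraph breaks as standalone tokens:

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]

def regex_tokenize(text: str) -> List[str]:
    # Paragraph breaks become standalone tokens instead of fusing
    # with the neighbouring word.
    return re.findall(r"\n\n|\S+", text)

def diff_opcodes(a: str, b: str, tokenizer: Tokenizer = regex_tokenize):
    # Any callable returning a token list can be swapped in
    # (e.g. a spaCy or BPE wrapper).
    return SequenceMatcher(None, tokenizer(a), tokenizer(b)).get_opcodes()

print(diff_opcodes("This is a sentence\n\nexamples", "This is a sentence"))
```

With this shape, users who don't want a model download keep the regex default, and others can plug in a heavier tokeniser.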