houfu / redlines

Show the differences between two strings/text as a compact text, in markdown/HTML, in the terminal and more.
https://houfu.github.io/redlines/
MIT License

Should we implement a new tokenizer? #29

Open caryknoop opened 9 months ago

caryknoop commented 9 months ago

To reproduce:

```python
from redlines import Redlines

def main():
    text1 = "This is a sentence\n\nexamples"
    text2 = "This is a sentence"

    redlined = Redlines(text1, text2)
    result = redlined.output_markdown
    result = result.replace('¶', '<br><br>')
    with open('text.html', 'w', encoding='utf-8') as file:
        file.write(result)

if __name__ == '__main__':
    main()
```

Result:

This is a sentence

examplessentence

```html
This is a <span style='color:red;font-weight:700;text-decoration:line-through;'>sentence <br><br> examples</span><span style='color:green;font-weight:700;'>sentence</span>
```

houfu commented 9 months ago

Might be something to do with the word tokeniser. Needs investigation.
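The fused `sentence¶examples` run is consistent with a whitespace-only split. Here is a minimal sketch of that behaviour (not the library's actual code; `naive_tokenize` is a made-up name for illustration):

```python
import re

def naive_tokenize(text):
    # Rewrite paragraph breaks as a pilcrow, then split on whitespace,
    # keeping each word's trailing space. Because the pilcrow is not
    # whitespace, "sentence" and "examples" fuse into a single token.
    text = text.replace("\n\n", "¶")
    return re.findall(r"\S+\s*", text)

print(naive_tokenize("This is a sentence\n\nexamples"))
```

A fused token can never match the standalone `sentence` token from the second text, so the differ reports a replacement of the whole run.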

caryknoop commented 9 months ago

A kludgy workaround is to write a magic string (e.g. something that does not naturally occur in the text, such as 'xx2yyjjkkll4324') after each paragraph break and filter it out after redlining.
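A minimal sketch of that workaround, using only string manipulation (the helper names `protect`/`unprotect` are made up for illustration, and `MAGIC` is assumed never to occur in the input texts):

```python
MAGIC = "xx2yyjjkkll4324"  # assumed never to occur in the input texts

def protect(text):
    # Mark each paragraph break with the magic string so a word
    # tokeniser sees the text after the break as a separate token.
    return text.replace("\n\n", "\n\n" + MAGIC + " ")

def unprotect(markdown):
    # Strip the magic string from the redlined output.
    return markdown.replace(MAGIC + " ", "").replace(MAGIC, "")
```

The texts would be passed through `protect` before constructing `Redlines`, and the output through `unprotect` afterwards.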

houfu commented 9 months ago

@caryknoop To check my bearings... could you confirm what you expect the output to be for your reproduction?

```python
text1 = "This is a sentence\n\nexamples"
text2 = "This is a sentence"

redlined = Redlines(text1, text2, markdown_style=None)
result = redlined.output_markdown
```

I am thinking it could be:

This is a sentence¶¶examples

houfu commented 9 months ago

Investigators gonna be investigating:

The table below shows various tokenisers, their outputs, and the expected redlines.

| Method | Test sequence 1 | Test sequence 2 | Expected redline |
| --- | --- | --- | --- |
| Original redlines | `['This ', 'is ', 'a ', 'sentence¶examples']` | `['This ', 'is ', 'a ', 'sentence']` | This is a sentence¶examplessentence |
| tiktoken | `[2028, 374, 264, 11914, 271, 52768]` | `[2028, 374, 264, 11914]` | This is a sentence \n\nexamples |
| spaCy (`en_core_web_sm`) | `[This, is, a, sentence, \n\n, examples]` | `[This, is, a, sentence]` | This is a sentence \n\nexamples |
| GPT-2 (through HF) | `[1212, 318, 257, 6827, 198, 198, 1069, 12629]` | `[1212, 318, 257, 6827]` | This is a sentence \n\nexamples |
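The difference between the first row and the rest can be reproduced with the stdlib `difflib.SequenceMatcher` (used here as a stand-in comparison engine; the token lists are taken from the table above):

```python
from difflib import SequenceMatcher

fused = ["This ", "is ", "a ", "sentence¶examples"]          # original redlines
split = ["This ", "is ", "a ", "sentence", "¶", "examples"]  # paragraph-aware
target = ["This ", "is ", "a ", "sentence"]

# The fused token forces a replace of the whole word...
print(SequenceMatcher(None, fused, target).get_opcodes()[-1])
# ...while the split stream yields a clean deletion of the paragraph tail.
print(SequenceMatcher(None, split, target).get_opcodes()[-1])
```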

It seems that the BPE tokenisers give more consistent output (although they require you to download a model).

Should we change the tokenizer, or give users the option to choose between the regex solution and a BPE one? :thinking:
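If the choice is exposed as an option, one possible shape (purely hypothetical; Redlines has no such parameter today) is a callable tokenizer argument. The `regex_tokenize` default below treats paragraph breaks as standalone tokens:

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]

def regex_tokenize(text: str) -> List[str]:
    # Paragraph breaks become standalone tokens instead of fusing
    # with the neighbouring word.
    return re.findall(r"\n\n|\S+", text)

def diff_opcodes(a: str, b: str, tokenizer: Tokenizer = regex_tokenize):
    # Any callable returning a token list can be swapped in
    # (e.g. a spaCy or BPE wrapper).
    return SequenceMatcher(None, tokenizer(a), tokenizer(b)).get_opcodes()

print(diff_opcodes("This is a sentence\n\nexamples", "This is a sentence"))
```

With this shape, users who don't want a model download keep the regex default, and others can plug in a heavier tokeniser.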