segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License
707 stars 40 forks source link

Newlines are treated like spaces #131

Open markus583 opened 3 weeks ago

markus583 commented 3 weeks ago

If we pass in newlines to sat.split, it will be treated like a space since the tokenizer does not know it (and it should not, otherwise the embedding would break). The text is still reconstructed fine since the character probabilities from the tokenizer's space are transferred to the newline.

While we will not adapt the core functionality here, should we add some error logging if \n occurs within the input text? @bminixhofer @igorsterner

bminixhofer commented 2 weeks ago

While we will not adapt the core functionality here

Why not? I feel like the most intuitive option would be to always break on newline characters. And are you sure it is treated like a space? From https://github.com/segment-any-text/wtpsplit/issues/133:

I replaced all the newlines with spaces and it seems work.

markus583 commented 2 weeks ago

The newline character is still preserved in the text, so we don't treat it in any special way.

    model = SaT("sat-9l")

    text = "This is a\n test sentence. This is another test sentence."

    out = model.split(text)
    print(out)

results in: ['This is a\n test sentence.', 'This is another test sentence.']

But this

    for sent in out:
        print(sent)

results in:

This is a
 test sentence.

This is another test sentence.

Notice the extra newline here! It is because it prints This is a\n test sentence..

So it is ambiguous now; that's why I raised the issue. If we were to split on this newline, too, the list would be of length 3. We could add this as a post-processing step (within the model, \n is just handled as a space). Or add a warning.

bminixhofer commented 2 weeks ago

And the model sees this as "This is a test sentence", correct? I still don't understand the result from #133 in that case, @markus583 do you?

Irrespective of that, in my opinion the most natural way would be to split on newlines by default (so ["This is a", "test sentence", "This is another sentence."] in the above example) and have a flag treat_newline_as_space which restores the previous behavior. What do you think?

markus583 commented 1 week ago

Yes! Or, "This is a test sentence" (double space), to be very precise. No, I'm also a bit puzzled by #133. I'll ask for clarification.

Probably, yes - sounds very reasonable to me. Thanks! I'll add the feature when I find some time.