Does wordbeamsearch allow for languages without spacing?

thetruejacob commented 3 years ago

I've currently been using SimpleHTR in my day to day work, and the WBS functionality has been very useful. The issue is this: I'm using it mainly for the Thai language, where there are no spaces between words within a sentence. Something weird happens in the below sentence. Actual label: คุณต้องระวังอะไรเป็นพิเศษ? (the words in this sentence are: คุณ ต้อง ระวัง อะไร เป็น พิเศษ)

Bestpath prediction: คุณต้องระวังอะไรเป็นพิเศษ7

WBS prediction: คุณ.ต้อง-ระวัง%มร(ปีน9เศษ7

As you might be able to see, the words predicted are generally correct, but non-word characters are introduced between them as if they were stopword characters or punctuation. Is there any workaround for this? I understand that WBS was likely designed with Latin-script languages in mind, which have natural gaps between words, but is there a way to not force the introduction of non-word characters as punctuation for other languages?

githubharald commented 3 years ago

yes, that's by design. When a word is finished, there must be a non-word-character. I don't see a quick work-around for that.

If a word is finished, it can either be extended by a non-word-char, or by a character to get to an even longer word. E.g. "hell" is a finished word and could be extended to "hell " or "hello". You would have to allow even more chars after a finished word: namely the chars that start a completely new word as soon as the old one is finished.

Are you working with the C++ or Python implementation of WBS?
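The rule described in this comment can be sketched on a toy dictionary. This is only an illustration, not the library's code: it uses a std::set of strings in place of the prefix tree and plain chars in place of label indices, and the names followingChars and allowedNextChars are hypothetical.

```cpp
#include <set>
#include <string>

// Toy stand-in for the prefix tree (assumption: a linear scan over a small
// dictionary). Returns every character that follows `prefix` in some word.
std::set<char> followingChars(const std::set<std::string>& dict, const std::string& prefix)
{
    std::set<char> out;
    for (const std::string& w : dict)
    {
        if (w.size() > prefix.size() && w.compare(0, prefix.size(), prefix) == 0)
        {
            out.insert(w[prefix.size()]);
        }
    }
    return out;
}

// Behaviour as described above: a word can always be continued by the
// characters the dictionary allows; non-word characters are only allowed
// between words or directly after a finished word.
std::set<char> allowedNextChars(const std::set<std::string>& dict,
                                const std::set<char>& nonWordChars,
                                const std::string& text)
{
    std::set<char> res = followingChars(dict, text);
    if (text.empty() || dict.count(text) > 0)
    {
        res.insert(nonWordChars.begin(), nonWordChars.end());
    }
    return res;
}
```

With the dictionary {"hell", "hello", "help"} and the non-word characters ' ' and '.', the text "hell" may be followed by 'o', ' ' or '.', while the unfinished text "hel" may only be followed by 'l' or 'p'. No character that starts a new word is offered after "hell", which is why a separator is forced.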

thetruejacob commented 3 years ago

I'm using the Python implementation - the default used in this repo.

thetruejacob commented 3 years ago

Is there a way to designate a special non-word character to segment words? Such a character should not be available anywhere else in the corpus and so can be easily removed during post-processing. And then the entire dataset will have to be relabeled to insert it at every word boundary - but is there a better, cheaper way?
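A minimal sketch of the separator idea, under stated assumptions: the separator is a single ASCII character (here '|') that never occurs in the corpus, labels are UTF-8 strings, and the helper names joinWithSeparator and stripSeparator are hypothetical. Removing an ASCII byte is safe for UTF-8 text such as Thai, because ASCII bytes never occur inside a multi-byte sequence.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Relabeling step: join already-segmented words with the separator.
std::string joinWithSeparator(const std::vector<std::string>& words, char separator)
{
    std::string label;
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        if (i > 0)
        {
            label.push_back(separator);
        }
        label += words[i];
    }
    return label;
}

// Post-processing step: drop every separator from the decoder output.
std::string stripSeparator(std::string decoded, char separator)
{
    decoded.erase(std::remove(decoded.begin(), decoded.end(), separator), decoded.end());
    return decoded;
}
```

For example, the words คุณ, ต้อง, ระวัง would be relabeled as "คุณ|ต้อง|ระวัง" for training, and stripSeparator would turn a decoded "คุณ|ต้อง|ระวัง" back into the unsegmented sentence. The segmentation of the existing labels into words still has to come from somewhere, which is the relabeling cost raised in the question.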

githubharald commented 3 years ago

> I'm using the Python implementation - the default used in this repo.

ok, so you're using the Python interface of the C++ implementation.

> Is there a way to designate a special non-word character to segment words? Such a character should not be available anywhere else in the corpus and so can be easily removed during post-processing. And then the entire dataset will have to be relabeled to insert it at every word boundary - but is there a better, cheaper way?

yes, that would be a hack that might work. But I would rather try to change the code so that it provides the behavior you need. You would have to change the C++ code in this function, which does exactly what I wrote yesterday: it returns the next possible characters, given the word that is currently being built. https://github.com/githubharald/CTCWordBeamSearch/blob/master/cpp/LanguageModel.cpp#L199

There you would have to add the case when the currently created word is a word from the dictionary - then you would have to also output all characters that are the starting characters of new words.

thetruejacob commented 3 years ago

I'm still a little confused here.

This is what I'm seeing.

// query tree
std::vector<uint32_t> res(m_tree.getNextChars(text));

// if between words or if word is complete, then add non word chars
if (text.empty() || isWord(text))
{
    res.insert(res.end(), m_nonWordLabels.begin(), m_nonWordLabels.end());
}
return res;

When a word is finished, this code already allows for the cases where:

1. the word can be extended by word characters to form a longer word
2. the word can be extended by non-word characters

I'm looking for the case where:

the word can be extended by word characters to form a new word.

So

// query tree
std::vector<uint32_t> res(m_tree.getNextChars(text));

// if between words, then add non word chars; if word is complete, then add all chars
if (text.empty())
{
    res.insert(res.end(), m_nonWordLabels.begin(), m_nonWordLabels.end());
}

if (isWord(text))
{
    res.insert(res.end(), m_allLabels.begin(), m_allLabels.end());
}
return res;

Something like this?

githubharald commented 3 years ago

what you want is:

1. query prefix tree for current text (e.g. prefix "Hell" can be extended to "Hello" -> tree would return "o")
2. check if text is empty, if yes, add non word chars (e.g. numbers, whitespace, ...)
3. check if text is a complete word, if yes, add non word chars and also chars which the tree returns when queried for an empty text

So it should look roughly like this (I did not test it):

// query tree
std::vector<uint32_t> res(m_tree.getNextChars(text));

// text empty: add non word chars
if (text.empty())
{
    res.insert(res.end(), m_nonWordLabels.begin(), m_nonWordLabels.end());
}

// word is finished: add both non word chars and also chars which start new word
if (isWord(text))
{
    res.insert(res.end(), m_nonWordLabels.begin(), m_nonWordLabels.end());

    // query prefix tree for all the first characters of all words
    const std::vector<uint32_t> firstChars = m_tree.getNextChars(std::vector<uint32_t>());
    res.insert(res.end(), firstChars.begin(), firstChars.end());
}
return res;

githubharald commented 3 years ago

now that I look at your code once again I see that it's almost the same as what I have, but you're inserting all possible characters. The problem here is that there might be characters which never start a word, and these should not be included in the possible next chars.
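The difference between adding all characters and adding only word-initial characters can be sketched on a toy dictionary. This is an illustration under assumptions, not the library's code: a std::set of strings replaces the prefix tree, plain chars replace label indices, and the names nextCharsInDict and allowedNextCharsNoSpaces are hypothetical.

```cpp
#include <set>
#include <string>

// Toy stand-in for the prefix tree: characters that follow `prefix` in some
// dictionary word. An empty prefix yields the first characters of all words.
std::set<char> nextCharsInDict(const std::set<std::string>& dict, const std::string& prefix)
{
    std::set<char> out;
    for (const std::string& w : dict)
    {
        if (w.size() > prefix.size() && w.compare(0, prefix.size(), prefix) == 0)
        {
            out.insert(w[prefix.size()]);
        }
    }
    return out;
}

// Proposed rule for languages without word separators: after a finished word,
// allow non-word characters and additionally every character that can start
// a new word. Using std::set also removes duplicates between the two sources.
std::set<char> allowedNextCharsNoSpaces(const std::set<std::string>& dict,
                                        const std::set<char>& nonWordChars,
                                        const std::string& text)
{
    std::set<char> res = nextCharsInDict(dict, text); // continue the current word
    if (text.empty())
    {
        res.insert(nonWordChars.begin(), nonWordChars.end());
    }
    if (dict.count(text) > 0)
    {
        res.insert(nonWordChars.begin(), nonWordChars.end());
        const std::set<char> firstChars = nextCharsInDict(dict, "");
        res.insert(firstChars.begin(), firstChars.end());
    }
    return res;
}
```

With the dictionary {"hell", "hello", "world"}, the finished word "hell" can be followed by 'o' (from the first query, continuing to "hello"), by a non-word character, or by 'h' and 'w' (starting a new word). A character such as 'e' is not offered, because no word starts with it and "helle" is not a prefix of any word.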

thetruejacob commented 3 years ago

Does your code allow for the case of 1. (the finished word can be extended by additional word characters to form a longer word)?

It seems to me that choosing only the first characters of the words in the corpus might not work in a case like this: 'hell' should be extended by 'o' to become 'hello', but there are no words in the corpus beginning with 'o'.

githubharald commented 3 years ago

yes, that gets handled in the first line, as the text contains the current word ("Hell") and then the prefix tree is queried for chars following the prefix (like "o").

But I think the issue is more involved than I first thought: there's code which takes care of remembering the word that is currently being created ("H", "He", "Hel", ...). And this code resets the word to an empty string as soon as a non-word-char is added. This makes sense in languages with word separators. But in your case, the information that a new word starts immediately after another word would somehow have to get from the method LanguageModel::getNextChars to the method Beam::createChildBeam.

That's the code I'm talking about: https://github.com/githubharald/CTCWordBeamSearch/blob/master/cpp/Beam.cpp#L137
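The bookkeeping problem can be sketched in isolation. The following is a hypothetical toy, not the code in Beam.cpp: the function nextWordStates and its helper are invented for illustration, and a std::set of strings again replaces the prefix tree. It shows that without separators a single character can lead to two different "current word" states, so the decoder needs to know which case applies.

```cpp
#include <set>
#include <string>
#include <vector>

// True if some dictionary word starts with `prefix` (or equals it).
bool isPrefixOfWord(const std::set<std::string>& dict, const std::string& prefix)
{
    for (const std::string& w : dict)
    {
        if (w.compare(0, prefix.size(), prefix) == 0)
        {
            return true;
        }
    }
    return false;
}

// Possible values of the remembered word after appending character `c`.
// An empty result means `c` is not allowed at this position.
std::vector<std::string> nextWordStates(const std::set<std::string>& dict,
                                        const std::set<char>& nonWordChars,
                                        const std::string& currentWord,
                                        char c)
{
    std::vector<std::string> states;
    const bool finished = dict.count(currentWord) > 0;
    if (nonWordChars.count(c) > 0)
    {
        // non-word char: reset the word, allowed only between or after words
        if (currentWord.empty() || finished)
        {
            states.push_back("");
        }
        return states;
    }
    // case 1: c continues the word that is currently being built
    if (isPrefixOfWord(dict, currentWord + c))
    {
        states.push_back(currentWord + c);
    }
    // case 2: the current word is finished and c starts a new word
    if (finished && isPrefixOfWord(dict, std::string(1, c)))
    {
        states.push_back(std::string(1, c));
    }
    return states;
}
```

With the dictionary {"a", "as", "so"}, appending 's' to the finished word "a" gives two states: "as" (the word continues) and "s" (a new word "so" begins). In this toy both are returned, which would correspond to creating two child beams; how this should be wired into the real implementation is left open here.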

thetruejacob commented 2 years ago

Could you please expand more on how I can do this or point me in the right direction? I am currently focusing on this task. This would definitely be helpful for my use case for the Thai language, and I'm sure many other users who want to adapt it to certain languages (Thai, Lao, Tibetan, Burmese, Mongolian etc) would also highly appreciate it.

What I'm slightly confused by is why I need to worry about the memory for each new word 'resetting' after a non word character is added - after a word is completed, does it not just start with a new word immediately anyways? Or more importantly, where is the part in the code where a space is inserted before moving on to the next word? I was simply assuming if this piece of code was removed, I would get what I want - a long string of words, and spaces are simply treated as another character. I actually believe I get this behavior with normal beamsearch/bestpath decoder - everything looks nice (but of course less accurate), while the WBS decoder is more accurate but introduces spaces unnecessarily.

Is there a way to use spaces only as a character, indistinguishable from any other character? Where are they introduced?

Thank you again for your continued maintenance on this project.

githubharald / SimpleHTR

Does wordbeamsearch allow for languages without spacing? #122