kouohhashi opened this issue 4 years ago
I ran into the same issue.
tl;dr - the following line is a possible cause of the issue (though it might not be the ultimate root cause): https://github.com/parlance/ctcdecode/blob/9a20e00f34d8f605f4a8501cc42b1a53231f1597/ctcdecode/src/path_trie.cpp#L44
This line allows the timestep of an existing trie node, somewhere in the trie data structure, to be overwritten in place, which can make the timestep at that position in the decoded sequence suddenly drop.
I replaced it with the following guard logic (which looks to me like a cosmetic workaround, but at least it appears to work):
```cpp
if (child->second->timestep > new_timestep) {
  child->second->timestep = new_timestep;
}
```
I tried to understand the logic behind the code, but was only able to partially follow it. It seems that all suffixes are retained at every timestep (up to the beam width, sorted by score). When an attempt is made to add a new character to an existing beam (one of the trie prefixes), that beam sometimes has children of its own. If one of those children equals the new character, the existing child node is overwritten with a new probability score and a new timestep; however, this can place a wrong timestep at the wrong position in the sequence.
I'm not sure why all suffixes are retained, though... I couldn't fully grasp that part. Perhaps there is a more fundamental bug there...
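To make the failure mode concrete, here is a self-contained toy (hypothetical names; not the upstream PathTrie API) showing how an in-place overwrite can raise a parent node's timestep above its child's, so that the timesteps along a decoded path suddenly drop:

```cpp
#include <iostream>
#include <map>

// Toy trie node (hypothetical, not the upstream PathTrie API), just to
// illustrate the failure mode described above.
struct ToyNode {
  int timestep = -1;
  std::map<int, ToyNode> children;  // keyed by character id
};

// Mimics the behaviour around path_trie.cpp line 44: when an existing
// child is re-reached, its timestep is overwritten with the current frame.
ToyNode& get_child(ToyNode& node, int ch, int t, bool guarded) {
  ToyNode& child = node.children[ch];  // creates the node on first use
  if (child.timestep < 0) {
    child.timestep = t;            // new node
  } else if (!guarded) {
    child.timestep = t;            // unguarded in-place overwrite
  } else if (child.timestep > t) {
    child.timestep = t;            // the guard (never fires when timesteps
  }                                // arrive in non-decreasing order)
  return child;
}

void demo(bool guarded) {
  ToyNode root;
  get_child(root, 'a', 1, guarded);                // frame 1: prefix "a"
  get_child(root.children['a'], 'b', 2, guarded);  // frame 2: prefix "ab"
  get_child(root, 'a', 5, guarded);                // frame 5: "a" re-reached
  std::cout << "a@" << root.children['a'].timestep
            << " b@" << root.children['a'].children['b'].timestep << "\n";
}

int main() {
  demo(false);  // prints a@5 b@2 -> timesteps drop along the path "ab"
  demo(true);   // prints a@1 b@2 -> monotone timesteps
}
```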
@SeanNaren, do you think it's worthwhile to ask the original authors?
fyi - @spakhomov
Note: issue #91 seems to discuss the same thing.
@levhaikin @kouohhashi Useful observations! I'm using the probability distribution over characters at each timestep to extract the timestamps of space characters, in order to derive word alignments from the probability distribution. Unfortunately, on a long sequence (say, a half-hour audio file) the timestamps become more and more skewed the further we get from the beginning of the file.
At the beginning the timestamps seem reasonable, but after even a minute or so they are offset significantly, and the error quickly builds up over time. I don't think a fixed offset correction would help, since the skew isn't consistent. This also impacts beam search for me, since that relies on the timestamps.
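For reference, one way this extraction might look (an illustrative sketch only, assuming a probs[t][c] matrix of per-frame character probabilities and a greedy per-frame argmax; space_id and frame_shift_sec are hypothetical parameters):

```cpp
#include <cstddef>
#include <vector>

// Rough sketch of space-based word alignment (hypothetical helper; assumes
// probs[t][c] holds per-frame character probabilities, space_id is the
// index of the space character, and frame_shift_sec is the frame shift).
std::vector<double> space_timestamps(
    const std::vector<std::vector<float>>& probs,
    int space_id, double frame_shift_sec) {
  std::vector<double> stamps;
  bool in_space = false;
  for (std::size_t t = 0; t < probs.size(); ++t) {
    // Greedy per-frame argmax over the character distribution.
    int best = 0;
    for (std::size_t c = 1; c < probs[t].size(); ++c)
      if (probs[t][c] > probs[t][best]) best = static_cast<int>(c);
    // Record the first frame of each run of space frames.
    if (best == space_id && !in_space) stamps.push_back(t * frame_shift_sec);
    in_space = (best == space_id);
  }
  return stamps;
}
```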
Am I correct in thinking that the issue you guys are facing is related?
@ShantanuNair
Not sure it's directly related...
I have never tried feeding the network and beam decoder such long sequences. A common practice is to first break the long audio into distinct, shorter utterances, and then feed the isolated utterances into the network and decoder. This can be done with speech/silence detection, for example; there are many open-source tools that use simple energy-based techniques to detect silences. If I'm not mistaken, the NVIDIA NeMo toolkit has a neural speech-detection mechanism.
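As a rough illustration of the energy-based approach (a minimal sketch only; real VADs, such as the neural one in NeMo, are far more robust; frame_len and rms_threshold are arbitrary parameters you would tune):

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal energy-based segmentation sketch (illustrative only). Splits the
// samples into (start, end) sample ranges wherever the per-frame RMS energy
// stays below a threshold.
std::vector<std::pair<std::size_t, std::size_t>>
split_on_silence(const std::vector<float>& samples,
                 std::size_t frame_len, float rms_threshold) {
  std::vector<std::pair<std::size_t, std::size_t>> utterances;
  std::size_t start = 0;
  bool voiced = false;
  for (std::size_t i = 0; i + frame_len <= samples.size(); i += frame_len) {
    double energy = 0.0;
    for (std::size_t j = i; j < i + frame_len; ++j)
      energy += samples[j] * samples[j];
    bool frame_voiced = std::sqrt(energy / frame_len) >= rms_threshold;
    if (frame_voiced && !voiced) start = i;                          // begins
    if (!frame_voiced && voiced) utterances.emplace_back(start, i);  // ends
    voiced = frame_voiced;
  }
  if (voiced) utterances.emplace_back(start, samples.size());
  return utterances;
}
```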
Hi,
What about an update like this:
```cpp
// Take a reference to avoid copying the children container
// (std::min_element requires <algorithm>).
auto& children_of_child = child->second->children_;
auto comp = [](const std::pair<int, PathTrie*>& x,
               const std::pair<int, PathTrie*>& y) {
  return x.second->timestep < y.second->timestep;
};
auto first_child = std::min_element(children_of_child.begin(),
                                    children_of_child.end(), comp);
// Only overwrite the timestep if the node has no children, or if even the
// earliest child timestep is not earlier than new_timestep.
if (first_child == children_of_child.end() ||
    first_child->second->timestep >= new_timestep) {
  child->second->timestep = new_timestep;
}
```
The code above should update the timestep only when doing so does not introduce an inconsistency in the final timesteps. Of course, in some cases this requires an additional iteration over all children of a node.
One thing I cannot understand: it seems we always call get_path_trie with non-descending values of new_timestep, so the first child in the children vector should always have the lowest timestep, and more generally children should be kept in order of their timesteps. But I checked, and it is not true. Removing nodes should not change the order or break anything either. Any idea why the children may not be kept in order of their timesteps?
UPDATE: The above condition is equivalent to:
```cpp
if (child->second->children_.empty()) {
  child->second->timestep = new_timestep;
}
```
Since new_timestep is never lower than any previous timestep in the PathTrie, the second part of the condition above, first_child->second->timestep >= new_timestep, can be fulfilled if and only if child->second->timestep == new_timestep == first_child->second->timestep.
For the same reason, I think that @levhaikin's condition if (child->second->timestep > new_timestep) { child->second->timestep = new_timestep; } can never be fulfilled.
Hi,
I ended up reimplementing the whole algorithm from scratch in C++.
I came to the conclusion that merging paths that have the same "squeezed" form loses information that can't be recovered afterwards. The key idea was to write code that holds a list of beams in their "unsqueezed" form at all times. Unsqueezed beams contain repetitions and blanks, so two different beams can have the same squeezed form (i.e., they become identical once blanks and repetitions are removed). For that I implemented a different custom trie class. These beams are merged where possible at every timestep, and the one with the higher probability is kept. When the unfolding through time is finished, we can traverse the beams backwards to get the correct timestep and count for each character; this is possible because blanks and repetitions are preserved in the beam. We can also compute a confidence for each character, because we know where it was. There are some subtleties, but overall this works quite well.
Unfortunately I can't share the code, because it's proprietary at the moment... I would need to go through many legal hoops to make it public.
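I can sketch the general shape of the idea, though. This is an illustrative reconstruction with my own names, not the proprietary code:

```cpp
#include <map>
#include <utility>
#include <vector>

// Beams are kept in their "unsqueezed" form, i.e. with blanks and
// repetitions, and are merged whenever their squeezed forms coincide,
// keeping the higher-probability hypothesis.
constexpr int kBlank = 0;

struct Beam {
  std::vector<int> labels;  // unsqueezed: one label per frame
  double log_prob = 0.0;
};

// Remove blanks and collapse repetitions -> the beam's "squeezed" key.
std::vector<int> squeeze(const std::vector<int>& labels) {
  std::vector<int> out;
  int prev = kBlank;
  for (int l : labels) {
    if (l != kBlank && l != prev) out.push_back(l);
    prev = l;
  }
  return out;
}

// Merge beams sharing a squeezed form, keeping the more probable one.
std::vector<Beam> merge(const std::vector<Beam>& beams) {
  std::map<std::vector<int>, Beam> best;
  for (const Beam& b : beams) {
    auto key = squeeze(b.labels);
    auto it = best.find(key);
    if (it == best.end() || b.log_prob > it->second.log_prob) best[key] = b;
  }
  std::vector<Beam> out;
  for (auto& kv : best) out.push_back(std::move(kv.second));
  return out;
}

// Because blanks and repetitions survive in the winning beam, per-character
// timesteps can be recovered afterwards by a single scan over its frames.
std::vector<std::pair<int, int>> char_timesteps(const Beam& b) {
  std::vector<std::pair<int, int>> out;  // (label, first frame it appears)
  int prev = kBlank;
  for (int t = 0; t < static_cast<int>(b.labels.size()); ++t) {
    int l = b.labels[t];
    if (l != kBlank && l != prev) out.emplace_back(l, t);
    prev = l;
  }
  return out;
}
```

The final scan is what recovers per-character timesteps and counts, which is exactly the information the squeezed representation throws away.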
P.S. While I was at it, I also removed the openfst dependency. It was surprisingly easy to implement a very small dedicated trie class (different from the one above) for holding the word vocabulary, and to make it "unfold" in tandem with the beam-search walk (which is what openfst was used for): ~25 lines of C++ plus a few more lines in the header file. I also removed the boost dependency, as it wasn't used.
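For the curious, a tiny vocabulary trie in that spirit could look like this (illustrative sketch, not the actual code):

```cpp
#include <map>
#include <memory>
#include <vector>

// Sketch of a small dedicated vocabulary trie. Each beam keeps a pointer to
// its current node; an extension stays inside the vocabulary only if step()
// returns a non-null node, so invalid prefixes can be pruned on the fly.
struct VocabTrie {
  std::map<int, std::unique_ptr<VocabTrie>> children;  // char id -> child
  bool is_word_end = false;

  // Insert one word, given as a sequence of character ids.
  void insert(const std::vector<int>& word) {
    VocabTrie* node = this;
    for (int c : word) {
      auto& child = node->children[c];
      if (!child) child = std::make_unique<VocabTrie>();
      node = child.get();
    }
    node->is_word_end = true;
  }

  // Advance in tandem with the beam-search walk; nullptr means the
  // prefix has left the vocabulary.
  const VocabTrie* step(int c) const {
    auto it = children.find(c);
    return it == children.end() ? nullptr : it->second.get();
  }
};
```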
@levhaikin I understand the proprietary issues in sharing code. Anyway, when it becomes possible I would like to see your code. How about efficiency (computation time and memory usage)? Keeping all hypotheses in "squeezed" form seems to be motivated mainly by decreasing memory usage and computation time. Is your solution comparable to the original ctcdecode?
I've updated my comment above about the changes to the existing code.
@panmareksadowski Regarding efficiency, I didn't run back-to-back tests comparing the two codebases, but they generally seem to be in the same ballpark, at least in terms of speed. I can say that I'm able to ingest whole audio files of up to an hour (and more) in duration without any issues. Moreover, I'm able to ingest many such audio files in parallel, e.g. 32 at a time (at least; I didn't go higher due to other processes running in the background): no memory issues and no particular speed issues (compared to my prior experience with speech-to-text). In fact, I couldn't do that with other technologies I used in the past; I always had to break the audio down into segments of 10-20 seconds (usually with some kind of silence-detection logic).
Is there any progress on this issue?
Hi,
I'm using deepspeech.pytorch but am experiencing the offset inconsistency problem. It seems the offsets returned from CTCDecoder are not consistent.
Is there any way to prevent this problem?
https://github.com/SeanNaren/deepspeech.pytorch/issues/483
Thanks in advance.