Closed: oldyen closed this issue 5 years ago.
Another user already reported this for the parallel mode (#8). I can't reproduce it on my computer, so I can't provide a fix. I would suggest not using the parallel mode in this case.
Contributions are welcome regarding this bug.
Thank you for your reply. BTW, your code is awesome.
I read the report you mentioned, and the difference is that I can't reproduce your result even when I use exactly the same files. I tried both your tf/testCustomOp.py and SimpleHTR + CTCWordBeamSearch.
When I used SimpleHTR + CTCWordBeamSearch, I got a segmentation fault. When I used tf/testCustomOp.py, I got:
Mini example:
Label string: [1 0 3]
Char string: "ba"
Real example:
Label string: [76 78 59 70 66 77 77 0 59 72 77 65 0 70 62 71 77 58 69 0 58 71 61 0
60 72 75 73 72 75 62 58 69 10 0 0 0 0 0 77 65 62 93 93 93 93 93 93
93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93
93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93 93
93 93 93 93]
Char string: "submitt both mental and corporeal, the"
Traceback (most recent call last):
File "testCustomOp.py", line 90, in
testRealExample()
File "testCustomOp.py", line 84, in testRealExample
assert res[1] == 'submitt both mental and corporeal, is far beyond any idea'
AssertionError
and when I print res[1], I get:
Char string: "submitt both mental and corporeal, the"
Do you have any hint?
I used g++ 5.5 and have tried different versions of TensorFlow on AWS.
Thank you
You could try to use the code snippet from the linked issue (see last comment), and then re-compile the custom op.
Other than that, you could create a crash dump file and look at the call stack with gdb. This might give some hint where the crash happens in the C++ code.
To create crash dumps, something like this should work. Enter the following on the command line:
ulimit -c unlimited
sudo sysctl -w kernel.core_pattern=/tmp/core-%e.%p.%h.%t
Then, run the script from the command line and wait until the custom op causes the crash.
A crash dump file should appear in /tmp.
Open it with gdb by specifying the binary (compiled custom op) and the crash dump file:
gdb binaryfile dumpfile
Once in gdb, enter bt to see the function call stack and report it.
Thank you! I think it could be a library issue; I will let you know.
The corpus file causes the problem. When I used the corpus.txt file from SimpleHTR, I got the error, and it works fine if I use a small corpus.txt file, e.g. ctcwordbeamsearch/data/iam/corpus.txt. It may not be out of memory, because I can run the code with SimpleHTR's corpus.txt in single-threaded mode. Do you have any suggestion?
An out-of-memory condition would usually throw a C++ exception, which would terminate the program in a controlled way, probably also with some information printed to the console. So it is most likely something else, which however is triggered more easily by one of the corpus files.
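For illustration, here is a minimal sketch (not code from this repository) of how a genuine out-of-memory condition typically surfaces: operator new throws std::bad_alloc, which can be caught, and which, if left uncaught, terminates the program via std::terminate with a printed diagnostic rather than a segmentation fault. It assumes a 64-bit build.
#include <cstddef>
#include <iostream>
#include <new>

int main()
{
    try
    {
        // deliberately request far more memory than any machine provides
        const std::size_t n = static_cast<std::size_t>(-1) / 2;
        char* p = new char[n];
        delete[] p;
    }
    catch (const std::bad_alloc& e)
    {
        // controlled failure path: the exception can be caught and printed;
        // if it were not caught, the runtime would call std::terminate and
        // typically print a diagnostic instead of crashing with a segfault
        std::cerr << "out of memory: " << e.what() << '\n';
        return 1;
    }
    return 0;
}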
Did you try to get the crash dump file (see above)? Getting the call stack causing the crash would give a good hint where to find the bug.
Thank you for your answer. You are awesome!
Yes, I did. I tried the same file/code/settings and got a different error each time.
Example 1:
traceback_1: _int_malloc () from /lib64/libc.so.6
traceback_k: Beam::getNextWordsSampled(std::shared_ptr
Example 2:
traceback_1: malloc_consolidate () from /lib64/libc.so.6
traceback_2: _int_free () from /lib64/libc.so.6
traceback_k: Beam::handleNGrams(std::shared_ptr
Example 3:
traceback_1: raise () from /lib64/libc.so.6
traceback_2: abort () from /lib64/libc.so.6
traceback_k: PrefixTree::getNextWords (this=0x55ba6b59f030, text=...) at ../src/PrefixTree.cpp:152
Here I only copied the lines that may give a hint for solving the problem. Now, I guess the problem could be (1) a double-freed pointer, maybe in Beam.cpp/PrefixTree.cpp, (2) a different library version, e.g. the CUDA driver (I use 10) or the TensorFlow version (I used 1.13), or (3) a missing flag when compiling/building the TF.so file?
Thank you, that is really helpful! It seems to have something to do with the code that computes the next possible words.
There is a cache in the prefix tree which may be changed by one thread while another thread reads from it. Could you try to deactivate it for a moment, e.g. by removing this line (either delete it or comment it out as shown below), then re-compile and check whether the crash disappears.
// put result for level 1 into cache
if (isLevel1)
{
// m_level1Cache[text[0]] = res; COMMENT OR DELETE THIS LINE
}
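To illustrate the kind of race meant here, a minimal sketch with made-up names (not the repository's actual code): one thread writes into a shared level-1 cache while another reads from it, and standard containers such as std::map give no thread-safety guarantee for that.
#include <map>
#include <string>
#include <thread>

// simplified stand-in for the shared level-1 cache; names are made up
std::map<char, std::string> level1Cache;

void writer()
{
    // one beam-search thread stores a result for a first character ...
    level1Cache['t'] = "the";
}

void reader()
{
    // ... while another thread looks up the same container concurrently.
    // An unsynchronized concurrent write and read on a std::map is a data
    // race (undefined behavior) and can corrupt the heap, which would match
    // the malloc/free frames seen in the call stacks above.
    level1Cache.count('t');
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
    return 0;
}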
Thank you for the quick reply!
Yes, you are right. The code works fine after commenting out this line. But I assume that will reduce the performance of the forecasting (level 1?) in either NGramsForecast or NGramsForecastAndSample mode. Is that correct?
So I added a mutex (https://en.cppreference.com/w/cpp/thread/mutex), i.e.,
mutex.lock();
m_level1Cache[text[0]] = res;
mutex.unlock();
It works fine. However, I have to add the -fpermissive flag when compiling the files. Please let me know if you have a better solution or any suggestion. Thank you.
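For reference, here is a minimal sketch of the same idea written with std::lock_guard (hypothetical class and member names, not the repository's actual code). The guard locks the mutex in its constructor and unlocks it in its destructor, so no manual lock()/unlock() calls are needed and the lock is released even if an exception is thrown.
#include <map>
#include <mutex>
#include <string>

// hypothetical simplified cache with its own mutex; names do not match the repo
class Level1CacheSketch
{
public:
    void put(char firstChar, const std::string& res)
    {
        // the guard releases the mutex on every exit path
        std::lock_guard<std::mutex> lock(m_mutex);
        m_cache[firstChar] = res;
    }

    bool get(char firstChar, std::string& res) const
    {
        // readers must take the same lock, otherwise the data race remains
        std::lock_guard<std::mutex> lock(m_mutex);
        const auto it = m_cache.find(firstChar);
        if (it == m_cache.end())
        {
            return false;
        }
        res = it->second;
        return true;
    }

private:
    mutable std::mutex m_mutex;
    std::map<char, std::string> m_cache;
};

int main()
{
    Level1CacheSketch cache;
    cache.put('t', "the");

    std::string res;
    return cache.get('t', res) ? 0 : 1;
}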
good to hear. I'll fix that tomorrow.
thank you!
@githubharald Hi Mr. Scheidl, I have just come across your README and I have a question. In the "A First Example" section, you mention that the feed matrix has to have shape (TxBxC). What is the meaning of the term B? In most articles I have read about word beam search, the term B refers to the beam width, but in your README I think it is not the beam width. Could you explain it to me? Thank you very much.
@pvlinh143bk: please open a new issue to avoid mixing different topics in one thread. @oldyen: I updated the repo, please let me know if it works for you now.
Works fine, thank you!
I get either a segmentation fault or a corruption error. It works well with Words mode and NGrams mode. Thank you.