jermp / tongrams

A C++ library providing fast language model queries in compressed space.
MIT License

lookup() - Segmentation fault when ngram is not in data structure #2

Closed ndvbd closed 5 years ago

ndvbd commented 5 years ago

I am trying to use Tongrams, and to eventually write a python wrapper for it. For now, I created an Eclipse CDT project from the cmake files using: cmake -G "Eclipse CDT4 - Unix Makefiles" ./

I built the data structure (pef_trie) from the test set. Now, when I look up an ngram that is not in it, I get a segmentation fault:

stl_string_adaptor adaptor;
uint64_t value1 = model.lookup("or compilation before it can", adaptor); // Works well
std::cout << value1 << std::endl;
uint64_t value2 = model.lookup("or compilation before it or", adaptor); // Segmentation Fault

jermp commented 5 years ago

That should never happen. I tested the library with ngrams that are not indexed by the data structure, e.g., when computing the perplexity score of a text. Can you run more tests? Are you reading the queries from a file? You can email me at the address in my profile if you prefer. In any case, I will run some sanity checks and try to reproduce that behaviour myself.

jermp commented 5 years ago

Also, try recompiling the code with -DCMAKE_BUILD_TYPE=Release and run the tests again. You may see this exception being thrown: https://github.com/jermp/tongrams/blob/master/vectors/sorted_array.hpp#L111

ndvbd commented 5 years ago

@jermp is it possible to return -1 when ngram is not found in data structure?

jermp commented 5 years ago

Yes, I will implement this. So far, the trie that stores the counts has assumed that ngrams are always found.

jermp commented 5 years ago

Done. See also test_data/queries.not_found for some examples of strings that must not be found after indexing test_data.
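
For readers wondering what "return -1 when the ngram is not found" means for a lookup that returns uint64_t, here is a minimal sketch. It does not use the tongrams API: toy_count_model, kNotFound, make_example_model and the count 7 are all hypothetical names and values invented for illustration, and a std::unordered_map stands in for the trie. The only point is that -1 converted to uint64_t becomes the largest representable value, so the caller has to compare against that sentinel before treating the result as a count.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>

// Hypothetical sentinel: -1 converted to uint64_t, i.e. 2^64 - 1.
constexpr uint64_t kNotFound = std::numeric_limits<uint64_t>::max();

// Toy stand-in for a count-storing model. This is NOT the tongrams trie;
// it only illustrates the sentinel convention discussed in the thread.
class toy_count_model {
public:
    void insert(const std::string& ngram, uint64_t count) {
        // A real count must never collide with the sentinel value.
        assert(count != kNotFound);
        counts_[ngram] = count;
    }

    // Returns the stored count, or kNotFound if the ngram was never indexed.
    uint64_t lookup(const std::string& ngram) const {
        auto it = counts_.find(ngram);
        return it == counts_.end() ? kNotFound : it->second;
    }

private:
    std::unordered_map<std::string, uint64_t> counts_;
};

// Builds a toy model holding one ngram from the thread with a made-up count.
inline toy_count_model make_example_model() {
    toy_count_model model;
    model.insert("or compilation before it can", 7);  // 7 is an invented count
    return model;
}
```

A caller would then write something like `if (model.lookup(query) == kNotFound) { /* handle missing ngram */ }` rather than printing or accumulating the returned value directly. Whether the library exposes the sentinel under a name, or how its real lookup signals a miss, should be checked against the tongrams source.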