ufal / nametag

NameTag: Named Entity Tagger
Mozilla Public License 2.0
38 stars 10 forks source link

Conditional jump or move depends on uninitialised value #23

Closed jwijffels closed 1 year ago

jwijffels commented 1 year ago

Hello foxik.

In 2020 I've created an R wrapper around the nametag library. It's on CRAN since June 2020 (https://cran.r-project.org/package=nametagger). Recently the CRAN build system has changed and the build log at CRAN shows that a valgrind issue has appeared at ufal::nametag::utils::lzma::MatchFinder_Create(

==1458363== Memcheck, a memory error detector
==1458363== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1458363== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==1458363== Command: /data/blackswan/ripley/R/R-devel-vg/bin/exec/R --vanilla
==1458363== 
...
> data(europeananews)
> x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
> traindata <- subset(x, sentence_id >  100)
> testdata  <- subset(x, sentence_id <= 100)
> path <- "nametagger-nl.ner" 
> ## Don't show: 
> path <- tempfile("nametagger-nl_", fileext = ".ner")
> traindata <- subset(x, sentence_id >  100 & sentence_id < 300)
> testdata  <- subset(x, sentence_id <= 100)
> ## End(Don't show) 
> opts <- nametagger_options(file = path,
+                            token = list(window = 2),
+                            token_normalisedsuffix = list(window = 0, from = 1, to = 4),
+                            ner_previous = list(window = 2),
+                            time = list(use = TRUE),
+                            url_email = list(url = "URL", email = "EMAIL"))
> ## Don't show: 
> model <- nametagger(x.train = traindata, x.test = testdata,
+                     iter = 1, lambda = 0.5, control = opts)
==1458363== Conditional jump or move depends on uninitialised value(s)
==1458363==    at 0x17E945C7: ufal::nametag::utils::lzma::MatchFinder_Create(ufal::nametag::utils::lzma::CMatchFinder*, unsigned int, unsigned int, unsigned int, unsigned int, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:540)
==1458363==    by 0x17E96768: LzmaEnc_Alloc (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:2984)
==1458363==    by 0x17E96768: ufal::nametag::utils::lzma::LzmaEnc_AllocAndInit(ufal::nametag::utils::lzma::CLzmaEnc*, unsigned int, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) [clone .constprop.0] (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3075)
==1458363==    by 0x17E96C1C: ufal::nametag::utils::lzma::LzmaEnc_MemEncode(void*, unsigned char*, unsigned long*, unsigned char const*, unsigned long, int, ufal::nametag::utils::lzma::ICompressProgress*, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3269)
==1458363==    by 0x17E96D00: ufal::nametag::utils::lzma::LzmaEncode(unsigned char*, unsigned long*, unsigned char const*, unsigned long, ufal::nametag::utils::lzma::CLzmaEncProps const*, unsigned char*, unsigned long*, int, ufal::nametag::utils::lzma::ICompressProgress*, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3293)
==1458363==    by 0x17E96DC3: ufal::nametag::utils::compressor::save(std::ostream&, ufal::nametag::utils::binary_encoder const&) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3320)
==1458363==    by 0x17E87DC6: ufal::nametag::entity_map::save(std::ostream&) const (packages/tests-vg/nametagger/src/nametag/src/ner/entity_map_encoder.cpp:24)
==1458363==    by 0x17E85846: ufal::nametag::bilou_ner_trainer::train(ufal::nametag::ner_ids::ner_id, int, ufal::nametag::network_parameters const&, ufal::nametag::tagger const&, std::istream&, std::istream&, std::istream&, std::ostream&) (packages/tests-vg/nametagger/src/nametag/src/ner/bilou_ner_trainer.cpp:71)
==1458363==    by 0x17E99241: nametag_train(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, double, double, double, double, int, bool, char const*) (packages/tests-vg/nametagger/src/rcpp_nametag.cpp:189)
==1458363==    by 0x17EA3C06: _nametagger_nametag_train (packages/tests-vg/nametagger/src/RcppExports.cpp:63)
==1458363==    by 0x4A3B59: R_doDotCall (svn/R-devel/src/main/dotcode.c:927)
==1458363==    by 0x4A4203: do_dotcall (svn/R-devel/src/main/dotcode.c:1551)
==1458363==    by 0x4DD026: bcEval (svn/R-devel/src/main/eval.c:7567)
==1458363==  Uninitialised value was created by a heap allocation
==1458363==    at 0x48432F3: operator new[](unsigned long) (/builddir/build/BUILD/valgrind-3.21.0/coregrind/m_replacemalloc/vg_replace_malloc.c:714)
==1458363==    by 0x17E961C7: ufal::nametag::utils::lzma::LzmaEnc_Create(ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:2769)
==1458363==    by 0x17E96C7B: ufal::nametag::utils::lzma::LzmaEncode(unsigned char*, unsigned long*, unsigned char const*, unsigned long, ufal::nametag::utils::lzma::CLzmaEncProps const*, unsigned char*, unsigned long*, int, ufal::nametag::utils::lzma::ICompressProgress*, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3283)
==1458363==    by 0x17E96DC3: ufal::nametag::utils::compressor::save(std::ostream&, ufal::nametag::utils::binary_encoder const&) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3320)
==1458363==    by 0x17E87DC6: ufal::nametag::entity_map::save(std::ostream&) const (packages/tests-vg/nametagger/src/nametag/src/ner/entity_map_encoder.cpp:24)
==1458363==    by 0x17E85846: ufal::nametag::bilou_ner_trainer::train(ufal::nametag::ner_ids::ner_id, int, ufal::nametag::network_parameters const&, ufal::nametag::tagger const&, std::istream&, std::istream&, std::istream&, std::ostream&) (packages/tests-vg/nametagger/src/nametag/src/ner/bilou_ner_trainer.cpp:71)
==1458363==    by 0x17E99241: nametag_train(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, double, double, double, double, int, bool, char const*) (packages/tests-vg/nametagger/src/rcpp_nametag.cpp:189)
==1458363==    by 0x17EA3C06: _nametagger_nametag_train (packages/tests-vg/nametagger/src/RcppExports.cpp:63)
==1458363==    by 0x4A3B59: R_doDotCall (svn/R-devel/src/main/dotcode.c:927)
==1458363==    by 0x4A4203: do_dotcall (svn/R-devel/src/main/dotcode.c:1551)
==1458363==    by 0x4DD026: bcEval (svn/R-devel/src/main/eval.c:7567)
==1458363==    by 0x4F595F: Rf_eval (svn/R-devel/src/main/eval.c:1146)
==1458363== 
> ## End(Don't show)

Although in the R package (at https://github.com/bnosac/nametagger) I use commit 598666b5aa2f3ebbb9658976fe06749f551aed02 of nametag, I think the issue is still there in the current version of nametag. Not sure how difficult it is to fix this.

foxik commented 1 year ago

Hi,

the issue is in the public-domain code performing LZMA (de)compression (we use it to compress the parts of the models where it make sense). The same code is also in UDPipe 1, MorphoDiTa, ... .

As you say, I am not sure how difficult will be to find the cause (given that it is someone else's code), but I will try to do it next week (and do minor releases with the fix).

jwijffels commented 1 year ago

That would be great! Thanks for the effort.

foxik commented 1 year ago

Hi,

after investigation, I believe it is just a false positive report by valgrind (the code actually does read an uninitialized value, but the conditional jump is guarded by other conditions, so the program behavior cannot be influenced by the unitialized value). I tried to fix the problem by https://github.com/ufal/cpp_utils/commit/fc215c8d68224ac50f4e9026be15c0953adcc611 (on g++-10 and clang-11 the valgrind report disappeared) and then updated cpp_utils in nametag in the lastest commit 294466f.

Could you please try building the nametag R wrapper to verify the problem is really solved? Thanks!

jwijffels commented 1 year ago

Many thanks for the fixes and your time. I've incorporated the changes and uploaded to CRAN. It's currently under CRAN review. I tried to reproduce the valgrind issue myself but it depends clearly on valgrind 3.21.0 and I don't have an instrumented build of R-devel compiled with valgrind level 2 instrumentation alongside valgrind 3.21.0 to be able to reproduce it.

What I did already got as feedback from CRAN is that I had to replace std::interator as it is deprecated in C++17 and future compilers will not support C++11. I'll make a new issue for this if you don't mind.

I'll keep you updated here if I have more feedback from CRAN.

Found the following significant warnings:
  ./nametag/src/unilib/utf8.h:148:52: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
  ./nametag/src/unilib/utf8.h:180:52: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
  nametag/src/unilib/utf16.h:116:53: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
  nametag/src/unilib/utf16.h:148:53: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
  nametag/src/unilib/utf8.h:148:52: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
  nametag/src/unilib/utf8.h:180:52: warning: 'template<class _Category, class _Tp, class _Distance, class _Pointer, class _Reference> struct std::iterator' is deprecated [-Wdeprecated-declarations]
jwijffels commented 1 year ago

The R package was accepted on CRAN yesterday including the fix for this issue which incorporated your changes of https://github.com/ufal/nametag/commit/294466f7a5e420165801d531614e9cc5646f48af. Yesterday CRAN tested as well with valgrind and the fix solved the problem. Many thanks again!