jonas-ellert / nearest-smaller-suffixes

Compilation issues #1

Open felipelouza opened 1 month ago

felipelouza commented 1 month ago

Hi Jonas,

I had the following problems when running make check:

/home/felipelouza/nearest-smaller-suffixes/benchmark/test/test_xss.cpp: In lambda function:
/home/felipelouza/nearest-smaller-suffixes/benchmark/test/test_xss.cpp:73:31: error: ‘get’ is not a member of ‘xss’
   73 |   auto pss_and_nss = xss::get<xss::PSS, xss::NSS>(t.data(), t.size());
      |                               ^~~
...
/usr/include/c++/14/bits/stl_pair.h:1250:5: note: ‘std::get’
/home/felipelouza/nearest-smaller-suffixes/benchmark/test/test_xss.cpp:73:40: error: ‘PSS’ is not a member of ‘xss’
   73 |   auto pss_and_nss = xss::get<xss::PSS, xss::NSS>(t.data(), t.size());
      |                                        ^~~
/home/felipelouza/nearest-smaller-suffixes/benchmark/test/test_xss.cpp:74:34: error: ‘get’ is not a member of ‘xss’
   74 |   auto pss_and_lyndon = xss::get<xss::PSS, xss::LYNDON>(t.data(), t.size());
      |                                  ^~~

Do you have any hints on what might be happening?

Thank you!

felipelouza commented 1 month ago

$ g++ --version
g++ (GCC) 14.1.1 20240701 (Red Hat 14.1.1-7)

felipelouza commented 1 month ago

Also, when I try `make benchmark`:

make[3]: *** No rule to make target 'benchmark/src/external/tlx/lib/libtlx.a', needed by 'benchmark/src/benchmark'. Stop.
make[2]: *** [CMakeFiles/Makefile2:388: benchmark/src/CMakeFiles/benchmark.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:395: benchmark/src/CMakeFiles/benchmark.dir/rule] Error 2
make: *** [Makefile:254: benchmark] Error 2

jonas-ellert commented 1 month ago

Can you send me a complete log (everything after cloning the repo)? The missing xss::get seems to be an artifact from a previous version of the project; I'll look into it on Friday. For tlx, does make fetch_tlx succeed?

Cheers!

felipelouza commented 1 month ago

Hi,

The first attempt produced this output: https://gist.github.com/felipelouza/ccd128e3b2ad59b211590f27434e1920

Then I changed the lines #include <tree/...> to #include <xss/tree/...> in the files benchmark/test/test_stack.cpp, benchmark/test/util/check_array.hpp, and benchmark/test/util/check_tree.hpp.

After that, I got this error: https://gist.github.com/felipelouza/4e24d48524b641de9fe1e1a652058bf8

make fetch_tlx worked.

What I would like to do is run the Lyndon array construction algorithm.

Thank you!

jonas-ellert commented 1 month ago

make check should be fine now. For the tlx issue, can you send me the log of make fetch_tlx from a clean build directory (i.e., running only cmake .. and make fetch_tlx)?

Looks like this for me: https://gist.github.com/jonas-ellert/674860a2be1040a7e841abb9513c31e9

In the meantime, I have created a branch 'single-header' with a self-contained implementation of the array construction algorithm, which should be easy to integrate into other projects. I can produce a similar header for the tree construction if needed. Hope this helps!

felipelouza commented 1 month ago

Thank you Jonas,

make check was successful, but make fetch_tlx failed. You can see the details here: https://gist.github.com/felipelouza/4d0fa4a1fe306de08313349a0994a03d.

If I want to build the Lyndon Array, should I write my own code like the example here: https://github.com/jonas-ellert/nearest-smaller-suffixes-mwe/blob/master/main_plain.cpp, or is there another method available?

Cheers!

jonas-ellert commented 1 month ago

The tlx issue should be resolved on the master branch; please check and let me know whether that fixes it (a clean build is required).

The easiest way to build the Lyndon array (LA) is to include this header from the single-header branch:

https://github.com/jonas-ellert/nearest-smaller-suffixes/blob/single-header/include/lyndon_array.hpp

And then use it as in this example:

https://github.com/jonas-ellert/nearest-smaller-suffixes/blob/single-header/example.cpp

Sentinel characters at the beginning and end of the string are needed!
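
Roughly, the calling pattern is something like the sketch below (example.cpp above shows the exact interface; the function name and parameter order in the sketch are only illustrative):

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include "lyndon_array.hpp"  // the single-header file linked above

int main() {
  std::string text = "banana";

  // Wrap the input in sentinel characters that are strictly smaller
  // than every character occurring in the text.
  std::string t;
  t.reserve(text.size() + 2);
  t.push_back('\0');  // front sentinel
  t += text;
  t.push_back('\0');  // back sentinel

  // One Lyndon array entry per character of the sentinel-wrapped input.
  std::vector<uint32_t> lyndon(t.size());

  // Illustrative call only -- check example.cpp for the actual function
  // name, template parameters, and parameter order.
  xss::lyndon_array(t.data(), lyndon.data(), t.size());
}
```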

felipelouza commented 1 month ago

It worked perfectly:

[100%] Built target fetch_tlx

Thank you, @jonas-ellert!

felipelouza commented 1 month ago

Hi Jonas,

I have one more question. I ran LA-plain (https://github.com/felipelouza/nearest-smaller-suffixes/blob/master/main.cpp#L45), but it did not run in O(1) working space as I expected. Is that correct?

Thanks!

jonas-ellert commented 1 month ago

You don't need two vectors if you only want the Lyndon array (rather than PSS + Lyndon); see here:

https://gist.github.com/jonas-ellert/441d61b8bd6c7e7e4316f3ff0339fb3d

The implementation uses O(1) space ;)
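
For reference, the reason a single array suffices: the Lyndon array can be read off from the next-smaller-suffix values as lyndon[i] = nss[i] - i, so one buffer can hold both. A naive quadratic sketch of that relation, useful only for checking results on small inputs (this is not the algorithm from the gist):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Naive O(n^2) reference: nss[i] is the position of the next suffix that
// is lexicographically smaller than suffix i (or n if there is none),
// and the Lyndon array is simply lyndon[i] = nss[i] - i.
std::vector<std::size_t> naive_lyndon(const std::string& t) {
  const std::size_t n = t.size();
  std::vector<std::size_t> a(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t j = i + 1;
    while (j < n && t.compare(j, n - j, t, i, n - i) >= 0) ++j;
    a[i] = j - i;  // nss[i] - i, written directly into the single array
  }
  return a;
}
```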

felipelouza commented 1 month ago

Do you have any idea why it didn't work for https://pizzachili.dcc.uchile.cl/repcorpus/real/coreutils.gz (unzipped)?

$ ./la-plain ~/dataset/pizza/coreutils
Segmentation fault (core dumped)

felipelouza commented 1 month ago

The same happened with: einstein.de.txt, einstein.en.txt, english.001.2, kernel, world_leaders :(

jonas-ellert commented 1 month ago

This is likely because of missing sentinels. They need to be smaller than all characters present in the string; in particular, coreutils contains both the smallest and the largest possible character value, so a slightly more complicated normalization is necessary. See, e.g., here:

https://github.com/jonas-ellert/nearest-smaller-suffixes/blob/master/benchmark/include/file_util.hpp
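
The idea, in a rough sketch (the linked file_util.hpp is the authoritative version; the helper below is only an illustration): remap the distinct byte values to 1, 2, 3, ... by rank, which frees the value 0 for the sentinels. This only works if the text uses at most 255 distinct byte values; otherwise a wider character type is needed.

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative normalization: map the distinct byte values of the text
// to 1..sigma by rank, reserving 0 for the sentinel characters, and
// wrap the result in sentinels.
std::string normalize_with_sentinels(const std::string& text) {
  std::array<std::uint16_t, 256> rank{};  // 0 = byte value not present
  for (unsigned char c : text) rank[c] = 1;
  std::uint16_t sigma = 0;
  for (auto& r : rank) if (r) r = ++sigma;  // ranks 1..sigma
  if (sigma > 255)
    throw std::runtime_error("alphabet too large for a byte alphabet");
  std::string result;
  result.reserve(text.size() + 2);
  result.push_back('\0');  // front sentinel
  for (unsigned char c : text)
    result.push_back(static_cast<char>(rank[c]));
  result.push_back('\0');  // back sentinel
  return result;
}
```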

The algorithm can work without sentinels, but then a slightly more complicated implementation is needed (checking the special corner cases if and only if they actually arise). My best guess is that this would slow down the algorithm by around 10% (there is no change in cache behavior, so the impact should be minor). Either way, for a fair comparison of algorithms, it should either be assumed that none of the algorithms need any standardization of the input, or all of them should get the input in their respective required formats ;)

felipelouza commented 1 month ago

I see. I tried a simple scan over the text, increasing every symbol smaller than 255 by one (leaving 0 free for the sentinels), but the code crashed as well.

Thank you anyway, you helped a lot. The algorithm is really fast; we are trying to reduce the workspace needed to compute the LA (with a semi-external memory approach). Also, the memory we measured for LA-plain was not always about 5n bytes; for some inputs it needed about 6n bytes (for example: cere, fib41, influenza). Any idea why?

All the best!