potassco / clingo

🤔 A grounder and solver for logic programs.
https://potassco.org/clingo
MIT License
589 stars 79 forks source link

Clingo/gringo/clasp: incorrect behavior on Android 11+ #475

Closed mbalduccini closed 6 days ago

mbalduccini commented 7 months ago

Problem description Clingo/gringo/clasp do not work properly on Android 11+. Grounding produces no aspif rules and solving produces empty answer sets even when given non-empty aspif files. The issue is due clingo's use of some memory pointers in a way that is incompatible with ARM Memory Tagging Extension (MTE) support. Details can be found at https://source.android.com/docs/security/test/tagged-pointers.

This affects, for instance, clasp's StrRef (file clasp/src/shared_context.cpp). While StrRef's modification to bit 63 of the allocated pointer is only temporary and does not flow back to the call to free(), StrRef assumes that bit 63 of the pointer returned by malloc() is always 0. That is not guaranteed under Android 11+ and in fact the bit is often set to 1. As a result, clasp gets confused as to what type of data structure is stored in that memory area and ends up displaying empty answer sets. Interestingly, the underlying computation is correct to the best of my knowledge and only the displaying of the answer sets is affected. I have tested this by writing a small test C++ program that calls solve() and handles on its own the task of displaying the answer sets.

Note that this issue is not limited to Android. In principle, it may affect any ARM-based device whose OS adopts MTE.

Symptoms The tests below are run in a termux terminal running on Galaxy Z Fold5, Galaxy S10e and Galaxy Tab S7 with the same results. Clingo was compiled with clang 17.0.6, the stock compiler in termux. So far, I have been unable to build clingo with gcc on Android.

gringo issue

% echo "p." | build/bin/gringo
asp 1 0 0
0

clingo/clasp issue

% echo "p." | build/bin/clingo
clingo version 5.6.2
Reading from stdin
Solving...
Answer: 1

SATISFIABLE

Models       : 1
Calls        : 1
Time         : 0.001s (Solving: 0.00s 1st Model: 0.00s Unsat: 0.00s)
CPU Time     : 0.001s

Possible solutions A few possible solutions exist, but none is entirely pain-free.

Solution 1 (Tested) Use a malloc() replacement such as dmalloc:

  1. Download and install dmalloc (https://dmalloc.com)
  2. Add /path/to/dmalloc/libdmalloc.a to the linker flags when you configure clingo, i.e.:
    cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS="/path/to/dmalloc/libdmalloc.a" -DCMAKE_EXE_LINKER_FLAGS_RELEASE="/path/to/dmalloc/libdmalloc.a"
    cmake --build build
  3. Any application that uses a clingo shared object will need to also be linked against libdmalloc.a

Solution 2 (Tested) Compile clingo statically against the static version of libc found here: https://android.googlesource.com/toolchain/prebuilts/ndk/r17/+/refs/heads/main/sysroot/usr/lib/aarch64-linux-android Needless to say, this solution does not produce shared objects. The steps are similar to Solution 1. I turned off python and lua and I am not sure this solution supports them.

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release -DCLINGO_BUILD_WITH_PYTHON=Off -DCLINGO_BUILD_WITH_LUA=Off -DCLINGO_BUILD_WITH_LUA=Off -DCLINGO_BUILD_SHARED=Off -DCLASP_BUILD_WITH_THREADS=Off -DCMAKE_VERBOSE_MAKEFILE=On -DCMAKE_BUILD_TYPE=release -DCMAKE_EXE_LINKER_FLAGS="-static -L/path/to/ndk/libc/" -DCMAKE_EXE_LINKER_FLAGS_RELEASE="-static -L/path/to/ndk/libc/"
cmake --build build

Solution 3 (Not tested) Building a 32-bit version of clingo should eliminate the problem. See here for a discussion: https://github.com/termux/termux-packages/issues/7332

Solution 4 (Not tested) Another solution provided at https://github.com/termux/termux-packages/issues/7332 relies on disabling tagged pointers via code. I hypothesize that clingo/gringo/clasp could incorporate the code given in this post: https://github.com/termux/termux-packages/issues/7332#issuecomment-1078588433. Possibly a headache for the Potassco team since it introduces architecture/OS-dependent code.

Solution 5 (Not tested) A more seamless solution for users -- but more work for the Potassco team! ;) -- is to modify clingo/gringo/clasp and move all tags to the second byte. Discussion: https://github.com/apple/swift/pull/40779

rkaminsk commented 7 months ago

Just had a quick look. To me it seems that we should actually aim for solution 5. @BenKaufmann, can't we simply use bit 0 instead of 63 here? The bit should be guaranteed to be zero because of alignment. Or am I missing something with the string literals?

mbalduccini commented 7 months ago

To be clear: StrRef is not the only location where Android's memory handling gets in the way. To the best of my knowledge, it is the only location in clasp, but there is something in gringo as well. While I haven't been able to identify the exact spot, statement auto it = atoms_.find(repr.eval(undefined, log)); in libgringo/gringo/domain.hh:486 yields incorrect results (it returns 0) on Android for program p. leading to no ground rules being emitted. And there may be a different issue leading to the lack of #show statements, I just haven't had the time to investigate. However, all of these issues disappear if the Android/bionic malloc() is replaced by a different one whose pointers have the top byte set to 0. (I am happy to share the notes I took while tracing gringo's execution through atoms_.find() if it helps.) To note: by enabling various diagnostic output statements in gringo and tracing through it, I came to the conclusion that, just like clasp, gringo computes the grounding correctly and only fails to emit it.

BenKaufmann commented 7 months ago

Just had a quick look. To me it seems that we should actually aim for solution 5. @BenKaufmann, can't we simply use bit 0 instead of 63 here? The bit should be guaranteed to be zero because of alignment. Or am I missing something with the string literals?

No, the way that ConstString is designed, we can't use the lsb. This is because the idea is that it can be constructed from arbitrary string literals (without copying) and hence with pointers having arbitrary alignment. E.g.

const char* foo = "12345";
auto ok  = ConstString::fromLiteral(foo); // lsb not set
auto bad = ConstString::fromLiteral(foo + 1); // lsb set

The current implementation is indeed broken and not following good practice (see e.g. https://muxup.com/2023q4/storing-data-in-pointers). I currently don't have the time to deep dive into ConstString but from a quick glance it looks like:

Hence, maybe this could be turned into a non-issue by just dropping the (unused?) no-copy optimization.

rkaminsk commented 7 months ago

To be clear: StrRef is not the only location where Android's memory handling gets in the way. To the best of my knowledge, it is the only location in clasp, but there is something in gringo as well. While I haven't been able to identify the exact spot, statement auto it = atoms_.find(repr.eval(undefined, log)); in libgringo/gringo/domain.hh:486 yields incorrect results (it returns 0) on Android for program p. leading to no ground rules being emitted. And there may be a different issue leading to the lack of #show statements, I just haven't had the time to investigate. However, all of these issues disappear if the Android/bionic malloc() is replaced by a different one whose pointers have the top byte set to 0. (I am happy to share the notes I took while tracing gringo's execution through atoms_.find() if it helps.) To note: by enabling various diagnostic output statements in gringo and tracing through it, I came to the conclusion that, just like clasp, gringo computes the grounding correctly and only fails to emit it.

I had a quick look, clingo indeed uses the upper bits for symbols. The good news is that this is going to change in some future release. The bad news is that this is going to take a while.

rkaminsk commented 7 months ago

Solution 1 (Tested) Use a malloc() replacement such as dmalloc:

1. Download and install dmalloc (https://dmalloc.com)
2. Add `/path/to/dmalloc/libdmalloc.a` to the linker flags when you configure clingo, i.e.:
cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS="/path/to/dmalloc/libdmalloc.a" -DCMAKE_EXE_LINKER_FLAGS_RELEASE="/path/to/dmalloc/libdmalloc.a"
cmake --build build
3. Any application that uses a clingo shared object will need to also be linked against `libdmalloc.a`

Are you sure about point 3. The clingo library manages it's memory completely by itself. It should be safe to use the shared object together with other libraries relying on different malloc implementations.

mbalduccini commented 7 months ago

Solution 1 (Tested) Use a malloc() replacement such as dmalloc:

  1. Any application that uses a clingo shared object will need to also be linked against libdmalloc.a

Are you sure about point 3. The clingo library manages it's memory completely by itself. It should be safe to use the shared object together with other libraries relying on different malloc implementations.

I am. Here is a minimal example:

#include <iostream>
#include <clingo.hh>

using namespace Clingo;

int main(int argc, char * const argv[]) {
    try {
        Logger logger = [](Clingo::WarningCode, char const *message) {
            std::cerr << message << std::endl;
        };
        Control ctl{{argv+1, size_t(argc-1)}, logger, 20};

        std::cout << "Ground" << std::endl;
        ctl.add("base", {}, "p.");
        ctl.ground({{"base", {}}});

        std::cout << "Solve" << std::endl;
        clingo_propagator_t prop = {
          NULL,
          NULL,
          NULL,
          NULL,
          NULL,
        };
        if (!clingo_control_register_propagator(ctl.to_c(), &prop, NULL, false))
            throw std::runtime_error("could not register C propagator");

        ctl.solve();
    }
    catch (std::exception const &e) {
        std::cerr << "example failed with: " << e.what() << std::endl;
    }
}

Linked with dmalloc(), everything works fine. Linked without dmalloc(), it terminates with the error that is typical of tagged pointer issues on Android:

Ground
Solve
Pointer tag for 0x9400007100418440 was truncated, see 'https://source.android.com/devices/tech/debug/tagged-pointers'.
Abort

What seems to trigger the problem is the registration of the propagator, or some code that is executed because of it. If you remove the registration, the error disappears. (I created a NULL propagator to keep the program small, but the error occurs even if you provide pointers to valid callback functions.)

rkaminsk commented 7 months ago

I am not sure what is happening here. Maybe further options are required when building the libclingo.so library.

In any case, I can promise that these issues will go away with future releases. Tagging the upper bits of pointers was not a good idea in the first place. There are more and more extensions that increase the virtual address space. We have one machine that supports 57bit virtual addresses (because of some Intel extension).

mbalduccini commented 7 months ago

Great! I am happy to run tests on Android on beta versions, just keep me in the loop. Incidentally, is there a tentative date for the next release?

rkaminsk commented 7 months ago

Great! I am happy to run tests on Android on beta versions, just keep me in the loop. Incidentally, is there a tentative date for the next release?

In the beginning of next year. The wip branch should be quite stable by now. There won't be any patches regarding MTE, however.

rkaminsk commented 6 days ago

Closing this. Future clingo versions won't touch the high bits of pointers anymore. There will still be some time before releases however.