opencog / atomspace

The OpenCog (hyper-)graph database and graph rewriting system
https://wiki.opencog.org/w/AtomSpace

SchemeEval runs smoothly on x64 and i386 but crashes on armv7-a #2944

Open ghost opened 2 years ago

ghost commented 2 years ago

SchemeEval runs smoothly on x64 and i386, but it crashes on armv7-a.

Steps to reproduce bug

  1. Download datomspace-tester.apk

  2. Install datomspace-tester.apk (Do not run!)

  3. Go to Settings -> Apps -> dAtomSpace Tester. Set Storage permission.

  4. Run dAtomSpace Tester

  5. The app runs for about 30 seconds, then crashes. A tombstone file is produced.

  6. View results in datomspace-test.txt file in Download folder

Source code

dAtomSpace Tester

dAtomSpace

SchemeEval.java
...

package com.cogroid.atomspace;

public class SchemeEval extends GenericEval {
...

    public SchemeEval(AtomSpace as) {
        super();
        jni_ptr = jni_init(as.jni_ptr);
    }

    private native long jni_init(long as_jni_ptr);

    // Call before first use.
    public static void init_scheme() {
        jni_init_scheme();
    }

    private static native void jni_init_scheme();

... 
com_cogroid_atomspace_SchemeEval.h
...
JNIEXPORT jlong JNICALL Java_com_cogroid_atomspace_SchemeEval_jni_1init
  (JNIEnv *, jobject, jlong);

...
JNIEXPORT void JNICALL Java_com_cogroid_atomspace_SchemeEval_jni_1init_1scheme
  (JNIEnv *, jclass);

...
com_cogroid_atomspace_SchemeEval.cc
...
JNIEXPORT jlong JNICALL Java_com_cogroid_atomspace_SchemeEval_jni_1init
  (JNIEnv *env, jobject thisObj, jlong as_jni_ptr) {
    opencog::AtomSpace *asp = NULL;
    if (as_jni_ptr != 0) {
        cogroid::SPW<opencog::AtomSpace> *spw_asp = cogroid::SPW<opencog::AtomSpace>::get(as_jni_ptr);
        asp = spw_asp->get();
    }
    cogroid::SPW<opencog::SchemeEval> *spw_se = new cogroid::SPW<opencog::SchemeEval>(asp);
    return spw_se->instance();
}

...
JNIEXPORT void JNICALL Java_com_cogroid_atomspace_SchemeEval_jni_1init_1scheme
  (JNIEnv *env, jclass clz) {
    opencog::SchemeEval::init_scheme();
}

...
Tester.java
...
public void testSchemeEval() {
    try {
        String log = "\n===== SchemeEval =====\n";
        writeLog(log);
        try {
            SchemeEval.init_scheme();
        } catch (Throwable e) {
            writeLog(Loader.me().stackTrace(e));
        }
        AtomSpace pv = new AtomSpace();
        SchemeEval se = new SchemeEval(pv);
        String tmpFolder = new java.io.File(_logFile).getParentFile().getAbsolutePath();
        extractScmFiles();
        java.util.List<String> files = scmFiles();
        for (int i = 0; i < files.size(); i++) {
            String fn = files.get(i);
            String text = readTextFile(fn, tmpFolder);
            try {
                writeLog("----- Eval: " + fn + " -----");
                String rs = "";
                se.begin_eval();
                writeLog("begin_eval();");
                se.eval_expr(text);
                writeLog("eval_expr();");
                rs = se.poll_result();
                writeLog("poll_result();");
                //String rs = se.eval(text);
                writeLog(rs);
            } catch (Throwable e) {
                writeLog(Loader.me().stackTrace(e));
            }
        }
    } catch (Throwable e) {
        writeLog(Loader.me().stackTrace(e));
    }
}
...
linas commented 2 years ago

According to this page: https://www.gnu.org/software/guile/manual/html_node/Compilation.html the *.go files contain CPU-architecture-dependent code. There is a specific --target=target flag on the guild compiler to specify the target architecture. That means, when building guile, you will need to set it up correctly for cross-compilation. I'm sure you did this when compiling the c files, but I am guessing the makefiles did not set the target correctly for the .scm->.go files. (But that is a guess.)

If/when I fix #2945 this might also present cross-compilation challenges. Not sure what to do about this ...

linas commented 2 years ago

I'm also thinking that the __aeabi_idiv0 bug seen earlier is a side-effect of the .go files being architecture-dependent. That is, the .go files contain a kind of RTL (it's either GNU Lightning or a derivative of that) and that RTL ("register transfer language" or "bytecode") is executed on arm7 by calling tiny little arm7 instruction stubs such as __aeabi_idiv0 ... so this again suggests the *.go files need to be recompiled for arm7.

The above is just an educated guess, though. I could be wrong.

ghost commented 2 years ago

I just launched a-jsb.com for running JavaScript in a sandbox with AtomSpace.

linas commented 2 years ago

I just launched a-jsb.com for running JavaScript in a sandbox with AtomSpace.

Wow. Well, that is unexpected! It looks like the execSCM call worked, but I guess that this is an x86 version, and not arm7? I'm still very eager to get the arm7 issues figured out and fixed.

ghost commented 2 years ago

I compiled datomspace-tester.apk with more logging. The files are:

  1. datomspace-tester.apk
  2. opencog/guile/SchemeEval.cc
  3. libguile/eval.c
  4. libguile/init.c
  5. libguile/load.c
  6. libguile/threads.c
  7. libguile/vm.c
  8. Download/datomspace-guile-eval.txt
  9. Download/datomspace-guile-init.txt
  10. Download/datomspace-guile-load.txt
  11. Download/datomspace-guile-vm.txt
  12. Download/datomspace-guile.txt
  13. Download/datomspace-load.txt
  14. Download/datomspace-stderr.txt
  15. Download/datomspace-stdout.txt
  16. Download/datomspace-test.txt

I am stuck at the following error:

At libguile/init.c

At scm_load_startup_files ()

scm_c_primitive_load_path ("ice-9/boot-9");

At libguile/vm.c
...
fprintf(fh_vm, "scm_call_n #26\n");
fflush(fh_vm);

    ret = vm_engines[vp->engine](thread, vp, &registers, resume);

...

    vp->resumable_prompt_cookie = prev_cookie;

fprintf(fh_vm, "scm_call_n #28\n");
fflush(fh_vm);

It repeats "scm_call_n #26", then "scm_call_n #28", then "scm_call_n #26" again several times. After that, it stopped.

linas commented 2 years ago

It repeats "scm_call_n #26", then

Yeah, that's going to be a hard way to debug. Poking through that stuff is like .. debugging assembly code. And anyway, I doubt that is where the bug is. Based on several of your tombstone files, the garbage collector was accessing bad memory, and so the question is "why is it doing that?" So, some background:

When the GC runs, it searches for pointers in all of the stacks and in any malloced RAM it knows about. It is not supposed to search outside of these boundaries. Yet, clearly, this is happening: in the first tombstone, it accessed memory about 300 bytes away from valid RAM, and in the second tombstone, only about 8K away. These offsets are tiny: both fit in fewer than 16 bits. I mean, out of a giant 4GB address space, it didn't access some "random" address, it accessed something really close by.
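The arithmetic behind that observation can be checked with a small sketch. The class and method names here are made up for illustration, and 8192 is only an assumed stand-in for "about 8K"; the real distances would have to be read off the tombstone files.

```java
public class OffsetCheck {
    // True when a byte distance can be stored in an unsigned 16-bit field.
    static boolean fitsInUnsigned16(long distance) {
        return distance >= 0 && distance <= 0xFFFFL;
    }

    // Chance that a uniformly random 32-bit address lands within
    // `distance` bytes on either side of one fixed boundary.
    static double chanceWithin(long distance) {
        return (2.0 * distance) / 4294967296.0; // 2^32 possible addresses
    }

    public static void main(String[] args) {
        // Approximate distances from the two tombstones (assumed values).
        long[] distances = {300L, 8192L};
        for (long d : distances) {
            System.out.printf("%d bytes: fits in 16 bits = %b, random-hit chance = %.2e%n",
                    d, fitsInUnsigned16(d), chanceWithin(d));
        }
    }
}
```

Both distances fit in 16 bits, and a truly random 32-bit address would land that close to a given boundary only a few times in a million, which is why a small miscomputed offset looks more likely than a wild pointer.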

This less-than-16-bit mistake suggests to me that guile is using a 16-bit short for some offset. I am guessing that, due to architecture confusion, this offset is being added instead of subtracted. How could this happen? Here are my guesses:

  1. libgc is broken or miscompiled for arm7. After compiling libgc, did you run the unit tests? Did they all pass?
  2. The guile .go files contain some kind of architecture-dependent code, for example: address-offset info, (indirect addressing), stack-growth direction, endianness ... and the .go files are compiled for some other architecture, and not arm7. Are you building the *.go files on arm7, or are you cross-compiling?
  3. The *.go files contain GNU Lightning RTL. It looks like this:
$ guild disassemble ./srfi/srfi-1.go
  44    (mov 1 7)                                             at srfi/srfi-1.scm:830:11
  45    (handle-interrupts)             
  46    (call 7 2)                      
  48    (receive 4 7 9)                 
  50    (immediate-tag=? 4 3839 4)      ;; false?             at srfi/srfi-1.scm:828:4
  52    (jne 6)                         ;; -> L3
  53    (scm-ref/immediate 5 8 1)                             at srfi/srfi-1.scm:837:17
  54    (mov 4 5)                                             at srfi/srfi-1.scm:837:11
  55    (mov 5 8)                       

The mov and jne and call are translated into arm7 pseudo-assembly: they are calls to functions such as __aeabi_movi and __aeabi_jne and whatever: these are very short subroutines in the arm7 libc.so that are just wrappers for one or two arm7 assembly instructions. It is possible that this translation is incorrect.

I asked the guile gurus about arm7 on IRC chat. They said it works fine on Android. They said "just install guix, you'll see" (guix is a guile-based linux distro.) So, here's how we can check this:

A. Install a terminal emulator on the phone.
B. Run the guile shell on the phone, from the terminal emulator. It's in /sdcard1/something/bin/guile
C. At the guile prompt, run some scheme commands:

(+ 2 2)
(display "hello world\n")
(gc)
(gc-stats)

The (gc) call forces GC to run, and (gc-stats) prints some statistics. All sizes are in bytes, times are in nanoseconds, something like that. All of this should work. If this does NOT work ... then ... let me know.

I could not do this myself, because running /sdcard1/something/bin/guile complained that it was unable to find libsomething.so and so the whole C shared library environment needs to be set up.

If the above does work, then try

(use-modules (opencog))
(Concept "foo")
(gc)
(gc-stats)

If that works, then ???

linas commented 2 years ago

I wrote the above before reading through your files. I'll read your files shortly.

linas commented 2 years ago

again several times. After that, it stopped.

Did it hang, or did it crash? If it hangs, did you look at the CPU usage? Is the CPU usage 100% or 0% -- If it's 100%, then it is probably trying to compile ice-9/boot-9 which could take minutes or hours .. or days?

If it's hung, but there is no CPU usage, then .. ugh. We'd have to use gdb. But first, please check everything I mentioned earlier.
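
One way to tell a hang from a crash from inside the test app is to run the evaluation on a worker thread with a timeout. This is only a rough sketch: HangProbe and probe are hypothetical names, not part of dAtomSpace, and it does not measure CPU usage, which would still have to be checked separately.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangProbe {
    // Runs the task on a daemon worker thread and reports how it ended:
    // "finished", "timed out", "failed: <cause>", or "interrupted".
    static <T> String probe(Callable<T> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "eval-probe");
            t.setDaemon(true); // a stuck task must not keep the process alive
            return t;
        });
        Future<T> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "finished";
        } catch (TimeoutException e) {
            return "timed out"; // still running: a hang or a very slow compile
        } catch (ExecutionException e) {
            return "failed: " + e.getCause(); // a Java-level exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            pool.shutdownNow(); // interrupts the worker; native code may ignore it
        }
    }
}
```

In the tester this could wrap the SchemeEval calls, with the returned string written to the log. A native crash kills the whole process, so in that case no result line is written at all and a tombstone appears instead; a "timed out" line means the process survived and the call was still running.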