[bug report] libdeepspeech Android JNI crash in UTF-8 string conversion with zh-cn model

park671 commented 1 week ago

Unmodified clone of the example repository.
OS Platform: Samsung SM-M325FV / Redmi K30, both Android 13
Code version: libdeepspeech-0.9.3 (from GitHub release)
Model version: deepspeech-0.9.3-models-zh-cn
TensorFlow installed from: Android system TF-Lite
TensorFlow version: -
Python version: -
Bazel version: -
GCC/Compiler version: -
CUDA/cuDNN version: -
GPU model and memory: -
Exact command to reproduce: -

park671 commented 1 week ago

Suggested bug fix:

https://stackoverflow.com/questions/60722231/jni-detected-error-in-application-input-is-not-valid-modified-utf-8-illegal-st

You cannot use NewStringUTF for this, you will have to manually decode from UTF-8.

We will do the equivalent of Charset.forName("UTF-8").decode(bb).toString() as follows, where each paragraph roughly implements one step, and the last sets your object field to the result:

// Wrap the raw UTF-8 bytes in a direct ByteBuffer (the bytes are not copied).
jobject bb = env->NewDirectByteBuffer((void *) cStringValue, strlen(cStringValue));

// Charset charset = Charset.forName("UTF-8");
jclass cls_Charset = env->FindClass("java/nio/charset/Charset");
jmethodID mid_Charset_forName = env->GetStaticMethodID(cls_Charset, "forName", "(Ljava/lang/String;)Ljava/nio/charset/Charset;");
jobject charset = env->CallStaticObjectMethod(cls_Charset, mid_Charset_forName, env->NewStringUTF("UTF-8"));

// CharBuffer cb = charset.decode(bb);
jmethodID mid_Charset_decode = env->GetMethodID(cls_Charset, "decode", "(Ljava/nio/ByteBuffer;)Ljava/nio/CharBuffer;");
jobject cb = env->CallObjectMethod(charset, mid_Charset_decode, bb);

// String str = cb.toString(); C++ needs the explicit cast from jobject to jstring.
jclass cls_CharBuffer = env->FindClass("java/nio/CharBuffer");
jmethodID mid_CharBuffer_toString = env->GetMethodID(cls_CharBuffer, "toString", "()Ljava/lang/String;");
jstring str = (jstring) env->CallObjectMethod(cb, mid_CharBuffer_toString);

// Store the decoded string in the Java object's field.
env->SetObjectField(jPosRec, myJniPosRec->_myJavaStringValue, str);

park671 commented 1 week ago

Bug fix log 1:

The Stack Overflow answer above may be misleading. The likely cause of this bug is that the C++ JNI interface autogenerated by SWIG does not handle the incomplete UTF-8 strings that the module produces, and crashes during type conversion.
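One way to confirm this diagnosis is to check the returned bytes before they reach the JNI layer. The following is a minimal sketch, not part of DeepSpeech or SWIG; the helper names utf8_sequence_length and utf8_is_structurally_complete are made up for illustration. It only checks that every lead byte is followed by the expected number of continuation bytes, and does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the UTF-8 sequence announced by a lead
 * byte, or 0 if the byte cannot start a sequence. */
size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;          /* 0xxxxxxx */
    if ((lead >> 5) == 0x06) return 2;  /* 110xxxxx */
    if ((lead >> 4) == 0x0E) return 3;  /* 1110xxxx */
    if ((lead >> 3) == 0x1E) return 4;  /* 11110xxx */
    return 0;                           /* continuation or invalid byte */
}

/* Hypothetical helper: true if every sequence in s has all of its
 * continuation bytes (10xxxxxx) before the terminating NUL. */
bool utf8_is_structurally_complete(const char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_sequence_length((unsigned char) s[i]);
        if (n == 0 || i + n > len) {
            return false;  /* stray byte, or sequence cut off by the end */
        }
        for (size_t k = 1; k < n; k++) {
            if (((unsigned char) s[i + k] & 0xC0) != 0x80) {
                return false;  /* expected a continuation byte */
            }
        }
        i += n;
    }
    return true;
}
```

For example, the six bytes E4 BD A0 E5 A5 BD ("你好") pass the check, while the same string with its last byte removed fails it. If the value returned by DS_IntermediateDecode fails a check like this right before a crash, that would support the incomplete-sequence explanation.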

park671 commented 1 week ago

Bug fixed! Here is my solution:

As I mentioned above, the bug is in the JNI translation layer autogenerated by SWIG, which crashes on a string that ends in an incomplete UTF-8 sequence (most Chinese characters take 3 bytes in UTF-8, so an intermediate result can stop partway through one). I fixed it by using an inline hook to replace the char* returned by DS_IntermediateDecode with a copy in which the incomplete trailing character is truncated.
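To make the 3-byte point concrete: code points from U+0800 to U+FFFF, which include the CJK Unified Ideographs block (U+4E00 to U+9FFF), are encoded as three bytes in UTF-8. The sketch below is a plain UTF-8 encoder written for illustration; utf8_encode is a hypothetical name and not a DeepSpeech function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-8 into out,
 * which must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;  /* surrogates are not valid scalar values */
        }
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+4F60 ("你") with this function yields the three bytes E4 BD A0. A string that ends after E4 or after E4 BD therefore ends inside a character, which is the situation the fix has to handle.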

Code for proxy_DS_IntermediateDecode:

void *proxy_DS_IntermediateDecode(void *aSctx) {
    LOG("proxy_DS_IntermediateDecode(): aSctx addr=%p", aSctx);
    char *result = (char *)((DS_IntermediateDecode) orig_DS_IntermediateDecode)(aSctx);
    /* Nothing to repair for a NULL or empty result: pass it through unchanged. */
    if (result == NULL || result[0] == '\0') {
        origin_string = NULL;
        return result;
    }
    int len = (int) strlen(result);
    /* Remember the original string so proxy_DS_FreeString can release it later. */
    origin_string = result;
    LOG("proxy_DS_IntermediateDecode(): strlen=%d", len);
    char *complete_utf8_string = get_complete_utf8_string(origin_string, len);
    if (complete_utf8_string != NULL) {
        LOG("proxy_DS_IntermediateDecode(): origin=%s --> complete=%s", origin_string, complete_utf8_string);
    }
    return complete_utf8_string;
}

Code for proxy_DS_FreeString:

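void proxy_DS_FreeString(char *str) {
    LOG("proxy_DS_FreeString(): %s", str != NULL ? str : "(null)");
    if (origin_string != NULL) {
        /* str is the truncated copy made with malloc() in
         * get_complete_utf8_string, so release it with free() and hand the
         * string that DeepSpeech allocated back to the original DS_FreeString. */
        free(str);
        ((DS_FreeString) orig_DS_FreeString)(origin_string);
        origin_string = NULL;
    } else {
        /* No copy was made: str came straight from DeepSpeech. */
        ((DS_FreeString) orig_DS_FreeString)(str);
    }
}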

Declared field: char *origin_string = NULL;

Code for get_complete_utf8_string:

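#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* malloc */
#include <string.h>  /* memcpy */

/*
 * Returns a newly malloc'd copy of input that stops before an incomplete
 * UTF-8 sequence at the end. Bytes that cannot start a sequence are skipped.
 * Continuation bytes are not validated. The caller must free() the result.
 */
char *get_complete_utf8_string(const char *input, int input_length) {
    char *output = (char *) malloc(input_length + 1);
    if (output == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return NULL;
    }

    int i = 0;
    int output_index = 0;
    while (i < input_length) {
        /* The lead byte announces how many bytes the character occupies. */
        unsigned char lead = (unsigned char) input[i];
        int char_size = 0;
        if (lead < 0x80) {                 /* 0xxxxxxx */
            LOG("1byte utf8");
            char_size = 1;
        } else if ((lead >> 5) == 0x6) {   /* 110xxxxx */
            LOG("2byte utf8");
            char_size = 2;
        } else if ((lead >> 4) == 0xE) {   /* 1110xxxx */
            LOG("3byte utf8");
            char_size = 3;
        } else if ((lead >> 3) == 0x1E) {  /* 11110xxx */
            LOG("4byte utf8");
            char_size = 4;
        } else {
            /* Stray continuation byte or invalid lead byte: drop it. */
            i++;
            continue;
        }
        /* The character would run past the end of the input: stop here. */
        if (i + char_size > input_length) {
            LOG("incomplete utf8!");
            break;
        }
        memcpy(output + output_index, input + i, char_size);
        output_index += char_size;
        i += char_size;
    }
    output[output_index] = '\0';
    return output;
}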
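A possible simplification, sketched here under the assumption that the buffer returned by DS_IntermediateDecode is writable (it is handed back as a plain char*): truncate the string in place instead of copying it. With no second allocation there is no extra pointer to track, so the DS_FreeString hook and the origin_string field may not be needed. The function names below are hypothetical, and only the tail is inspected; the rest of the string is assumed to be well-formed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the UTF-8 sequence announced by a lead
 * byte, or 0 if the byte cannot start a sequence. */
size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    if ((lead >> 3) == 0x1E) return 4;
    return 0;
}

/* Hypothetical helper: cuts an incomplete trailing UTF-8 sequence off s
 * in place and returns the resulting length. */
size_t utf8_trim_incomplete_tail(char *s) {
    assert(s != NULL);
    size_t len = strlen(s);
    size_t i = len;
    size_t cont = 0;
    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && cont < 3 && ((unsigned char) s[i - 1] & 0xC0) == 0x80) {
        i--;
        cont++;
    }
    if (i == 0) {
        return len;  /* empty, or no lead byte found: leave unchanged */
    }
    /* s[i - 1] should be the lead byte of the last character. */
    size_t need = utf8_sequence_length((unsigned char) s[i - 1]);
    if (need > cont + 1) {
        s[i - 1] = '\0';  /* fewer bytes present than the lead announces */
        return i - 1;
    }
    return len;
}
```

Called on the five bytes E4 BD A0 E5 A5, this leaves E4 BD A0 and returns 3. A string that already ends on a character boundary is returned unchanged.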
