kcl-lang / kcl

KCL Programming Language (CNCF Sandbox Project). https://kcl-lang.io
https://kcl-lang.io
Apache License 2.0
1.68k stars 118 forks source link

[Bug] `ExecArtifact` crash on concurrency execution (native API) #1168

Closed bozaro closed 2 months ago

bozaro commented 7 months ago

Bug Report

ExecArtifact crash on parallel execution (serial execution works fine).

1. Minimal reproduce step (Required)

Clone small test (go lang):

git clone https://github.com/bozaro/kcl-issue.git --branch concurrency-exec-issue .
go test .

Test source code:

package kcl_parse_file_issue

import (
    "encoding/json"
    "kcl-lang.io/kcl-go/pkg/service"
    "path"
    "strconv"
    "sync"
    "testing"

    "github.com/stretchr/testify/require"
    "kcl-lang.io/kcl-go/pkg/native"
    "kcl-lang.io/kcl-go/pkg/spec/gpyrpc"
)

func Test(t *testing.T) {
    client := native.NewNativeServiceClient()
    // Compile file
    output := buildProgram(t, client)
    require.NotEmpty(t, output)

    t.Run("serial execution", func(t *testing.T) {
        for i := 0; i < 1000; i++ {
            checkExecute(t, client, output, int64(i))
        }
    })
    t.Run("parallel execution", func(t *testing.T) {
        var wg sync.WaitGroup
        threads := 10
        wg.Add(threads)
        c := make(chan int64, threads*10)
        for i := 0; i < threads; i++ {
            go func() {
                defer wg.Done()
                for id := range c {
                    checkExecute(t, client, output, id)
                }
            }()
        }
        for i := 0; i < 1000; i++ {
            c <- int64(i)
        }
        close(c)
        wg.Wait()
    })
}

func buildProgram(t *testing.T, client service.KclvmService) string {
    tempDir := t.TempDir()
    result, err := client.BuildProgram(&gpyrpc.BuildProgram_Args{
        ExecArgs: &gpyrpc.ExecProgram_Args{
            KFilenameList: []string{"source.k"},
            KCodeList:     []string{`v=option("foo")`},
        },
        Output: path.Join(tempDir, "kcl"),
    })
    require.NoError(t, err, "BuildProgram returns error")
    return result.Path
}

func checkExecute(t *testing.T, client service.KclvmService, output string, id int64) {
    value := strconv.FormatInt(id, 10)
    result, err := client.ExecArtifact(&gpyrpc.ExecArtifact_Args{
        ExecArgs: &gpyrpc.ExecProgram_Args{
            Args: []*gpyrpc.CmdArgSpec{{
                Name:  "foo",
                Value: value,
            }},
        },
        Path: output,
    })
    require.NoError(t, err, "ExecArtifact returns errors")

    var config struct {
        V int64 `json:"v"`
    }
    err = json.Unmarshal([]byte(result.JsonResult), &config)
    require.NoError(t, err, "Can't parse configuration")

    require.Equal(t, id, config.V)
}

2. What did you expect to see? (Required)

All test passed

3. What did you see instead (Required)

Errors like:

        kcl_concurrency_issue_test.go:72: 
                Error Trace:    kcl_concurrency_issue_test.go:72
                                            kcl_concurrency_issue_test.go:36
                                            /usr/lib/go-1.22/src/runtime/asm_amd64.s:1695
                Error:          Received unexpected error:
                                already borrowed: BorrowMutError
                Test:           Test/parallel_execution
                Messages:       ExecArtifact returns errors

Or crash like:

malloc_consolidate(): unaligned fastbin chunk detected
SIGABRT: abort
PC=0x7d947a8969fc m=9 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 14 gp=0xc000232fc0 m=9 mp=0xc000480008 [syscall]:
runtime.cgocall(0x8d1260, 0xc000496c68)
    /usr/lib/go-1.22/src/runtime/cgocall.go:157 +0x4b fp=0xc000496c40 sp=0xc000496c08 pc=0x40a9eb
kcl-lang.io/kcl-go/pkg/native._Cfunc_kclvm_service_call(0x7d941a996d76, 0x4934ce0, 0x7d9410000d50, 0x7d9411005560)
    _cgo_gotypes.go:106 +0x4c fp=0xc000496c68 sp=0xc000496c40 pc=0x8c4c2c
...

4. What is your KCL components version? (Required)

kcl-lang.io/kcl-go v0.8.2

Peefy commented 7 months ago

Sorry, I did not mark on this API that it is indeed not thread safe, this is a feature to be implemented.

bozaro commented 7 months ago

I tried to research this problem a little (and update testcase repo).

It looks like parallel execution breaks down in at least the following cases:

Peefy commented 7 months ago

You are right! They use the same address and global variables to share the KCL runtime, which currently does not support concurrent running. We are working hard to solve this problem, which is a bit tricky, similar to the parallelism issue of Python GIL and JVM, but it is expected that the 0.9 version KCL runtime will support concurrency.

bozaro commented 6 months ago

Since version 0.8.4+, the BuildProgram method has also became non-thread-safe. In version 0.8.3, the BuildProgram successfully works for us with parallelism. Unfortunatelly, I was unable to create an example for stable reproducing BuildProgram concurrency issue.

Peefy commented 6 months ago

Thank you for your feedback. I will investigate it.

I did modify the build logic before two versions, but added locks for different build threads. 😷

bozaro commented 6 months ago

Looks there are some memoty corruption between 0.8.3 and 0.8.4+. I add mutex to every BuildProgram, ParseFile and ExecArtifact call. And application still crash after 0.8.3 -> 0.8.4 upgrade.

Peefy commented 6 months ago

Thank you very much for your feedback. I will carefully review the recent updates to identify factors. It is expected that version 0.8.6 will be released next week.

bozaro commented 6 months ago

Looks there are some memoty corruption between 0.8.3 and 0.8.4+. I add mutex to every BuildProgram, ParseFile and ExecArtifact call. And application still crash after 0.8.3 -> 0.8.4 upgrade.

I found my crashes-after-upgrade root cause: https://github.com/kcl-lang/lib/issues/73 (I have locally applications with differ kcl version).

Peefy commented 5 months ago

Hello @bozaro

https://github.com/kcl-lang/kcl-go/blob/main/pkg/env/env.go#L77

We have introduced an experimental feature gate in kcl-go v0.9.0 beta.1

You can quickly run your code by opening it, without the need to call the BuildArtifact and ExecArtifact combination native APIs. You can directly call the ExecProgram API to achieve the higher performance, and it supports concurrency execution. Welcome to try it out.

bozaro commented 4 months ago

I tried the configuration in KCL_FAST_EVAL mode. It was not possible to carry out a full check, since it breaks on some structures that work without it.

I'll try to make short examples that don't work with KCL_FAST_EVAL, but a little later.

In terms of performance, the same configuration so far produces approximately the following figures:

In general, ExecArtifact on 0.9.0-beta.2 works about 2 times slower compared to 0.8.7. The compilation time of BuildProgram in 0.9.0-beta.2 has also increased.

I had hopes of doing parallel execution, when ExecArtifact is called sequentially within one .so file, but there are pool of these .so files. This trick works for a while, then it starts constantly giving the error already borrowed: BorrowMutError.

bozaro commented 4 months ago

I try regenrate all configuration with KCL_FAST_EVAL=1 on 3f2e611ac8fb11b7371bcbe83d273fbf997455c2 revision (current main branch head) and got the same output as before.

But in multithread mode I also got multiple errors like:

  |
4 |     regex.match(val, "^((http|https)://)[-a-zA-Z0-9@:%._\\+~#?&//=]{2,256}\\.[a-z]{1,6}\\b([-a-zA-Z0-9@:%._\\+~#?&//=]*)$")
  |  already mutably borrowed: BorrowError
  |

Inside our lambda:

import regex

checkHttpUrl = lambda val {
    regex.match(val, "^((http|https)://)[-a-zA-Z0-9@:%._\\+~#?&//=]{2,256}\\.[a-z]{1,6}\\b([-a-zA-Z0-9@:%._\\+~#?&//=]*)$")
}
checkHostPort = lambda val {
    regex.match(val, "^[-a-z0-9._]{2,256}:\\d+$")
}
Peefy commented 4 months ago

Hello @bozaro Thanks for the feedback. Could you please give me a simple case that reproduce the multithread error and I will fix it recently and what the API is native.ExecProgram in Go SDK?

bozaro commented 4 months ago

I try remove all regex package usages and regenerate configuration with KCL_FAST_EVAL=1 on 3f2e611ac8fb11b7371bcbe83d273fbf997455c2 revision in multithread mode. Regeneration without regex usages passed succesfully.

Peefy commented 4 months ago

I try remove all regex package usages and regenerate configuration with KCL_FAST_EVAL=1 on 3f2e611 revision in multithread mode. Regeneration without regex usages passed succesfully.

I see. Thanks. I will try to fix it recently. I guess it's because the current runtime system library shares function addresses, maybe they should be initialized as thread_scope, and I will improve it as soon as possible.

bozaro commented 4 months ago

I try to add small testcases:

Peefy commented 4 months ago

I've given a try to fix it in PR https://github.com/kcl-lang/kcl/pull/1458 and add parallel tests in Rust and it ran successfully, and Go tests also ran successfully.

bozaro commented 4 months ago

I try regenerate all our configuration on https://github.com/kcl-lang/kcl/commit/9915cd47a466f42a8ce2ddb96ae20e56596c0067 revision in multithread mode:

From the outside, looks like that somewhere the compilation has a lock: it goes into single thread even for independent WorkDirs.

Peefy commented 4 months ago

Yes, there's a global build file lock currently to prevent common external module (e.g., kcl mod add k8s) link crash errors.

Peefy commented 3 months ago

JUST FYI

BuildProgram and ExecArtifact API will be annotated deprecated in v0.10.0

bozaro commented 3 months ago

I remove all locks in my code, upgrade to v0.10.0-alpha.1, switch to fast eval and it's works fine in multithread mode.

There is still a race in the plugin API (for example: https://github.com/kcl-lang/kcl-go/blob/e8708330af3f7e010f88dcd5f4062164f1999f6f/pkg/plugin/utils_c_string.go#L22-L36): there is no guarantee that the string will not be freed from the ring buffer before use. But this race has not affects me yet.

Peefy commented 3 months ago

there is no guarantee that the string will not be freed from the ring buffer before use. But this race has not affects me yet.

I see. Thank you! Do you have any good suggestions or changes regarding this? PRs welcome!

bozaro commented 3 months ago

I see. Thank you! Do you have any good suggestions or changes regarding this? PRs welcome!

I think this will require changing the Plugin API:

I once looked at how to add this feature to KCL, but then I lacked a general understanding of how KCL works (how generation, compilation and runtime are related). Perhaps I will try to return to this issue again a little later.

Peefy commented 2 months ago

Because this issue has not been updated for a long time, KCL Plugin can be further restructured and improved. I will close this PR and later open a new issue for tracking. If you have any questions, please feel free to reopen the issue or create a new issue or PR.