langgenius / dify-sandbox

A lightweight, fast, and secure code execution environment that supports multiple programming languages
https://docs.dify.ai/development/backend/sandbox
Apache License 2.0

Importing numpy fails after building the amd64 Docker image, running main, and installing numpy #12

Closed dafang closed 3 months ago

dafang commented 3 months ago

I used the amd64 Dockerfile to build the Docker image, installed numpy, and ran the main binary. Running code that imports numpy then gives the following error:

Importing the numpy C-extensions failed.
Original error was: libgcc_s.so.1: cannot open shared object file: No such file or directory
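
One way to confirm which shared object the dynamic loader cannot resolve is to try loading it directly with ctypes. This is only a minimal sketch: the helper name `try_load` is made up for illustration, and the result is meaningful only when the script runs inside the same environment that executes the user code.

```python
import ctypes


def try_load(name: str) -> tuple[bool, str]:
    """Try to load a shared library by name; return (ok, error message)."""
    try:
        ctypes.CDLL(name)
        return True, ""
    except OSError as exc:
        # The loader could not find or open the library.
        return False, str(exc)


# libgcc_s.so.1 is the library named in the error message above.
ok, err = try_load("libgcc_s.so.1")
print("libgcc_s.so.1:", "ok" if ok else f"FAILED: {err}")
```

If the library loads in the container but not inside the sandboxed process, that would suggest the file is simply not visible from the isolated environment.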
dafang commented 3 months ago

the docker file:

FROM python:3.10-slim

RUN apt-get clean && apt-get update && apt-get install -y gcc pkg-config libseccomp-dev wget xz-utils
RUN apt-get install -y gcc-multilib
# copy main binary to /main
COPY main /main
COPY requirements.txt /requirements.txt
COPY conf/config.yaml /conf/config.yaml
RUN rm -rf /var/lib/apt/lists/* \
    && chmod +x /main \
    && pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple jinja2 requests httpx PySocks httpx[socks] \
    && pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt \
    && wget -O /opt/node-v20.11.1-linux-x64.tar.xz https://npmmirror.com/mirrors/node/v20.11.1/node-v20.11.1-linux-x64.tar.xz \
    && tar -xvf /opt/node-v20.11.1-linux-x64.tar.xz -C /opt \
    && ln -s /opt/node-v20.11.1-linux-x64/bin/node /usr/local/bin/node \
    && rm -f /opt/node-v20.11.1-linux-x64.tar.xz

ENTRYPOINT ["/main"]

the requirements.txt:

aiohttp==3.8.6 ; python_version >= "3.10" and python_version < "4.0"
aiohttp[speedups]==3.8.6 ; python_version >= "3.10" and python_version < "4.0"
click==8.1.7 ; python_version >= "3.10" and python_version < "4.0"
markdown==3.5.2 ; python_version >= "3.10" and python_version < "4.0"
pypdf==3.17.4 ; python_version >= "3.10" and python_version < "4.0"
numpy==1.23.5 ; python_version >= "3.10" and python_version < "4.0"
Yeuoly commented 3 months ago

I guess you need to add this shared library to https://github.com/langgenius/dify-sandbox/blob/main/internal/static/config_default_amd64.go. numpy's C-extensions depend on it, but it has not been copied into the isolated environment.

BTW, do other libraries work well?

dafang commented 3 months ago

This partially works.

After adding "/usr/lib/x86_64-linux-gnu/libgcc_s.so.1", numpy can be imported, but with a permission error. I scanned through the code and it seems the import was blocked by seccomp; after I temporarily disabled seccomp, it works. (We run the code interpreter in a serverless environment, so seccomp is not necessary for us, but I will keep trying to figure out which syscall numpy requires.)

Another library I added is "/usr/lib/x86_64-linux-gnu/librt.so.1"; this is the shared object that pydantic depends on.
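
To find out which shared objects a module really pulls in, one option on Linux is to import it and then read /proc/self/maps. The sketch below only parses the text of such a maps file; the function name and the sample lines are hypothetical, not output captured from the sandbox.

```python
import os


def shared_objects_from_maps(maps_text: str) -> list[str]:
    """Return the sorted, de-duplicated .so paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            name = os.path.basename(path)
            if name.endswith(".so") or ".so." in name:
                paths.add(path)
    return sorted(paths)


# Hypothetical sample in the /proc/<pid>/maps format.
sample = (
    "7f00 r-xp 00000000 08:01 11 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1\n"
    "7f01 rw-p 00000000 00:00 0\n"
    "7f02 r--p 00000000 08:01 12 /usr/lib/x86_64-linux-gnu/librt.so.1\n"
    "7f03 rw-p 00000000 00:00 0 [heap]\n"
)
for path in shared_objects_from_maps(sample):
    print(path)
```

On a real system you would run `import numpy` first and then pass `open("/proc/self/maps").read()` to the function, outside the sandbox so that the import succeeds.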

dafang commented 3 months ago
  1. Write a test Python file, for example test_numpy.py, containing a single line that imports numpy:
import numpy as np
  2. Use strace to log all the syscalls:
strace -o strace_output.txt -e trace=all python test_numpy.py
  3. Then use awk, sed, and sort to print the syscalls with their counts:
awk '{print $1}' strace_output.txt | sed 's/[(].*//' | sort | uniq -c | sort -nr

This gave the following list of syscalls, which I diffed against the allowed list, adding the missing ones:

    831 stat
    418 fstat
    393 read
    337 lseek
    278 openat
    250 close
    215 mmap
    180 ioctl
     68 rt_sigaction
     60 mprotect
     54 getdents64
     42 brk
     35 futex
     18 pread64
     17 munmap
      7 clone
      6 lstat
      4 readlink
      3 uname
      3 dup
      2 shmget
      2 getuid
      2 getgid
      2 geteuid
      2 getegid
      2 getcwd
      2 arch_prctl
      1 sysinfo
      1 shmdt
      1 shmat
      1 set_tid_address
      1 set_robust_list
      1 sched_getaffinity
      1 rt_sigprocmask
      1 prlimit64
      1 gettid
      1 fcntl
      1 exit_group
      1 execve
      1 epoll_create1
      1 access
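
The awk/sed/sort pipeline above can also be written in Python, which makes it easier to skip non-syscall lines such as signal and exit markers. This is a rough equivalent, not part of dify-sandbox; the regular expression assumes the usual strace line layout (an optional PID prefix followed by `name(`).

```python
import re
from collections import Counter

# A syscall line starts with the syscall name and "(", optionally preceded by a PID.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Count syscall names in strace output lines, ignoring other lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines in strace format.
sample = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'read(3, "", 832) = 0',
    'read(3, "", 832) = 0',
    "+++ exited with 0 +++",
]
for name, count in count_syscalls(sample).most_common():
    print(f"{count:7d} {name}")
```

To process a real trace, pass an open file object, for example `count_syscalls(open("strace_output.txt"))`.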
dafang commented 3 months ago

Unfortunately, after adding all of these syscalls, I still get an error response:

{"code":0,"message":"success","data":{"error":"OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k error: operation not permitted exit_string: signal: bad system call\n","stdout":""}}

The "bad system call" text is an error message that I added myself.

Still digging into the reason.
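
Note that the response above reports code 0 and "success" at the top level while the failure sits in data.error, so a client has to inspect that field. A small sketch of such a check follows; the function name is hypothetical, and the field names are taken only from the response shown in this thread.

```python
import json


def execution_error(response_text: str) -> str:
    """Return the stripped data.error field of a sandbox response, or "" if absent."""
    payload = json.loads(response_text)
    data = payload.get("data") or {}
    return (data.get("error") or "").strip()


# The response from this thread, kept as a raw string so "\n" stays a JSON escape.
response = r'{"code":0,"message":"success","data":{"error":"OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k error: operation not permitted exit_string: signal: bad system call\n","stdout":""}}'

error = execution_error(response)
if error:
    print("execution failed:", error)
```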

Yeuoly commented 3 months ago

That's bad news. I noticed that some of the syscalls you added already exist in allowed_syscalls, such as SYS_BRK and SYS_OPENAT, so maybe the strace step (strace -o strace_output.txt -e trace=all python test_numpy.py) could be optimized.
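
Filtering out the syscalls that are already allowed can be done with a small set difference. In the sketch below the name-to-number table holds x86_64 syscall numbers for a few names from the strace summary, while the allowed set is a hypothetical stand-in for the sandbox's real allow-list (the thread only confirms that SYS_BRK and SYS_OPENAT are in it).

```python
# x86_64 syscall numbers for a few of the traced names.
SYSCALL_NUMBERS = {
    "read": 0,
    "close": 3,
    "mmap": 9,
    "brk": 12,
    "shmget": 29,
    "shmat": 30,
    "shmdt": 67,
    "openat": 257,
    "prlimit64": 302,
}


def missing_syscalls(traced_names, allowed_numbers):
    """Return sorted (number, name) pairs for traced syscalls not yet allowed.

    Raises KeyError for a name that is not in SYSCALL_NUMBERS.
    """
    allowed = set(allowed_numbers)
    return sorted(
        (SYSCALL_NUMBERS[name], name)
        for name in set(traced_names)
        if SYSCALL_NUMBERS[name] not in allowed
    )


# Hypothetical allow-list containing only brk (12) and openat (257).
for number, name in missing_syscalls(["brk", "openat", "shmget", "shmat", "shmdt"], {12, 257}):
    print(number, name)
```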

Yeuoly commented 3 months ago

Maybe you can refer to https://github.com/langgenius/dify-sandbox/blob/main/cmd/test/fuzz_nodejs_amd64/main.go. You can set the range of syscalls to 0 to 400 on line 57 and see whether errors are raised. If not, all the necessary syscalls are permitted, and you can then narrow the range to 0 to 200 or 200 to 400. Continue this process until you find the syscall that is needed.
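
The narrowing procedure described here is a bisection over syscall numbers. A minimal sketch, assuming exactly one missing syscall lies in the range: the callback `works(lo, hi)` is hypothetical and stands for "rerun the sandboxed script with the baseline allow-list plus numbers lo to hi-1 and report whether it succeeded". Neither name comes from dify-sandbox.

```python
from typing import Callable


def find_required_syscall(lo: int, hi: int, works: Callable[[int, int], bool]) -> int:
    """Bisect [lo, hi) for the single syscall number that makes the run succeed."""
    if not works(lo, hi):
        raise ValueError("the run fails even with the whole range allowed")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(lo, mid):
            hi = mid  # the needed syscall is in the lower half
        else:
            lo = mid  # otherwise it must be in the upper half
    return lo


# Demo with a fake predicate: pretend syscall 318 (a made-up choice) is the one needed.
needed = 318
print(find_required_syscall(0, 400, lambda lo, hi: lo <= needed < hi))
```

With several missing syscalls the same idea still applies, but each one has to be found and added to the baseline before searching for the next.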

dafang commented 3 months ago

Regarding the syscalls that already exist in allowed_syscalls: yes, that is what I did. I had already filtered out the ones you had added. I will try following your test case. Thanks.

dafang commented 3 months ago

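Good starting point. I modified your test.py, but without luck:

  1. At the beginning of the test code I added the allowed syscalls: os.environ["ALLOWED_SYSCALLS"] = ",".join([str(i) for i in range(303)]) (302 is the biggest syscall number).
  2. At the end of the test code I added import numpy as np, which failed with "Bad system call".

Actually, even without the numpy import it still fails with a bad system call. I found that this is caused by the base64 import; after commenting it out, the script succeeds.

I am not sure whether something else also causes it.

My test machine is an alicloud ECS instance:

  • Linux dev-ecs 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • 8 cores, 32 GB memory
  • Python 3.10, in a conda env
  • Go: go version go1.21.6 linux/amd64 (I can build and run the main entry point)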

test.py

import ctypes
import json
import os
import sys
import traceback

os.environ["ALLOWED_SYSCALLS"] = ",".join([str(i) for i in range(303)]) # added by me

# setup sys.excepthook
def excepthook(type, value, tb):
    sys.stderr.write("".join(traceback.format_exception(type, value, tb)))
    sys.stderr.flush()
    sys.exit(-1)

sys.excepthook = excepthook

lib = ctypes.CDLL("/var/sandbox/sandbox-python/python.so")
lib.DifySeccomp.argtypes = [ctypes.c_uint32, ctypes.c_uint32, ctypes.c_bool]
lib.DifySeccomp.restype = None

import json
import os
import sys
import traceback

os.chdir("/var/sandbox/sandbox-python")

lib.DifySeccomp(65537, 1001, 1)

# declare main function here
def main() -> dict:
    return {"message": [1, 2, 3]}

# from base64 import b64decode
from json import dumps, loads

# execute main function, and return the result
# inputs is a dict, and it
# inputs = b64decode("e30=").decode("utf-8")
output = main()

# convert output to json and print
output = dumps(output, indent=4)

result = f"""<<RESULT>>
{output}
<<RESULT>>"""
print(result)
print(os.environ["ALLOWED_SYSCALLS"])

import numpy as np

print(np.version.full_version)

You can try it, @Yeuoly.

dafang commented 3 months ago

Result of the test.py above:

[image]

dafang commented 3 months ago

After debugging some cases, I found what MAY BE bugs:

  1. InitSeccomp is called from the Python prescript.py code. As a result, inside the InitSeccomp func, allowed_syscall := os.Getenv("ALLOWED_SYSCALLS") has no effect and allowed_syscall is empty, so the following logic is bypassed: [image]

  2. Even after changing the hard-coded ALLOW_SYSCALLS to the full list of syscalls, I still run into the "bad system call" error:

    var ALLOW_SYSCALLS = []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302}
dafang commented 3 months ago

Below is the log I printed in the InitSeccomp func:

{"code":0,"message":"success","data":{"error":"OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k\nerror: operation not permitted\nexit_string: signal: bad system call\n","stdout":"2024/07/15 23:55:51 add_seccomp.go:46: [WARN]## allowed syscalls: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302]\n"}}
Yeuoly commented 3 months ago

This logic is for debugging only and will never be used in a production environment. As for base64 encoding, it should work: I have set up CI tests for it and all checks passed. Maybe some syscalls are missing?

Yeuoly commented 3 months ago
Quoting dafang: My testing PC is alicloud ECS:

  • Linux dev-ecs 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • 8 core 32G memory
  • python is 3.10, under conda env
  • Go: go version go1.21.6 linux/amd64 // I can build and run the main entrance

Syscall number 302 is not the highest. There are nearly 400 syscall numbers, but in Go they are only defined up to 302.

dafang commented 3 months ago

You are right. After extending the range to 500, it finally works. The next step, figuring out which syscall is needed, should be the easy part. Thanks.
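
For reference, the value for the ALLOWED_SYSCALLS debug variable used in test.py can be built for any upper bound. This is a small sketch with a hypothetical helper name; note that range(303) stops at 302, so the helper takes the highest number to include.

```python
def allowed_syscalls_value(max_syscall: int) -> str:
    """Build a comma-separated list of syscall numbers 0..max_syscall inclusive."""
    return ",".join(str(n) for n in range(max_syscall + 1))


# Equivalent to the range(303) expression used in test.py.
old_value = allowed_syscalls_value(302)
# A wider range, as in the fix described above (the exact bound is a free choice).
new_value = allowed_syscalls_value(499)
print(len(old_value.split(",")), len(new_value.split(",")))
```

Allowing every number is only useful for narrowing down the missing syscall; it defeats the purpose of the seccomp filter if left in place.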

dafang commented 3 months ago

/close

Yeuoly commented 3 months ago

BTW, are you interested in contributing this to the main branch?

dafang commented 3 months ago

Sure, I will send you a PR later.