Investigate TVM playing nice with Docker

ninehusky commented 2 years ago

docker build --tag glenside .
docker run glenside cargo test --no-default-features --features tvm

On a clean copy of the repository, the commands above succeed on the first run.

However, subsequent runs of the test suite produce the following output:

failures:

---- codegen::tests::relay_op_softmax stdout ----
thread 'codegen::tests::relay_op_softmax' panicked at 'Running Relay code failed with code Some(1).
stdout:

stderr:
Traceback (most recent call last):
  File "/root/glenside/src/language/from_relay/run_relay.py", line 42, in <module>
    output = relay.create_executor(mod=expr, kind="graph").evaluate()(*inputs)
  File "/root/tvm/python/tvm/relay/backend/interpreter.py", line 172, in evaluate
    return self._make_executor()
  File "/root/tvm/python/tvm/relay/build_module.py", line 395, in _make_executor
    mod = build(self.mod, target=self.target)
  File "/root/tvm/python/tvm/relay/build_module.py", line 277, in build
    tophub_context = autotvm.tophub.context(list(target.values()))
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 116, in context
    if not check_backend(tophub_location, name):
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 158, in check_backend
    download_package(tophub_location, package_name)
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 184, in download_package
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/root/.tvm'
', src/codegen.rs:1836:9

---- codegen::tests::relay_op_relu stdout ----
thread 'codegen::tests::relay_op_relu' panicked at 'Running Relay code failed with code Some(1).
stdout:

stderr:
Traceback (most recent call last):
  File "/root/glenside/src/language/from_relay/run_relay.py", line 42, in <module>
    output = relay.create_executor(mod=expr, kind="graph").evaluate()(*inputs)
  File "/root/tvm/python/tvm/relay/backend/interpreter.py", line 172, in evaluate
    return self._make_executor()
  File "/root/tvm/python/tvm/relay/build_module.py", line 395, in _make_executor
    mod = build(self.mod, target=self.target)
  File "/root/tvm/python/tvm/relay/build_module.py", line 277, in build
    tophub_context = autotvm.tophub.context(list(target.values()))
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 116, in context
    if not check_backend(tophub_location, name):
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 158, in check_backend
    download_package(tophub_location, package_name)
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 184, in download_package
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/root/.tvm/tophub'
', src/codegen.rs:1836:9

---- codegen::tests::relay_op_batchflatten stdout ----
thread 'codegen::tests::relay_op_batchflatten' panicked at 'Running Relay code failed with code Some(1).
stdout:

stderr:
Traceback (most recent call last):
  File "/root/glenside/src/language/from_relay/run_relay.py", line 42, in <module>
    output = relay.create_executor(mod=expr, kind="graph").evaluate()(*inputs)
  File "/root/tvm/python/tvm/relay/backend/interpreter.py", line 172, in evaluate
    return self._make_executor()
  File "/root/tvm/python/tvm/relay/build_module.py", line 395, in _make_executor
    mod = build(self.mod, target=self.target)
  File "/root/tvm/python/tvm/relay/build_module.py", line 277, in build
    tophub_context = autotvm.tophub.context(list(target.values()))
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 116, in context
    if not check_backend(tophub_location, name):
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 158, in check_backend
    download_package(tophub_location, package_name)
  File "/root/tvm/python/tvm/autotvm/tophub.py", line 184, in download_package
    os.mkdir(path)
FileExistsError: [Errno 17] File exists: '/root/.tvm/tophub'
', src/codegen.rs:1836:9

failures:
    codegen::tests::relay_op_batchflatten
    codegen::tests::relay_op_relu
    codegen::tests::relay_op_softmax

test result: FAILED. 300 passed; 3 failed; 8 ignored; 0 measured; 0 filtered out; finished in 33.86s

Clearing the Docker cache and rebuilding the image sometimes fixes the issue, but not reliably, and it's not yet clear why.

We should look into this!

gussmith23 commented 2 years ago

@ninehusky can you see what happens when you run the tests on a single thread? See: https://doc.rust-lang.org/book/ch11-02-running-tests.html#running-tests-in-parallel-or-consecutively

I suspect what is happening is this: cargo test runs tests in parallel, so multiple tests that use TVM start at the same time. The first time TVM is used, it performs some initialization that creates the /root/.tvm/tophub directory (and its parent, /root/.tvm). When several tests trigger this initialization at once, they race to create the directory: one wins, and the others fail with FileExistsError when os.mkdir finds that the directory already exists.

If that's the case, we'll probably need to find a way to trigger that setup before running the tests.
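To make the suspected race concrete, here is a minimal Python sketch. It is not TVM's actual code: `ensure_dir_racy`, `ensure_dir_safe`, and `run_concurrently` are hypothetical names made up for illustration, and the check-then-create pattern is only an assumption about what the old `download_package` might have looked like, based on the `os.mkdir(path)` call in the traceback. It also uses threads in one process, whereas the tests here launch separate Python processes, but the filesystem race is the same idea.

```python
import os
import tempfile
import threading


def ensure_dir_racy(path):
    # Check-then-create: another worker can create `path` between the
    # isdir() check and the mkdir() call, so mkdir() may raise
    # FileExistsError. (Assumed pattern, not copied from TVM.)
    if not os.path.isdir(path):
        os.mkdir(path)


def ensure_dir_safe(path):
    # exist_ok=True tolerates the directory already existing, and
    # makedirs also creates missing parents (e.g. .tvm before tophub).
    os.makedirs(path, exist_ok=True)


def run_concurrently(fn, path, workers=8):
    # Start all workers at (nearly) the same moment and collect any
    # FileExistsError they hit.
    errors = []
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()
        try:
            fn(path)
        except FileExistsError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        racy = run_concurrently(ensure_dir_racy, os.path.join(root, "racy"))
        safe = run_concurrently(
            ensure_dir_safe, os.path.join(root, ".tvm", "tophub")
        )
        # The racy count varies from run to run; the safe count is 0.
        print("racy errors:", len(racy), "safe errors:", len(safe))
```

The racy variant may or may not fail on any given run, which would match the intermittent failures above. Under the same assumption, creating the directory once before the tests start would be another way to sidestep the race.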

gussmith23 commented 2 years ago

Oh, lol, this has already been fixed: https://github.com/apache/tvm/commit/bf20107ffe6e96e20125a2209500668777095337

I was looking at the tophub.py file that raises the error. The current code already seems to anticipate and handle this case, so I checked the git blame and found the commit above, in which someone fixed the issue.

So to fix this issue we should just need to update TVM. This may be an easy fix; I'll give it a go right now.

gussmith23 / glenside

Investigate TVM playing nice with Docker #163