upkie / vulp

Robot/simulation switch for the mjbots stack
Apache License 2.0
61 stars 4 forks source link

[macOS] SpineInterface can't access the shared memory #92

Closed ubgk closed 1 month ago

ubgk commented 2 months ago

Bug description

SpineInterface cannot access the shared memory file, and my agent fails:

INFO: Analyzed target //agents/pid_balancer:pid_balancer (4 packages loaded, 13 targets configured).
INFO: Found 1 target...
Target //agents/pid_balancer:pid_balancer up-to-date:
  bazel-bin/agents/pid_balancer/pid_balancer
INFO: Elapsed time: 7.828s, Critical Path: 0.05s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/agents/pid_balancer/pid_balancer -c bullet
[2024-04-30 00:05:39,678] [info] Loading configuration 'config/common.gin' (main.py:78)
[2024-04-30 00:05:39,681] [info] Loading configuration 'config/bullet.gin' (main.py:78)
Waiting for spine /vulp to start...
FileNotFoundError: /vulp
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_bora/ee02e7f8ec3e022d620c305986e67335/execroot/upkie/bazel-out/darwin-opt/bin/agents/pid_balancer/pid_balancer.runfiles/upkie/agents/pid_balancer/main.py", line 105, in <module>
    spine = SpineInterface()
            ^^^^^^^^^^^^^^^^
  File "/private/var/tmp/_bazel_bora/ee02e7f8ec3e022d620c305986e67335/execroot/upkie/bazel-out/darwin-opt/bin/agents/pid_balancer/pid_balancer.runfiles/vulp/vulp/spine/spine_interface.py", line 79, in __init__
    shared_memory = wait_for_shared_memory(shm_name, retries)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/private/var/tmp/_bazel_bora/ee02e7f8ec3e022d620c305986e67335/execroot/upkie/bazel-out/darwin-opt/bin/agents/pid_balancer/pid_balancer.runfiles/vulp/vulp/spine/spine_interface.py", line 53, in wait_for_shared_memory
    raise SpineError(
vulp.spine.exceptions.SpineError: spine /vulp did not respond after 1 attempts

Expected behavior

When launching try_pid_balancer.sh, the agent should be able to access the shared-memory file and run normally.

Reproduction steps

Steps to reproduce the behavior:

  1. Run try_pid_balancer.sh on macOS (It works on Linux 20.04).

Additional context

The problem disappears once I checkout an earlier version of the spine_interface.py file, all other things remaining the same (e.g. git checkout b179d224da0468a93b6a9a051f6137b70f86498c -- vulp/spine/spine_interface.py).

I am creating this issue here as it seems to be caused by vulp.

I suspect the issue is related to the standard library (or the way we use it), as posix_ipc works just fine.

System

stephane-caron commented 1 month ago

It likely follows from https://github.com/upkie/vulp/releases/tag/v2.2.0 where we switched from posix_ipc to the Python standard library.

I'm surprised your log only attempts once to wait for the spine: in the current PID balancer, it should try 10 times. Maybe try increasing to retries=10?

ubgk commented 1 month ago

I think the problem is inconsistent handling of the POSIX naming standard on different platforms (and the sloppy handling of it by Python).

On Linux, shm_open('foo', ...), shm_open('\foo', ...) and shm_open('\\foo', ...) will all be resolved to the same object. (Is it because /dev/shm/foo == /dev/shm//foo ?)

It is, however, not the case on macOS, the object names should be an exact match. (Similarly, there is no /dev/shm/foo on macOS).

It seems that the standard SharedMemory module prepends the name with a slash: link. So when SpineInterface tries to open /vulp, it is in fact trying to open //vulp (which is OK on Linux, but not on macOS).

The simplest solution I can think of is to get rid of all leading slashes in our Python files. What do you think?

ubgk commented 1 month ago

I think the case makes itself for adding integration tests. 😅

Should I add try_pid_balancer.sh to the CI of upkie?

stephane-caron commented 1 month ago

Also, unit tests :wink: Would it be hard to make one for this particular case?

An integration test sounds good too :+1: We could add an option to try_pid_balancer.sh (forwarded to the Bullet spine) to stop after a fixed number of iterations, so that the test can run in an asynchronous and deterministic way. Can you propose a PR for this?

ubgk commented 1 month ago

Also, unit tests 😉 Would it be hard to make one for this particular case?

I cannot think of a very good case right off the bat. Maybe a Python test which creates a mock spine and tries to attach to it through SpineInterface could work, what do you think?

An integration test sounds good too 👍 We could add an option to try_pid_balancer.sh (forwarded to the Bullet spine) to stop after a fixed number of iterations, so that the test can run in an asynchronous and deterministic way. Can you propose a PR for this?

What you propose is an option and it should be easy to implement. I think a KISS alternative would be to just add a step like timeout N ./try_pid_balancer.sh (with a build step before, obviously). If the process quits for anything other than a timeout, we'll know that way. Does that sound good?

stephane-caron commented 1 month ago

Hmm, I would be wary of using timeout in a CI workflow. Things timing-related tend to break in CI runners, where time can be quite elastic. For instance here, the script may time out out before doing anything useful, with the test passing as a false positive.

Maybe a Python test which creates a mock spine and tries to attach to it through SpineInterface could work, what do you think?

That's what the test in spine_interface_test.py does :see_no_evil: The reason it passed is that it already had the correct syntax for macOS:

https://github.com/upkie/vulp/blob/870bb20cd8fecc3a890968c54f6159eedc5c9f8c/vulp/spine/tests/spine_interface_test.py#L30