ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

WIP | Refactor Docker and Dev Container setup using Buildkit #4392

Open ruffsl opened 1 month ago

ruffsl commented 1 month ago

TBD

mergify[bot] commented 1 month ago

This pull request is in conflict. Could you fix it @ruffsl?

tonynajjar commented 3 weeks ago

@ruffsl just FYI tried to run it and got:

0.367 E: Unable to locate package ros-rolling-nav2-minimal-tb3-sim
0.367 E: Unable to locate package ros-rolling-nav2-minimal-tb4-sim

ruffsl commented 3 weeks ago

@tonynajjar , yeah, looks like we have another un-released dependency back in our underlay.repos file:
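As a rough sketch of the usual workaround, dependencies that aren't released yet get pulled in from source via the repos file instead of apt (the src path below is illustrative, not the repo's exact layout):

vcs import ./src < underlay.repos
rosdep install --from-paths ./src --ignore-src -y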

tonynajjar commented 3 weeks ago

@ruffsl new error

2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-control-msgs/ros-rolling-control-msgs_5.1.0-1noble.20240429.102647_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-hardware-interface/ros-rolling-hardware-interface_4.11.0-1noble.20240514.082551_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-controller-interface/ros-rolling-controller-interface_4.11.0-1noble.20240514.083301_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-diff-drive-controller/ros-rolling-diff-drive-controller_4.8.0-1noble.20240514.114350_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-common-vendor/ros-rolling-gz-common-vendor_0.1.0-1noble.20240503.181130_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-msgs-vendor/ros-rolling-gz-msgs-vendor_0.1.0-1noble.20240503.181547_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-fuel-tools-vendor/ros-rolling-gz-fuel-tools-vendor_0.1.0-1noble.20240503.182511_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-rendering-vendor/ros-rolling-gz-rendering-vendor_0.1.0-1noble.20240507.212408_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-transport-vendor/ros-rolling-gz-transport-vendor_0.1.0-1noble.20240503.182514_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-gui-vendor/ros-rolling-gz-gui-vendor_0.1.0-1noble.20240507.214434_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-sdformat-vendor/ros-rolling-sdformat-vendor_0.1.0-1noble.20240503.181458_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-physics-vendor/ros-rolling-gz-physics-vendor_0.1.0-1noble.20240503.182124_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-sensors-vendor/ros-rolling-gz-sensors-vendor_0.1.0-1noble.20240507.214434_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-gz-sim-vendor/ros-rolling-gz-sim-vendor_0.1.0-1noble.20240507.215704_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
[2024-06-16T16:22:53.607Z] 2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-joint-state-broadcaster/ros-rolling-joint-state-broadcaster_4.8.0-1noble.20240514.114403_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-ros-gz-bridge/ros-rolling-ros-gz-bridge_1.0.0-1noble.20240507.145005_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-ros-gz-image/ros-rolling-ros-gz-image_1.0.0-1noble.20240507.151109_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
2.902 E: Failed to fetch http://packages.ros.org/ros2/ubuntu/pool/main/r/ros-rolling-ros-gz-sim/ros-rolling-ros-gz-sim_1.0.0-1noble.20240507.225051_amd64.deb  404  Not Found [IP: 140.211.166.134 80]
ruffsl commented 3 weeks ago

@tonynajjar, are you partially rebuilding the image from a prior cache? At present, the Dockerfile only runs apt update once for the entire build.

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/.docker/Dockerfile#L28

This speeds up all the apt install steps, allows later layers to be rebuilt offline if the local apt cache has already downloaded the debians, and ensures that all packages installed across the layers originate from the same sync. But if there are debian versions you haven't downloaded locally that no longer exist on the apt repo, then it's probably best to rebuild the apt-update layer so all the following layers are on the same sync.
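For example, a hedged way to force that rebuild so every following layer installs from the same sync (the tooler target name comes up later in this thread; substitute whichever bake target you are building):

docker buildx bake --no-cache tooler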

If the ros repos receive a new sync, then the apt list that was baked into the earlier layers can become stale, pointing to package versions that the ros repos have since purged, since, aside from the ros snapshot repos, older packages are not yet archived.

So, we could either:

- use the ROS snapshot repos to pin against a fixed sync, though while I see there are snapshots for ROS 2 Jazzy, there doesn't seem to be one for Rolling:
- pin the rolling image by image ID/sha to automate cache busting via dependabot, though that needs some more work to complete the upstream docker build automation:

I think I may just go with the ENV ROS_SYNC_DATE= approach in the meantime for the local Dockerfile.
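A sketch of that sync-date idea, assuming ROS_SYNC_DATE is also exposed as a build argument (an assumption; the actual Dockerfile change may wire it differently). Changing the value busts the cache for the apt-update layer and everything after it:

docker buildx bake tooler --set "*.args.ROS_SYNC_DATE=$(date +%Y-%m-%d)"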

tonynajjar commented 3 weeks ago

I see, yes, building without cache fixes it. On to the next error: basically all the nav2 packages are failing to build in the updateContentCommand because of this:

[2024-06-16T17:19:24.311Z] Failed   <<< nav2_velocity_smoother [0.00s, exited with code 1]
[2024-06-16T17:19:24.311Z] colcon cache [13/39 done] [0 ongoing] Starting >>> nav2_costmap_2d
[2024-06-16T17:19:24.312Z] --- stderr: nav2_costmap_2d
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/colcon_core/executor/__init__.py", line 91, in __call__
    rc = await self.task(*args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/colcon_core/task/__init__.py", line 93, in __call__
    return await task_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/colcon_cache/task/lock/dirhash.py", line 179, in lock
    assert lockfile.lock_type == ENTRY_TYPE
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
---

It seems to be because of colcon cache lock

ruffsl commented 3 weeks ago

Looks like you may be trying to combine two different colcon cache lockfiles, one derived from git revision control hashes and one from dirhash, which hashes the files directly. You could try deleting all the colcon cache lockfiles in the colcon build base path, or just delete the workspace volume, or rename it to make a new one, from the dev container config json.
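The same remedies as shell commands, as a rough sketch; the workspace path and volume name below are placeholders:

rm -rf /path/to/ws/build              # drop the colcon build base and its lockfiles
docker volume ls                      # or find the dev container's workspace volume...
docker volume rm <workspace_volume>   # ...and delete it so a fresh one is created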


tonynajjar commented 3 weeks ago

rebuild dev container with --no-cache to ensure all packages are installed from the same sync

With this do you mean "Rebuild Container Without Cache"? It still seems to build with cache. Maybe that's because you're using image instead of Dockerfile in devcontainer.json?

tonynajjar commented 3 weeks ago

Regarding the colcon cache: I cleaned out a bunch of things and it works now. I'll keep an eye out for whether it reproduces as part of a "normal workflow".

Can we somehow have the option to not rebuild the packages, to save time, since the image is built quite often? For me that's a big plus. I guess commenting out the updateContentCommand from the devcontainer would do it? I even think this should be the default. What do you think?

tonynajjar commented 3 weeks ago

bash: /usr/share/colcon_argcomplete/hook/colcon-argcomplete.bash: No such file or directory

FYI

ruffsl commented 3 weeks ago

bash: /usr/share/colcon_argcomplete/hook/colcon-argcomplete.bash: No such file or directory

Yeah, I filed a ticket for that earlier this week. Looks like it may be an upstream packaging issue for jazzy on noble:

Could you confirm by commenting on that ticket using the example?

ruffsl commented 3 weeks ago

I cleaned out a bunch of things and it works now. I'll keep an eye out if it reproduces as part of a "normal workflow".

My guess is that you tried re-using a colcon workspace built with the prior dev container setup. Previously, colcon cache was allowed to use whichever TaskExtensionPoint it preferred, with GitLockTask given preference over DirhashLockTask, as re-using git to check the source state of package directories is faster and allows ignoring files on a per-repo basis via each repo's own .gitignore config. However, it is not as invariant, given that commit SHAs can differ even when the HEAD states are identical at the file system level, e.g. a change commit to a package followed by a revert commit for that change.

Internally, I've been using colcon-cache with projects that use git submodules, which more often run into the cases mentioned above. So, to use only the dirhash approach, I've blocklisted the git task, so that changes are tracked at a per-file level rather than a revision-control-history level. I'm not sure it's still warranted here though, so I may revert this.

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/.docker/Dockerfile#L112

In any case, when you mix & match lockfile types from different tasks, colcon-cache raises an error for such inconsistencies.
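For reference, a sketch of the blocklist mechanism: colcon reads the COLCON_EXTENSION_BLOCKLIST environment variable and skips matching extensions. The exact identifier for the git lock task below is an assumption on my part; the authoritative value is on the Dockerfile line linked above.

export COLCON_EXTENSION_BLOCKLIST=colcon_cache.task.lock.git   # identifier assumed, see linked Dockerfile
colcon cache lock   # with the git task blocked, the dirhash lock task is used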

ruffsl commented 3 weeks ago

Can we somehow have the option to not rebuild the packages to save time since the image is build quite often? For me that's a big plus.

You can build up to any stage in the Dockerfile by passing it as the target name to the bake command. All the stages in the Dockerfile currently have corresponding bake targets in the bake file. E.g. building only up to the tooler stage will not invoke the build directives that then kick off the colcon build commands for the builder stage:

docker buildx bake tooler
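To see which targets the bake file defines and how their options resolve, without building anything:

docker buildx bake --print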

I guess commenting out the updateContentCommand from the devcontainer would do it?

If by "rebuilding", you mean rebuilding the dev container (rather than merely the docker image), then yes, you could also just comment out the updateContentCommand life cycle script, or do what I do and just comment out the final colcon build line in that script.

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/.devcontainer/update-content-command.sh#L59-L62

Then, on startup, the container still prints out ENVs describing the validity of the cached colcon workspace versus the current source checkout of nav2, which I find useful as a reminder of what I need to rebuild, given what has changed since I last rebuilt the workspace inside the currently mounted named volume.

I even think this should be the default. What do you think?

I had it build the workspace by default to onboard novice students with as few steps on their part as possible. All they need to do is start the dev container rebuild and walk away for some coffee, while the script attempts to cache what it can. That's very helpful when someone is just starting out and simply wants to see nav2 in action via a gazebo simulation to know what is possible.

You and I, or other experienced maintainers, can just manually edit the life cycle script to fit our personal preferences and dev container behaviors, then use something like git worktrees to keep track of and check out our own customizations to just the .devcontainer folder:
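For illustration, one way such a worktree setup might look (branch and path names below are hypothetical):

git branch my-devcontainer                       # branch holding your customizations
git worktree add ../nav2-devcontainer my-devcontainer
# edit and commit under ../nav2-devcontainer/.devcontainer/, then overlay it:
git checkout my-devcontainer -- .devcontainer/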

ruffsl commented 3 weeks ago

With this you mean "Rebuild Container Without Cache"? it still seems to build with cache. Maybe because you're using image instead of Dockerfile in devcontainer.json

Well, instead of specifying the Dockerfile in the dev container config, a static docker tag is used to specify which docker image the resulting dev container should run from. That static docker tag is built and tagged by the initializeCommand life cycle script, which in turn calls the docker bake command.

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/.devcontainer/devcontainer.json#L3-L4

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/.devcontainer/initialize-command.sh#L13-L17

This is primarily because most of the advanced buildkit features are more ergonomic to configure via bake files; however, the dev container spec does not yet natively support such bake files, so the initializeCommand provides a suitable workaround while also being much more customizable. E.g. we could add custom logic for how to rebuild the image under different conditions.
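A minimal sketch of that pattern, with placeholder target and tag names rather than the repo's exact values: the initializeCommand pre-builds the image via bake and tags it with the static name that the image field in devcontainer.json expects.

#!/usr/bin/env bash
set -eo pipefail
# build the dev image and give it the tag devcontainer.json points at
docker buildx bake tooler --set "tooler.tags=nav2:devcontainer"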

Which exact stages we want to cache or bust can of course be further controlled via the bake file itself:

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/docker-bake.hcl#L20

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/docker-bake.hcl#L73-L74

https://github.com/ros-navigation/navigation2/blob/18bf8a46cf0c0db8fe06525e06bf967dafb1f26b/docker-bake.hcl#L96-L97
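Cache busting can also be toggled per target from the command line rather than by editing docker-bake.hcl, since no-cache is a standard bake override key:

docker buildx bake tooler --set "tooler.no-cache=true"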

ruffsl commented 23 hours ago

[planner_server-10] [WARN] [1720332420.098450618] [planner_server]: GridBased plugin failed to plan from (-2.00, -0.50) to (100.00, 100.00): "Goal Coordinates of(100.000000, 100.000000) was outside bounds"

@SteveMacenski, is there some kind of floating point precision issue with the bounds check here? Just trying to get the new CI to roll over completely.