regro / cf-scripts

Flagship repo for cf-regro-autotick-bot
Other
47 stars 72 forks source link

petsc rerender error #2561

Open minrk opened 4 months ago

minrk commented 4 months ago

rerenders of petsc have started to fail, I believe starting with the merge of https://github.com/conda-forge/petsc-feedstock/pull/193, which adds cuda builds.

The bot error doesn't show any actual errors and a manual rerender with the migrator succeeds, so I suspect it could be a memory exhaustion issue.

bot error ``` bot error ( [bot CI job](https://github.com/regro/cf-scripts/actions/runs/8986224807) ): main: Traceback (most recent call last): File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/auto_tick.py", line 1271, in _run_migrator migrator_uid, pr_json = run( ^^^^ File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/auto_tick.py", line 284, in run rerender_msg = rerender_feedstock(feedstock_dir, timeout=900) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/rerender_feedstock.py", line 49, in rerender_feedstock return rerender_feedstock_containerized( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/rerender_feedstock.py", line 102, in rerender_feedstock_containerized data = run_container_task( ^^^^^^^^^^^^^^^^^^^ File "/home/runner/work/cf-scripts/cf-scripts/cf-scripts/conda_forge_tick/utils.py", line 229, in run_container_task raise RuntimeError( RuntimeError: Error running rerender-feedstock in container - error RuntimeError("Failed to rerender.\noutput: Adding in variants from internal_defaults\nAdding in variants from /tmp/tmp76lv6jme/conda-smithy/conda_build_config.yaml\nAdding in variants from /tmp/tmpiph3pba7/petsc-feedstock/recipe/conda_build_config.yaml\nAdding in variants from argument_variants\nINFO:conda_smithy.configure_feedstock:Downloading conda-forge-pinning-2024.05.07.12.03.23\nINFO:conda_smithy.configure_feedstock:Extracting conda-forge-pinning to /tmp/tmp76lv6jme/conda-smithy\nINFO:conda_smithy.configure_feedstock:__pycache__ rendering is skipped\nINFO:conda_smithy.configure_feedstock:README rendering is skipped\nINFO:conda_smithy.configure_feedstock:suitesparse7.yaml from feedstock is ignored and upstream version is used\nINFO:conda_smithy.configure_feedstock:mumps_mpi571.yaml from feedstock is ignored and upstream version is used\nWARNING: Setting build platform. This is only useful when pretending to be on another platform, such as for rendering necessary dependencies on a non-native platform. I trust that you know what you're doing.\nWARNING: Setting build arch. This is only useful when pretending to be on another arch, such as for rendering necessary dependencies on a non-native arch. I trust that you know what you're doing.\nWARNING: No numpy version specified in conda_build_config.yaml. Falling back to default numpy value of 1.23\nINFO:conda_smithy.configure_feedstock:Applying migrations: /tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\n/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' is being passed through with value 'x86_64-linux'. If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.\n warnings.warn(\nINFO:conda_smithy.configure_feedstock:Applying migrations: /tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\n/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' is being passed through with value 'sbsa-linux'. If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.\n warnings.warn(\nINFO:conda_smithy.configure_feedstock:Applying migrations: /tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\n/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' is being passed through with value 'ppc64le-linux'. If you are splitting build and test phases with --no-test, please ensure that this value is also set similarly at test time.\n warnings.warn(\n\n") raised: cmd: ['docker', 'run', '-e', 'CF_TICK_IN_CONTAINER=true', '--security-opt=no-new-privileges', '--read-only', '--cap-drop=all', '--mount', 'type=tmpfs,destination=/tmp,tmpfs-mode=1777,tmpfs-size=6000000000', '-m', '6000m', '--cpus', '1', '--ulimit', 'nofile=1024:1024', '--ulimit', 'nproc=2048:2048', '--rm', '-i', '--mount', 'type=bind,source=/tmp/tmp2sh_59zq,destination=/cf_tick_dir', 'ghcr.io/regro/conda-forge-tick:master', '/opt/conda/envs/cf-scripts/bin/python', '-u', '/opt/autotick-bot/docker/run_bot_task.py', 'rerender-feedstock', '--timeout', '900', '--log-level', 'info'] stdout: {'data': None, 'error': 'RuntimeError("Failed to rerender.\\noutput: Adding in variants from ' 'internal_defaults\\nAdding in variants from ' '/tmp/tmp76lv6jme/conda-smithy/conda_build_config.yaml\\nAdding in ' 'variants from ' '/tmp/tmpiph3pba7/petsc-feedstock/recipe/conda_build_config.yaml\\nAdding ' 'in variants from ' 'argument_variants\\nINFO:conda_smithy.configure_feedstock:Downloading ' 'conda-forge-pinning-2024.05.07.12.03.23\\nINFO:conda_smithy.configure_feedstock:Extracting ' 'conda-forge-pinning to ' '/tmp/tmp76lv6jme/conda-smithy\\nINFO:conda_smithy.configure_feedstock:__pycache__ ' 'rendering is skipped\\nINFO:conda_smithy.configure_feedstock:README ' 'rendering is ' 'skipped\\nINFO:conda_smithy.configure_feedstock:suitesparse7.yaml ' 'from feedstock is ignored and upstream version is ' 'used\\nINFO:conda_smithy.configure_feedstock:mumps_mpi571.yaml from ' 'feedstock is ignored and upstream version is used\\nWARNING: ' 'Setting build platform. This is only useful when pretending to be ' 'on another platform, such as for rendering necessary dependencies ' "on a non-native platform. I trust that you know what you're " 'doing.\\nWARNING: Setting build arch. This is only useful when ' 'pretending to be on another arch, such as for rendering necessary ' 'dependencies on a non-native arch. I trust that you know what ' "you're doing.\\nWARNING: No numpy version specified in " 'conda_build_config.yaml. Falling back to default numpy value of ' '1.23\\nINFO:conda_smithy.configure_feedstock:Applying migrations: ' '/tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\\n/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: ' "UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' is " "being passed through with value 'x86_64-linux'. If you are " 'splitting build and test phases with --no-test, please ensure that ' 'this value is also set similarly at test time.\\n ' 'warnings.warn(\\nINFO:conda_smithy.configure_feedstock:Applying ' 'migrations: ' '/tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\\n/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: ' "UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' is " "being passed through with value 'sbsa-linux'. If you are splitting " 'build and test phases with --no-test, please ensure that this value ' 'is also set similarly at test time.\\n ' 'warnings.warn(\\nINFO:conda_smithy.configure_feedstock:Applying ' 'migrations: ' '/tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\\n/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: ' "UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' is " "being passed through with value 'ppc64le-linux'. If you are " 'splitting build and test phases with --no-test, please ensure that ' 'this value is also set similarly at test time.\\n ' 'warnings.warn(\\n\\n")', 'timeout': 900, 'traceback': 'Traceback (most recent call last):\n' ' File "/opt/autotick-bot/docker/run_bot_task.py", line 118, in ' '_run_bot_task\n' ' data = func(**kwargs)\n' ' ^^^^^^^^^^^^^^\n' ' File "/opt/autotick-bot/docker/run_bot_task.py", line 239, in ' '_rerender_feedstock\n' ' msg = rerender_feedstock_local(fs_dir, **kwargs)\n' ' ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n' ' File ' '"/opt/autotick-bot/conda_forge_tick/rerender_feedstock.py", ' 'line 239, in rerender_feedstock_local\n' ' raise RuntimeError(f"Failed to rerender.\\noutput: ' '{ret.stdout}\\n")\n' 'RuntimeError: Failed to rerender.\n' 'output: Adding in variants from internal_defaults\n' 'Adding in variants from ' '/tmp/tmp76lv6jme/conda-smithy/conda_build_config.yaml\n' 'Adding in variants from ' '/tmp/tmpiph3pba7/petsc-feedstock/recipe/conda_build_config.yaml\n' 'Adding in variants from argument_variants\n' 'INFO:conda_smithy.configure_feedstock:Downloading ' 'conda-forge-pinning-2024.05.07.12.03.23\n' 'INFO:conda_smithy.configure_feedstock:Extracting ' 'conda-forge-pinning to /tmp/tmp76lv6jme/conda-smithy\n' 'INFO:conda_smithy.configure_feedstock:__pycache__ rendering is ' 'skipped\n' 'INFO:conda_smithy.configure_feedstock:README rendering is ' 'skipped\n' 'INFO:conda_smithy.configure_feedstock:suitesparse7.yaml from ' 'feedstock is ignored and upstream version is used\n' 'INFO:conda_smithy.configure_feedstock:mumps_mpi571.yaml from ' 'feedstock is ignored and upstream version is used\n' 'WARNING: Setting build platform. This is only useful when ' 'pretending to be on another platform, such as for rendering ' 'necessary dependencies on a non-native platform. I trust that ' "you know what you're doing.\n" 'WARNING: Setting build arch. This is only useful when ' 'pretending to be on another arch, such as for rendering ' 'necessary dependencies on a non-native arch. I trust that you ' "know what you're doing.\n" 'WARNING: No numpy version specified in ' 'conda_build_config.yaml. Falling back to default numpy value ' 'of 1.23\n' 'INFO:conda_smithy.configure_feedstock:Applying migrations: ' '/tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\n' '/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: ' "UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' " "is being passed through with value 'x86_64-linux'. If you are " 'splitting build and test phases with --no-test, please ensure ' 'that this value is also set similarly at test time.\n' ' warnings.warn(\n' 'INFO:conda_smithy.configure_feedstock:Applying migrations: ' '/tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\n' '/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: ' "UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' " "is being passed through with value 'sbsa-linux'. If you are " 'splitting build and test phases with --no-test, please ensure ' 'that this value is also set similarly at test time.\n' ' warnings.warn(\n' 'INFO:conda_smithy.configure_feedstock:Applying migrations: ' '/tmp/tmpiph3pba7/petsc-feedstock/.ci_support/migrations/cuda120.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/suitesparse7.yaml,/tmp/tmp76lv6jme/conda-smithy/share/conda-forge/migrations/mumps_mpi571.yaml\n' '/opt/conda/envs/cf-scripts/lib/python3.11/site-packages/conda_build/environ.py:558: ' "UserWarning: The environment variable 'CUDA_CONDA_TARGET_NAME' " "is being passed through with value 'ppc64le-linux'. If you are " 'splitting build and test phases with --no-test, please ensure ' 'that this value is also set similarly at test time.\n' ' warnings.warn(\n' '\n' '\n'} ```
beckermr commented 4 months ago

How long does the local rerender take? I wonder if the timeout is the issue. The bot kills the job at 900 seconds.

minrk commented 4 months ago

timeout makes sense, it takes 600 seconds on my m1 mac.

minrk commented 4 months ago

petsc suitesparse migration rerender succeeded after failing for a while, and it looks like it took 630 seconds. mumps also unstuck, so not sure if there's still an action to take here?

It could be really useful if there were a way to manually issue PRs that the migrator recognizes as part of the migration when the migrator fails.

beckermr commented 4 months ago

Yes, or we could have the bot just issue the PR and make a rerender comment in the PR. I don't want to do that in general so that we don't overload GHA, but it might be ok here.

minrk commented 2 months ago

FWIW, this is happening again in petsc4py, which currently has 260(!) variants and takes over half an hour on my machine to rerender (it takes less time to build all variants in parallel on CI than to rerender with no changes). Unfortunately, ~all of the time is in render_recipe which happens before evaluating meta.skip(), so adding skips doesn't meaningfully affect the rerender time (I started investigating thinning the matrix for https://github.com/conda-forge/petsc4py-feedstock/issues/96, but learned that doing this via skip doesn't save any rerender time).

petsc4py variants have 5 dimensions now:

so each linux target has 72 variants to render.

Surprisingly (to me, at least), profiling with pyinstrument suggests that it spends a huge fraction (80% of render_recipe) in copy.deepcopy copying the MetaData objects for the variants. render_recipe for linux-64 spends fully 10 minutes in MetaData.copy() alone, and less than a minute everywhere else combined.

Rendering just 24 variants results in 24 calls to MetaData.copy, but 82,229,145 calls to copy.deepcopy.

So if copying metadata for variants can be optimized, I think overall a massive amount of rerender time can be saved across conda-forge.

beckermr commented 2 months ago

Yep. See this PR: https://github.com/conda/conda-build/pull/5281

We should get that in.

minrk commented 2 months ago

Nice! That should help. On investigation, I doubt variants needs to be deep-copied at all, but the only way to be safe there would be to switch each variant it to an immutable data type, e.g. frozendict of tuples instead of dict of lists. I suspect that's probably the way to go in the long run.

https://github.com/conda/conda-build/pull/5281 reduces petsc4py rerender from over 30 minutes to under 5. The next limiting factor is in get_logger which completely reinitializes the logger on every call and accounts for 30% of overall render time (logger.setLevel is more than the new pickle-based deepcopy).

beckermr commented 2 months ago

The next limiting factor is in get_logger which completely reinitializes the logger on every call and accounts for 30% of overall render time (logger.setLevel is more than the new pickle-based deepcopy).

Yes and this function has other odd side effects which make conda-build hard to use. We should open an issue.

minrk commented 2 months ago

https://github.com/conda/conda-build/issues/5382

minrk commented 2 months ago

With the optimized PRs and https://github.com/conda-forge/conda-smithy/pull/1967 petsc4py renders in under 5 minutes.

The next limiting factor is in the calls to get_loop_vars because it's being rendered with 13,824 variants. I guess because all possible variants are exploded before used_vars are calculated. I think that's unnecessary and the fix may be small (🤞 ).