pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

Assertion 'rc == 0' failed in mdb_page_dirty() #18726

Open jtilahun opened 1 year ago

jtilahun commented 1 year ago

Describe the bug: Running the export goal with the --no-pantsd argument fails on my system with the error message below:

jtilahun@JTN86G3:~/devel/monorepo$ ./pants --print-stacktrace --no-export-symlink-python-virtualenv --no-pantsd export --resolve=python-default
/github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/6ae7a55/lmdb-sys/lmdb/libraries/liblmdb/m:2126: Assertion 'rc == 0' failed in mdb_page_dirty()
Aborted (core dumped)

The error message indicates a possible relation to other bug reports, e.g. bmatsuo/lmdb-go/issues/131.

Pants version: 2.15.0

OS: Linux

Additional info: For context, running the export goal without the --no-pantsd argument also fails on my system, with the error message below:

jtilahun@JTN86G3:~/devel/monorepo$ ./pants --print-stacktrace --no-export-symlink-python-virtualenv export --resolve=python-default
11:30:06.43 [INFO] Initializing scheduler...
11:30:07.37 [INFO] Scheduler initialized.
Traceback (most recent call last):
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/bin/pants", line 8, in <module>
    sys.exit(main())
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 123, in main
    PantsLoader.main()
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 110, in main
    cls.run_default_entrypoint()
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 92, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 89, in run
    return remote_runner.run(start_time)
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 123, in run
    return self._connect_and_execute(pantsd_handle, executor, start_time)
  File "/home/jtilahun/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 161, in _connect_and_execute
    return PyNailgunClient(port, executor).execute(command, args, modified_env)
native_engine.PantsdClientException: The pantsd process was killed during the run.

If this was not intentionally done by you, Pants may have been killed by the operating system due to memory overconsumption (i.e. OOM-killed). If you keep seeing this error message, try the troubleshooting steps below. If none of those help, please consider filing a GitHub issue or reaching out on Slack so that we can investigate the possible memory overconsumption (https://www.pantsbuild.org/docs/getting-help).
 - Exit other applications, including applications running in the background.
 - Set the global option `--pantsd-max-memory-usage` to reduce Pantsd's memory consumption by retaining less in its in-memory cache (run `./pants help-advanced global`).
 - Disable pantsd with the global option `--no-pantsd` to avoid persisting memory across Pants runs, although you will miss out on additional caching.

According to the error message, disabling pantsd with the global option --no-pantsd is one possible troubleshooting step. However, attempting the export goal with the --no-pantsd argument also results in failure on my system.

stuhood commented 1 year ago

If this is reproducible, then you likely have a corrupted LMDB store... we're currently not in a position to accept a copy of the database to triage that, so your best bet will likely be to remove ~/.cache/pants/lmdb_store. Sorry for the trouble!

If you find a sequence of steps that reproduces the problem (or it's recurring frequently) then please definitely re-open!
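As a more cautious variant of the removal suggested above, the store can be renamed instead of deleted, so that a copy remains available if someone is later able to triage it. The following is only a minimal sketch: the default path is the one named in this thread, the helper name `move_store_aside` and the `.corrupt-<timestamp>` suffix are hypothetical, and it assumes no Pants process (including pantsd) is using the store while it runs.

```python
import time
from pathlib import Path
from typing import Optional

# Assumption: the store location mentioned in this thread.
DEFAULT_STORE = Path.home() / ".cache" / "pants" / "lmdb_store"


def move_store_aside(store: Path = DEFAULT_STORE) -> Optional[Path]:
    """Rename the store directory so the next run starts with a fresh one.

    Returns the backup path, or None if there was no store to move.
    """
    if not store.exists():
        return None
    # Hypothetical naming scheme: sibling directory with a timestamp suffix.
    backup = store.with_name(f"{store.name}.corrupt-{int(time.time())}")
    store.rename(backup)
    return backup
```

Calling `move_store_aside()` with no arguments targets the default path; the renamed directory can be deleted once it is no longer needed.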

jtilahun commented 1 year ago

Hmm, strange. I ran sudo rm -r ~/.cache/pants/lmdb_store to remove the LMDB store, and now the export goal with the --no-pantsd argument succeeds. I don't know of a sequence of steps that reproduces the corruption of the LMDB store, though. What is the LMDB store and how is it used?

stuhood commented 1 year ago

Both the --pantsd and --no-pantsd errors are likely the same error under the hood: the only difference is whether the crash happens in the foreground or the background. So both commands should now be fine.
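To illustrate the foreground case: a failed C assertion ends the process with an abort signal, which the shell prints as "Aborted (core dumped)". The sketch below does not reproduce the LMDB failure and does not involve Pants at all; it uses a stand-in child process that aborts itself, and only shows how such a death appears to a parent process on a POSIX system. The helper names are hypothetical.

```python
import signal
import subprocess
import sys
from typing import List

# Stand-in for a process that hits a failed C assertion: it aborts itself.
ABORTING_CHILD = [sys.executable, "-c", "import os; os.abort()"]


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd: List[str]) -> str:
    """Run a command, discard its output, and describe how it ended."""
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return describe_exit(proc.returncode)
```

On Linux, `run_and_describe(ABORTING_CHILD)` should report that the child was killed by a signal rather than exiting normally. When the same kind of crash happens inside a background daemon, the client only sees that the daemon went away, which is consistent with the "pantsd process was killed" message above.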

The LMDB store contains all cache entries and their file/directory contents.

jtilahun commented 1 year ago

The export goal without the --no-pantsd argument also succeeds now. So it does seem like both errors are related.

I don't know what to say about the LMDB store. I don't know how the LMDB store could have gotten corrupted.

juliaproctor commented 1 year ago

This happened on my system as well. Removing the LMDB store worked, but I don't know how or why it got corrupted in the first place. I wanted to bring it up, as it seems there is a bug somewhere.

jtilahun commented 1 year ago

I got another report of the error message

/github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/6ae7a55/lmdb-sys/lmdb/libraries/liblmdb/m:2126: Assertion 'rc == 0' failed in mdb_page_dirty()

from yet another colleague who was unable to run the export goal with the same arguments.

There is most certainly a bug somewhere. What is the path forward here? This issue was closed despite the fact that the problem clearly still exists. Please advise regarding a path forward. Thank you.

huonw commented 1 year ago

Reopening per https://pantsbuild.slack.com/archives/C046T6T9U/p1690955357162709, which very reasonably points out that this keeps occurring, so it's unfortunate to have the issue be closed.

Getting a sense of the behaviour here:

  1. @jtilahun, you mention a colleague hitting this too: does that mean it was the same repo/codebase but on a different/unrelated machine?
  2. @juliaproctor, do you happen to be working with @jtilahun on the same codebase or are you hitting this completely separately?

That said, as @stuhood points out, I imagine this might be very hard to narrow down without some more hints about the conditions, e.g. can you share your pants.toml and the contents of the lockfile for the python-default resolve?

juliaproctor commented 1 year ago

Yes, I work with @jtilahun on the same codebase

jtilahun commented 1 year ago

Thanks @huonw for reopening this issue.

The codebase is common to all of the occurrences I'm aware of. Yes, the other colleague I mentioned in my last comment also hit this on the same repository/codebase, but on another machine. So that's a total of three discrete incidents (including my own) as of this writing.

I understand that we'll need more information to help pinpoint what's happening. I sent our pants.toml and the contents of the lockfile for the python-default resolve to @huonw via Slack DM.