ocharles opened 4 years ago
ah crap.
This is a bug I'm really struggling to kill.
I think that while `removeDirectoryRecursive` is running, new files are being added to the directory.
This is the silly hack I have left in.
The other thing I tried was renaming the folder before removing it, but that also did not work.
I guess I should try my idea of locking the directories ... but I am not really sure what I was thinking, because I would be locking in the same process (I see now that it is postgres writing into the folder, so it would be a different process).
I could also just try deleting over and over again for some amount of time.
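Something like this, as a sketch (the helper name, attempt count, and 100 ms back-off are placeholders, not anything that exists in tmp-postgres):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (IOException, throwIO, try)
import System.Directory (removeDirectoryRecursive)

-- Retry the removal a bounded number of times, sleeping between
-- attempts, on the theory that PostgreSQL stops writing into the data
-- directory shortly after shutdown.
removeDirectoryRecursiveRetry :: Int -> FilePath -> IO ()
removeDirectoryRecursiveRetry attempts dir = do
  result <- try (removeDirectoryRecursive dir) :: IO (Either IOException ())
  case result of
    Right () -> pure ()
    Left e
      | attempts <= 1 -> throwIO e                -- out of retries: rethrow
      | otherwise -> do
          threadDelay 100000                      -- wait 100 ms, then retry
          removeDirectoryRecursiveRetry (attempts - 1) dir
```

The bound matters: if something keeps writing into the directory indefinitely, the original exception should still surface rather than the delete looping forever.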
Ah, you have experienced this too then?
Not recently but yes. I was pretty sure I had not solved it.
We're hitting this a lot at the moment, so I'm doing some investigation into this. Some findings:

- `copyOnWrite = False` and `copyOnWrite = cowCheck` both cause the bug (I haven't confirmed what `cowCheck` is on the failing machine yet).
- `find /tmp -name 'tmp-postgres-data-*' -type d | wc -l` = 2653, so there's a lot of junk lying around from previous runs!
- I listed the existing `/tmp/tmp-postgres-data` directories, and the directory that failed to be removed wasn't in the list - so it's not like the random directory names are picking existing directories.
- I moved `/tmp/tmp-postgres-*` to `/tmp/old-tmp-postgres` and ran a failing test suite again. This test suite still failed, and I was left with 6 `tmp-postgres-data` directories. This is a somewhat surprising number - this test suite uses snapshots and opens 11 connections, so why are almost half of the directories still around? I only observed one exception, too! This makes me think that sometimes the cleanup function runs entirely and then PostgreSQL writes more files. So are we running the cleanup function too early?
- I notice that it's only ever a complaint about the `pg_stat` directory. Reading https://www.postgresql.org/docs/current/monitoring-stats.html, we see:

  > When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts.

  So I'm again wondering if we're doing the cleanup too early (see the sketch after this list).
- […] `tmp-postgres` entirely). I think my sandboxing stuff is just luck that causes the exception to happen less frequently.
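If the root cause really is that we tear down before PostgreSQL has finished its shutdown writes, one mitigation is to not touch the data directory until the server process has actually exited. A rough sketch - the function and argument names here are assumptions for illustration, not tmp-postgres internals:

```haskell
import System.Directory (removeDirectoryRecursive)
import System.Process (ProcessHandle, waitForProcess)

-- Only delete the data directory once the server process has fully
-- exited, so the pg_stat snapshot written on clean shutdown can no
-- longer race with the delete.
stopThenRemove :: ProcessHandle -> FilePath -> IO ()
stopThenRemove postgresHandle dataDir = do
  _exitCode <- waitForProcess postgresHandle  -- blocks until postgres exits
  removeDirectoryRecursive dataDir
```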
I suggest that #215 is reverted, or at least made configurable. I'll live with an 8ms penalty if tests actually work reliably. Also, this time is spent per test, and we run tests in parallel anyway.
Thanks @ocharles. I think you are probably right that #215 is the problem. I'll revert it.
As a temporary hack for anyone who has the same issue:
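One possible shape for such a hack - illustrative only, with the wrapper name and the logging being my own placeholders:

```haskell
import Control.Exception (IOException, handle)

-- Placeholder wrapper: catch any IOException that escapes the wrapped
-- action (in practice, the "removeDirectoryRecursive: ... directory
-- not empty" error from teardown) and log it instead of failing the
-- run.  Leftover /tmp/tmp-postgres-data-* directories then need to be
-- cleaned up separately, e.g. with a periodic `find ... -delete` on CI.
tolerateCleanupFailure :: IO () -> IO ()
tolerateCleanupFailure =
  handle $ \e ->
    putStrLn ("ignoring cleanup failure: " ++ show (e :: IOException))
```

Wrap whatever bracket starts and stops the temporary database in `tolerateCleanupFailure` so a flaky delete doesn't fail an otherwise green run.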
If there is a better solution, I'd be happy to see it.
The better solution is to just revert b90ed91
@jfischoff Do you still plan to revert the above-mentioned commit? We haven't had any more problems since we reverted it (we've been running https://github.com/circuithub/tmp-postgres since my last comment).
@jfischoff My team is also affected by this, it looks like our fix will be a temporary fork of tmp-postgres reverting b90ed91 since @ocharles said it's been working for some time.
This error started showing up for us seemingly out of nowhere.
@jfischoff Polite ping. I still think reverting #215 is worth doing.
@ocharles I haven't had time to work on this project recently, but I should have time this week. I'll take a look. Thanks for the ping.
I just received our first report on this. Is there anything I can do to help diagnose or get the #215 revert moving along?
I have been happily using `tmp-postgres` locally, but on CI I got: […] I've never seen this before. Any idea what could have happened?