ocharles opened 4 years ago
ah crap.
This is a bug I'm really struggling to kill.
I think that while `removeDirectoryRecursive` is running, new files are being added to the directory.
This is the silly hack I have left in.
The other thing I tried was renaming the folder before removing it, but that also did not work.
I guess I should try my idea of locking the directories ... but I am not really sure what I was thinking, because I would be locking in the same process (I see now that it is postgres writing into the folder, so it would be a different process).
I could also just try deleting over and over again for some amount of time.
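Something like this, as a sketch (the helper name, attempt count, and 100 ms back-off are placeholders, not anything that exists in tmp-postgres):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (IOException, throwIO, try)
import System.Directory (removeDirectoryRecursive)

-- Retry the removal a bounded number of times, sleeping between
-- attempts, on the theory that PostgreSQL stops writing into the data
-- directory shortly after shutdown.
removeDirectoryRecursiveRetry :: Int -> FilePath -> IO ()
removeDirectoryRecursiveRetry attempts dir = do
  result <- try (removeDirectoryRecursive dir) :: IO (Either IOException ())
  case result of
    Right () -> pure ()
    Left e
      | attempts <= 1 -> throwIO e                -- out of retries: rethrow
      | otherwise -> do
          threadDelay 100000                      -- wait 100 ms, then retry
          removeDirectoryRecursiveRetry (attempts - 1) dir
```

The bound matters: if something keeps writing into the directory indefinitely, the original exception should still surface rather than the delete looping forever.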
Ah, you have experienced this too then?
Not recently but yes. I was pretty sure I had not solved it.
We're hitting this a lot at the moment, so I'm doing some investigation into this. Some findings:

- `copyOnWrite = False` and `copyOnWrite = cowCheck` both cause the bug (I haven't confirmed what `cowCheck` is on the failing machine yet).
- `find /tmp -name 'tmp-postgres-data-*' -type d | wc -l` = 2653, so there's a lot of junk lying around from previous runs!
- I listed the existing `/tmp/tmp-postgres-data` directories, and the directory that failed to be removed wasn't in the list - so it's not like the random directory names are picking existing directories.
- I moved `/tmp/tmp-postgres-*` to `/tmp/old-tmp-postgres` and ran a failing test suite again. This test suite still failed, and I was left with 6 `tmp-postgres-data` directories. This is a somewhat surprising number - this test suite uses snapshots and opens 11 connections, so why are almost half of the directories still around? I only observed one exception, too! This makes me think that sometimes the cleanup function runs entirely and then PostgreSQL writes more files. So are we running the cleanup function too early?
- I notice that it's only ever a complaint about the `pg_stat` directory. Reading https://www.postgresql.org/docs/current/monitoring-stats.html, we see:

  > When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts.

  So I'm again wondering if we're doing the cleanup too early (see the sketch after this list).
- […] `tmp-postgres` entirely). I think my sandboxing stuff is just luck that causes the exception to happen less frequently.
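If the root cause really is that we tear down before PostgreSQL has finished its shutdown writes, one mitigation is to not touch the data directory until the server process has actually exited. A rough sketch - the function and argument names here are assumptions for illustration, not tmp-postgres internals:

```haskell
import System.Directory (removeDirectoryRecursive)
import System.Process (ProcessHandle, waitForProcess)

-- Only delete the data directory once the server process has fully
-- exited, so the pg_stat snapshot written on clean shutdown can no
-- longer race with the delete.
stopThenRemove :: ProcessHandle -> FilePath -> IO ()
stopThenRemove postgresHandle dataDir = do
  _exitCode <- waitForProcess postgresHandle  -- blocks until postgres exits
  removeDirectoryRecursive dataDir
```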
I suggest that #215 is reverted, or at least made configurable. I'll live with an 8ms penalty if tests actually work reliably. Also, this time is spent per test, and we run tests in parallel anyway.
Thanks @ocharles. I think you are probably right that #215 is the problem. I'll revert it.
As a temporary hack for anyone who has the same issue:
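One possible shape for such a hack - illustrative only, with the wrapper name and the logging being my own placeholders:

```haskell
import Control.Exception (IOException, handle)

-- Placeholder wrapper: catch any IOException that escapes the wrapped
-- action (in practice, the "removeDirectoryRecursive: ... directory
-- not empty" error from teardown) and log it instead of failing the
-- run.  Leftover /tmp/tmp-postgres-data-* directories then need to be
-- cleaned up separately, e.g. with a periodic `find ... -delete` on CI.
tolerateCleanupFailure :: IO () -> IO ()
tolerateCleanupFailure =
  handle $ \e ->
    putStrLn ("ignoring cleanup failure: " ++ show (e :: IOException))
```

Wrap whatever bracket starts and stops the temporary database in `tolerateCleanupFailure` so a flaky delete doesn't fail an otherwise green run.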
If there is a better solution, I'd be happy to see it.
The better solution is to just revert b90ed91
@jfischoff Do you still plan to revert the above-mentioned commit? We haven't had any more problems since we reverted it (we've been running https://github.com/circuithub/tmp-postgres since my last comment).
@jfischoff My team is also affected by this, it looks like our fix will be a temporary fork of tmp-postgres reverting b90ed91 since @ocharles said it's been working for some time.
This error started showing up for us seemingly out of nowhere.
@jfischoff Polite ping. I still think reverting #215 is worth doing.
@ocharles I haven't had time to work on this project recently, but I should have time this week. I'll take a look. Thanks for the ping.
I just received our first report on this. Is there anything I can do to help diagnose or get the #215 revert moving along?
I have been happily using `tmp-postgres` locally, but on CI I got: […] I've never seen this before. Any idea what could have happened?