schireson / pytest-mock-resources

Pytest fixtures that let you actually test code that depends on external resources (Postgres, Mongo, Redshift...).
https://pytest-mock-resources.readthedocs.io/en/latest/quickstart.html
MIT License
183 stars 19 forks

fix: Use a lock when waiting for containers with --pmr-multiprocess-safe #141

Closed JonM0 closed 2 years ago

JonM0 commented 2 years ago

I noticed that once in a while tests (with pytest-xdist) would error out because "port is already allocated", leaving behind a stopped container and a volume. Let me know if you'd prefer that I open an issue detailing this. I managed to fix it on my end by putting wait_for_container inside a lock shared by all attempts to get a container on a certain port. I tested it with a couple of fixtures and it seemed to work, but I was unable to run the automated tests on my machine (neither on Windows 11 nor WSL2 Ubuntu; hopefully GitHub Actions will say everything is fine).
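The fix described above can be sketched roughly as follows. This is an illustrative, minimal version of the idea, not the actual PR code: because pytest-xdist workers are separate *processes*, a plain `threading.Lock` won't do, so this sketch uses an atomically created lock file keyed on the container's port (the name `port_lock` and the lock-file naming scheme are assumptions for illustration).

```python
import os
import tempfile
import time
from contextlib import contextmanager

@contextmanager
def port_lock(port, lock_dir=None):
    """Serialize container startup across worker processes for one port.

    Uses O_CREAT | O_EXCL so that lock-file creation is atomic: exactly
    one process "wins"; the others spin until the file is removed.
    """
    lock_dir = lock_dir or tempfile.gettempdir()
    path = os.path.join(lock_dir, f"pmr-container-{port}.lock")
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(0.05)  # another worker holds the lock; retry
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)
```

A caller would then wrap the waiting logic, e.g. `with port_lock(5432): wait_for_container(...)`, so only one worker at a time attempts to start or wait on the container for that port.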

DanCardin commented 2 years ago

Unfortunately, due to the way CI works (at least today, perhaps we could try dind), we're connecting to already spun-up instances of these services, so this feature ends up being much more challenging to test in CI.

The whole retry mechanism exists both because of container startup time and specifically because of this issue, which is why wait_for_container has a specific carve-out for that error (whichever subprocess "wins" by spinning up the container first is the one that writes to the file).

It's odd to me that that exception isn't getting ignored by https://github.com/JonM0/pytest-mock-resources/blob/patch-sync-container/src/pytest_mock_resources/container/base.py#L130
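For illustration, the retry-with-carve-out behavior described above might look something like this. This is an assumed shape, not the library's actual code (`run_with_retry`, `run_container`, and `check_fn` are hypothetical names): the worker that wins starts the container, losers hit "port is already allocated", treat it as benign, and fall back to waiting for the winner's container to come up.

```python
import time

def run_with_retry(run_container, check_fn, retries=20, delay=0.25):
    """Attempt to start a container, tolerating the race where
    another worker already bound the port."""
    for _ in range(retries):
        try:
            run_container()
        except RuntimeError as e:
            if "port is already allocated" not in str(e):
                raise  # a genuinely unexpected failure
            # Benign: another worker won the race to start the container.
        if check_fn():
            return  # container is up, whichever worker started it
        time.sleep(delay)
    raise RuntimeError("container never became ready")
```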

Out of curiosity, which container are you hitting this with?

With all that being said, I don't think it's necessarily a bad thing for that block to use a lock, so I'll do some local testing.

DanCardin commented 2 years ago

And again, if you want to bump the patch version, then once I've done some local sleuthing, I can merge and release directly.

JonM0 commented 2 years ago

I am using postgres:14 with Docker Desktop wsl2. Most of the time running with 12 processes I see all the loser containers being created and dying, but once in a while a test errors out and a container is not deleted.

DanCardin commented 2 years ago

Released in 2.2.2, thanks again!

DanCardin commented 2 years ago

I think I've decided that this lock is actually happening too high up. We need to be locking around just the containers.run call, otherwise the lock needs to be acquired around every test startup, even after all the containers have been started.

I suspect there's a noticeable slowdown due to lock contention if you have a lot of fast tests.

JonM0 commented 2 years ago

You need at least one call to check_fn inside the lock before the run call; otherwise, when all the processes block on it at startup, they all try to run a new container once they finally acquire it. But yeah, it probably helps to check whether a container is already up before locking.
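What's being described here is the classic double-checked locking pattern. A minimal sketch, with illustrative names (`get_container`, `check_fn`, `start_container` are not the library's actual API), and using an in-process `threading.Lock` for simplicity where a real pytest-xdist setup would need a cross-process lock:

```python
import threading

_lock = threading.Lock()

def get_container(port, check_fn, start_container):
    # Fast path: if the container is already up, skip the lock entirely,
    # so steady-state test startups don't contend on it.
    if check_fn(port):
        return
    with _lock:
        # Re-check inside the lock: another worker may have won the race
        # and started the container while we were waiting to acquire it.
        if check_fn(port):
            return
        start_container(port)
        # Wait until the freshly started container is actually reachable.
        while not check_fn(port):
            pass
```

Without the inner `check_fn` call, every worker that queued up on the lock during startup would run its own container once it finally acquired the lock, which is exactly the failure mode described above.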

DanCardin commented 2 years ago

Ahh yes! I'll see if I can't prototype something up to this ^ effect

DanCardin commented 2 years ago

I think I'm actually not seeing any practical performance benefit from moving the lock further down, perhaps because we need to do the additional aliveness check before attempting to start the container; I'm not sure. It's arguably less repetitive, but I ended up using a single level of recursion inside wait_for_container, which was also arguably more confusing, so if there's no perf benefit I think it's not worth it.

Either I'm doing something wrong, or the lock contention is unavoidable if you want to avoid racing the processes against one another, which is the point of the original PR. So unless we add a flag to account for the (apparent) OS-specific behavior (which I'm not especially inclined to do), then...

tl;dr it seems like I don't get an obvious perf benefit, so it's good as-is!