grosser / parallel_tests

Ruby: 2 CPUs = 2x Testing Speed for RSpec, Test::Unit and Cucumber
3.37k stars 493 forks source link

ParallelTests.wait_for_other_processes_to_finish can lead to deadlock with more than one parallel_test operation running at a time #289

Open DMA57361 opened 10 years ago

DMA57361 commented 10 years ago

We've recently put parallel_tests into Rails applications tested by Jenkins and ran in to a situation where, if we allowed multiple instances of one particular project to test at the same time, every running instance of that project would freeze up at the end of the rake parallel:spec triggered rspec test suite and never finish.

It was us doing something along the lines of this in an after(:suite), as suggested in the readme, that was the culprit.

if ParallelTests.first_process?
  ParallelTests.wait_for_other_processes_to_finish
  do_something
end

It looks like you grep for processes using TEST_ENV_NUMBER= (https://github.com/grosser/parallel_tests/blob/80a53d/lib/parallel_tests.rb#L7-12), but doing so allows processes from different calls to parallel_test can see each other, and in our case we ended up with a small group of deadlocked processes, each waiting for the others to finish first.

Intuitively I feel this method should only consider processes triggered by the same parent call to parallel_tests, or you could do with a warning in the readme if this is intended behavour.

Moving our work into a separate rake task sidestepped the problem in our case, so I'm afraid I didn't delve any further into the code here to suggest a suitable fix, but thought I'd at least raise an issue as an FYI.

ronwsmith commented 10 years ago

It sounds dangerous to allow two (or more) test suites to run in parallel on the same box. They will use the same databases and have the potential to pollute each other or truncate data needed elsewhere (especially with browser tests). I'd recommend limiting to one run at a time and just add another CI slave if it's not fast enough.

DMA57361 commented 10 years ago

The databases for each run of each job we have are all created to be unique based on the job name (a combination of project name and branch), job run number and now also parallel_test's TEST_ENV_NUMBER so each process of each job run will have it's own database (and we've plenty of clean up keeping all this in check) - concurrent database access by similar jobs doesn't present us with a problem.

And we have a single CI box that has plenty of spare resources (it was spare hardware we had no other use for, so the machine is somewhat over-spec), so adding additional slaves would actually be pretty wasteful when we can run more things on the same box.

Please note that I've already worked around the problem (as said, in our case we could split the post-test job into it's own rake task, meaning we don't need to call wait_for_other_processes_to_finish at all now) - I've logged this issue because it needs fixing and/or documenting for future users.

ronwsmith commented 10 years ago

Ah gotcha. Thanks for the additional detail on your setup. That will help anyone else searching for this issue in the future.

kouno commented 10 years ago

Happened to us too. We had to override number_of_running_processes in: https://github.com/grosser/parallel_tests/blob/3f49fdaef8386e00aa21eb04ce7a4cba70120e6d/lib/parallel_tests.rb#L53

I was going to write a pull request but I believe you are supporting windows too, and I don't have a machine to test it against.

Would it be alright to use https://rubygems.org/gems/sys-proctable to improve this situation? I currently check for the PPID of the current process to count the number of spawned test runner.

grosser commented 10 years ago

ppid checking sounds good I'm doing a bunch of pid handling here, maybe useful: https://github.com/grosser/gem_on_demand/blob/master/spec/gem_on_demand/cli_spec.rb#L9-L22

sys-proctable seems to also support windows, so might be the better choice ...

vgrigoruk commented 10 years ago

Faced the same issue, did somebody have a ready solution for this?

grosser commented 10 years ago

Can you do a PR using https://github.com/grosser/gem_on_demand/blob/master/spec/gem_on_demand/cli_spec.rb#L16-L22 to fix this ? Windows just gets to use the old version.

vjong-gpsw commented 9 years ago

Has this been fixed? I'm seeing this issue with cucumber tests using parallel_tests 1.3.7 with parallel 1.4.1.

grosser commented 9 years ago

don't think so, patch welcome

digglife commented 4 months ago

I'm facing the same issue. But since it was raised almost a decade ago, I think I should ask if it's been fixed(before I try to do a PR).

grosser commented 4 months ago

afaik it has not, PR welcome, might have to add some uniqueness like using the process id to avoid this issue