grosser / parallel_tests

Ruby: 2 CPUs = 2x Testing Speed for RSpec, Test::Unit and Cucumber

"Tests failed" but none listed-- child process dying? #819

Open OxSon opened 3 years ago

OxSon commented 3 years ago

Intermittently (about 20-30% of the time), my test pipeline fails and prints "Tests Failed" at the end of the job, even though the results summary earlier in the output reports 0 failures. It looks like not all of the tests are being run: in the summary line "x examples, 0 failures ...", x is 4700 for successful pipelines and lower for ghost-failure pipelines.

Example output for a successful pipeline:

4700 examples, 0 failures, 11 pendings
Took 149 seconds (2:29)

And a ghost-failure pipeline:

3878 examples, 0 failures, 9 pendings
Tests have failed for a parallel_test group. Use the following command to run the group again:
bundle exec rspec spec/ <SNIP: a bunch of *_spec.rb files> --seed 36611
Took 217 seconds (3:37)
Tests Failed

Sometimes more than one "bundle exec rspec spec/......" group is shown at the end, and in those runs the X in "X examples, 0 failures" is even smaller.

The relevant section of our CI looks like this (this is Gitlab CI):

test:
  stage: test
  except:
    - tags
  script:
    - export PARALLEL_TEST_PROCESSORS=6
    - export PARALLEL_TEST_FIRST_IS_1=true
    - bundle exec rake parallel:setup
    - JRUBY_OPTS=--debug bundle exec parallel_rspec -n 6 ./spec --verbose

My .rspec looks like this:

--order rand
--tty
--color
--format progress
--format ParallelTests::RSpec::SummaryLogger --out tmp/spec_summary.log

I've tried using the Gitlab parallel feature, but it broke my code-coverage results, so I've abandoned that route. I've also tried the fixes suggested in similar closed issues, where applicable, with no luck. The failing child processes report no backtrace.

I noticed in other issues you often suggested adding something to the runners that prints something like "im alive!" every minute... I'm not sure how to do that effectively for Gitlab CI.
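One CI-agnostic way to get the "I'm alive" output mentioned above is to start a background thread inside each test process. This is only a minimal sketch: start_heartbeat is a hypothetical helper, not part of parallel_tests, and the 60-second interval and message format are assumptions. TEST_ENV_NUMBER is the environment variable parallel_tests sets for each process.

```ruby
# Hypothetical helper (not a parallel_tests API): prints a line at a fixed
# interval so the CI log shows which test processes are still running.
def start_heartbeat(interval: 60, out: $stderr)
  Thread.new do
    loop do
      sleep interval
      # TEST_ENV_NUMBER identifies the parallel_tests process; nil outside of it.
      out.puts "[heartbeat] pid=#{Process.pid} " \
               "TEST_ENV_NUMBER=#{ENV['TEST_ENV_NUMBER'].inspect} alive at #{Time.now}"
      out.flush
    end
  end
end
```

Calling start_heartbeat once near the top of spec/spec_helper.rb should be enough. If a process stops printing heartbeats before its summary line appears, that narrows down when it died.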

I would appreciate any troubleshooting help you can provide! Thank you :)

grosser commented 3 years ago

afaik what happens is that one of the processes dies, so that could be a rogue exit or abort somewhere in the tests/code, or maybe it gets OOM-killed. It prints the group that failed, so it must be happening in there somewhere. Rerunning the failed group did not help, right?
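To check the "rogue exit or abort" theory, an at_exit hook can log how each test process is ending. The following is a rough sketch, assuming it is loaded from something like spec/spec_helper.rb; describe_exit is a hypothetical helper name. It cannot catch exit! or a SIGKILL from the OOM killer, since neither runs at_exit hooks, and it will also print for a run with genuine failures, because rspec itself exits with a non-zero status in that case.

```ruby
# Hypothetical helper: turns the exception that is ending the process
# (if any) into a short report including where it was raised.
def describe_exit(error)
  return nil if error.nil? # normal termination, nothing to report

  kind =
    if error.is_a?(SystemExit)
      # Kernel#exit and Kernel#abort both raise SystemExit.
      "exit/abort with status #{error.status}"
    else
      "uncaught #{error.class}: #{error.message}"
    end
  trace = Array(error.backtrace).first(10).join("\n  ")
  "#{kind}\n  #{trace}"
end

at_exit do
  # Inside at_exit, $! holds the exception terminating the process, or nil.
  report = describe_exit($!)
  if report
    warn "[pid #{Process.pid}, TEST_ENV_NUMBER=#{ENV['TEST_ENV_NUMBER'].inspect}] " \
         "process ending via #{report}"
  end
end
```

If a group fails without any such line in its output, that points away from a plain exit or abort and toward the process being killed from outside.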

koya-masuda commented 1 month ago

Same here in GitHub Actions.

1690 examples, 0 failures, 10 pendings

Took 507 seconds (8:27)
Error: Process completed with exit code 1.

Currently, I use version 1.23.0. Is there a possibility that updating to the latest version could resolve the issue?

grosser commented 1 month ago

give it a try, there has been no explicit fix for this issue, but maybe it was fixed by accident or at least gives a better backtrace now

koya-masuda commented 1 month ago

Okay. I'm going to try it and put together a small test case that reproduces this issue. Thank you :)