I can't reproduce in `master` or `njs-sow-2`. All tests run except `runner_test.py` (see comment).
Perhaps we can debug together. My schedule is open.
Unfortunately I was AFK during the day – going to take a look now, but I guess you're probably AFK now :-). If I don't get it tonight then maybe we can compare notes tomorrow...
Just stepping away from the keyboard now. I have a couple of meetings tomorrow, but ping me and we'll see. I didn't review the code, just tested it and tried to reproduce the issue.
OK, I built a new totally-vanilla vagrant image from scratch (just ran `vagrant up`, then installed `redhat-dev` and `sirepo-dev`, then ran `pytest tests`). And then I forgot about it for a while, which I guess was crucial :-). I get 3 failures:
Those two sqlite failures seem to have 5+ minute timeouts on them before they fail, which I guess is why I thought it had frozen before!
Possibly the two sqlite locking-related errors are a side-effect of having my `src/` dir on NFS? No idea about `test_get_data_file` though.
And for reference, here are the results on my original VM with the `njs-sow-2` branch:

So, the same three errors, though `test_get_data_file` has switched back to `AttributeError: 'Dict' object has no attribute 'nextRequest'`, instead of `AssertionError: /home/vagrant/src/radiasoft/sirepo/tests/server_work/db/user/WuZ7gm5M/elegant/VEcL0HN7/bunchReport/elegant.bun: invalid or empty output from sddsprintout`.
I just re-ran `test_get_data_file` on the clean VM, to check whether there's some randomness, and indeed, this time it gave me `AttributeError: 'Dict' object has no attribute 'nextRequest'`.
Conclusion so far: none of the failures can be attributed to my branch or to my having somehow screwed up my main VM; using NFS for `src/` may cause some problems; and there's still a mystery about `test_get_data_file`.
NFS definitely causes problems. I debugged a speed problem this week on another developer's VM with NFS-mounted sources. `lstat` (as exercised by `git status`) is an order of magnitude or two slower. It's good to know that this happens on Linux hosts, too, because I thought it was a Mac issue.
It's really odd that you got `database is locked`, because it is a fresh db unless you happen to be setting `SIREPO_SRDB_ROOT` or `SIREPO_SERVER_DB_DIR`. I don't think the tests clear those values, which is probably a bug. I wish people would print more context in error messages. This contains no information about what is locked:

```
OperationalError: (sqlite3.OperationalError) database is locked [SQL: u'PRAGMA table_info("user_t")'] (Background on this error at: http://sqlalche.me/e/e3q8)
```
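Purely as an illustration of the kind of context that would help (the wrapper and names below are made up, and this isn't how SQLAlchemy is actually wired in here), the query could be wrapped so the error at least names the database file:

```python
import sqlite3

def table_info(db_path, table):
    # Hypothetical helper: re-raise sqlite errors with the file and
    # table that the bare "database is locked" message omits.
    try:
        with sqlite3.connect(db_path, timeout=5) as conn:
            return conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
    except sqlite3.OperationalError as e:
        raise RuntimeError('{}: {} (table={})'.format(db_path, e, table))
```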
For the sdds error, you can run this command manually to see what is output:

```
sirepo/template/elegant_common.py:61:subprocess_output ['sddsprintout', '-columns', '-spreadsheet=csv', '/home/vagrant/src/radiasoft/sirepo/tests/server_work/db/user/WuZ7gm5M/elegant/VEcL0HN7/bunchReport/elegant.bun']: exit=1 err=Error for sddsprintout:
```
> It's really odd that you got `database is locked`, because it is a fresh db unless you happen to be setting `SIREPO_SRDB_ROOT` or `SIREPO_SERVER_DB_DIR`. I don't think the tests clear those values, which is probably a bug.
I suspect it's something like: sqlite tries to lock the file. The lock command returns an error, meaning "this NFS mount doesn't have file-locking support". sqlite misinterprets the error as meaning "this file is already locked".
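A rough illustration of that hypothesis (sqlite actually uses POSIX record locks rather than `flock`, and the path below is made up, but the failure mode is the same): on a mount without lock support, the locking call itself fails, and the resulting error is easy to misread as contention:

```python
import errno
import fcntl

# Made-up path; the point is the errno, not the file.
with open('/mnt/nfs/example.db', 'a') as f:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError) as e:
        if e.errno == errno.ENOLCK:
            # "No locks available": the filesystem can't lock at all
            print('no file-locking support here')
        elif e.errno in (errno.EACCES, errno.EAGAIN):
            # This is what a genuinely-locked file looks like
            print('file really is locked by another process')
```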
I just added `nfs_version: 4` to my Vagrantfile. (Vagrant's default is to use NFS v3 for some reason. v4 has locking integrated into the protocol, rather than requiring some weird sideband protocol that always breaks.) Now those two tests pass, and the whole thing finishes in ~2 minutes instead of ~15.
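For reference, a Vagrantfile stanza along these lines should do it (the paths are assumptions; `nfs_version` and `nfs_udp` are standard Vagrant synced-folder options, and NFSv4 requires TCP):

```ruby
Vagrant.configure("2") do |config|
  # Assumed host/guest paths; the key option is nfs_version: 4,
  # overriding Vagrant's NFSv3 default.
  config.vm.synced_folder "src/", "/home/vagrant/src",
    type: "nfs",
    nfs_version: 4,   # v4 has locking built into the protocol
    nfs_udp: false    # NFSv4 does not support UDP
end
```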
OK, I tracked down this last failure in `test_get_data_file`.

Here's what's happening:

1. `"pending"` is written into the job's status file.
2. Then `"running"` is written into the job's status file. This is done with `pkio.write_text`, which updates the file in-place by truncating it, and then writing the new text.
3. Meanwhile, `_simulation_run_status` in `server.py` is reading the status file to figure out what's going on. Somehow, it manages to read it during the period when it's truncated, so the status it reads out is `""`. (This is a race condition under any circumstances, but I guess the reason I'm hitting it so much more reliably than you is probably something to do with NFS again.)
4. Inside `_simulation_run_status`, it does `is_running = rep.job_status in _RUN_STATES`. Here `rep.job_status` is `""`, so we have `is_running = False`.
5. `_simulation_run_status` ends up with `is_processing = True` and `is_running = False`, which makes it think it has some stale job running around, so it calls `runner.job_race_condition_reap` to kill the job.
6. `server.py` tells the test that the job terminated unexpectedly, and the test isn't expecting that, so it crashes.

So it's a real bug, but given that (a) you're not hitting this in your production configuration, for whatever reason, and (b) this is exactly the code we're trying to get rid of (and this is why we're trying to get rid of it!), I don't think we should worry about this test failure.
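A minimal sketch of the race, and of the standard write-to-temp-and-rename fix; the function names are illustrative, not Sirepo's actual code:

```python
import os

def write_status_racy(path, text):
    # How a truncate-then-write update (pkio.write_text-style) races:
    # opening with 'w' truncates first, so a concurrent reader that
    # hits this window sees "".
    with open(path, 'w') as f:
        f.write(text)

def write_status_atomic(path, text):
    # Write a temp file and rename it over the original. rename() is
    # atomic on POSIX (same filesystem), so a reader sees either the
    # old contents or the new ones, never an empty file.
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)
```

NFS adds its own caveats even to the rename approach, but it at least closes the empty-read window.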
Excellent debugging, thanks!
Yes, let's ignore the test failure.
I added a check for NFS v4 on Macs. I don't know if it works there, but it's what we use Linux-to-Linux, and it works much better, of course.
I'm in a crunch this week so I don't have time to review the code. I'll try to get the review done by this weekend.
There's an intermittent `runner_test.py` hang with a zombie sirepo process. The test process is in a loop, rapidly polling `runner.sock`, which no longer exists:
```
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
stat("/home/vagrant/src/njsmith/sirepo/tests/runner_work/db/runner.sock", 0x7ffdb8c7df70) = -1 ENOENT (No such file or directory)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
stat("/home/vagrant/src/njsmith/sirepo/tests/runner_work/db/runner.sock",
```
Here's the ps:
```
5655 pts/2 S+  0:00 bash test.sh
5661 pts/2 S+  0:01 python setup.py test
5887 pts/2 Sl+ 0:00 python setup.py test
5979 ?     Ssl 0:00 orted --hnp --set-sid --report-uri 23 --singleton-died-pipe 24 -mca state_novm_select 1
5986 pts/2 Z+  0:00 [sirepo] <defunct>
```
orted is running, which happens because mpi4py starts it, but it is detached from the group so it's hard to say if it is the cause of the zombie. Here's the full process tree:
```
4501 pts/0 Ss 0:00 -bash USER=vagrant LOGNAME=vagrant HOME=/home/vagrant PATH=/usr/local/bin:/usr/bin MAIL=/var/mail/vagrant SHELL=/bin/bash SSH_
4843 pts/0 S+ 0:00  \_ screen -r
1145 pts/1 Ss 0:00 /bin/bash
1405 pts/1 S+ 0:03  \_ emacs
1407 pts/2 Ss 0:00  \_ /bin/bash
5655 pts/2 S+ 0:00  |   \_ bash test.sh
5661 pts/2 S+ 0:01  |       \_ python setup.py test
5887 pts/2 Sl+ 0:00 |           \_ python setup.py test
5986 pts/2 Z+ 0:00  |               \_ [sirepo] <defunct>
```
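For what it's worth, a `<defunct>` entry just means the child has exited and nobody has `wait()`ed on it yet. A minimal, Sirepo-independent demonstration:

```python
import subprocess
import time

child = subprocess.Popen(['true'])  # child exits almost immediately
time.sleep(1)
# At this point 'ps' shows the child as <defunct>: it is dead but
# still in the process table, waiting for the parent to reap it.
child.wait()  # reaping it removes the zombie entry
```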
The test is failing when run directly:
```
Traceback (most recent call last):
  File "/home/vagrant/src/njsmith/sirepo/tests/runner_test.py", line 86, in test_runner_myapp
    run.nextRequest
  File "/home/vagrant/src/radiasoft/pykern/pykern/pkcollections.py", line 56, in __getattr__
    return self.__getattribute__(name)
AttributeError: 'Dict' object has no attribute 'nextRequest'
```
The failure when running individually was due to the `'complete'` vs `'completed'` mismatch that I had left over. After I changed that, it worked, but the test still hangs when run from `bash test.sh`.
> There's an intermittent `runner_test.py` hang with a zombie sirepo process
So this apparently is a difference between `pytest tests` and `setup.py test`. These seem to be the same except that if you do `setup.py test` then it sets `PYTHONPATH` to:

This is then inherited by our py3 process, and somehow things don't work out right when we try to run py3 with 122 copies of the py2 environment's site-packages added to its `sys.path`. If I add a `del os.environ['PYTHONPATH']` at the top of the test, then `setup.py test` starts working again.
Maybe this makes sense to PJ Eby, but it sure doesn't make sense to me.
What environment does `test.sh` run in? Is it run after everything is installed into the local python environment, to check that it works-as-deployed? Is the goal to test the installed sirepo, or the sirepo in the source tree?
I'm trying to figure out which of these options is best:

1. Modify `test.sh`, replacing `setup.py test` with `pytest tests`
2. Modify `test.sh`, replacing `setup.py test` with `PYTHONPATH=$PWD pytest tests`
3. Modify `runner_test.py` to do the `del os.environ['PYTHONPATH']` hack (which patches over the immediate problem, but breaks things if you want to set `PYTHONPATH` before running `pytest`)

I had to figure out why `PYTHONPATH` was being created so badly. I asked for these changes: https://github.com/pypa/setuptools/pull/1709 They won't put it in, because I didn't follow procedures. That won't fix things for us right now.
I think the fix is `del runner_env['PYTHONPATH']`, because the entire python environment should be set up completely by `pyenv exec`.
`test.sh` runs in `docker build`. It performs validation, runs the javascript test, and deploys to pypi if it has the credentials to do so. Since you are calling `pyenv exec` directly, I don't think it can run in tox or other environments that modify `PYTHONPATH`. py3 and py2 have to be in agreement, and that's the only environment it needs to run in. That being said, `test.sh` runs `pksetup.PKDeploy`, which runs tox, which will be a problem unless we skip this test, which we should in that case.
One thing I did realize is that the `SIREPO_SRDB_ROOT` setting is probably not enough. We should cascade the entire config environment. `pkconfig` could do that, but that's for another time.
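A hypothetical sketch of that cascading (the prefixes are guesses; this isn't an existing `pkconfig` API):

```python
import os

def cascaded_env(base_env):
    # Copy every config variable through to the child process instead
    # of hand-picking SIREPO_SRDB_ROOT; prefixes here are assumptions.
    env = dict(base_env)
    env.update(
        (k, v) for k, v in os.environ.items()
        if k.startswith(('SIREPO_', 'PYKERN_'))
    )
    return env
```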
This definitely fixes the `python setup.py test` hanging problem:
```diff
diff --git a/tests/runner_test.py b/tests/runner_test.py
index c2303bbf..ce950f6e 100644
--- a/tests/runner_test.py
+++ b/tests/runner_test.py
@@ -21,6 +21,7 @@ def test_runner_myapp():
     # Check if the py3 environment is set up
     py3_env = dict(os.environ)
+    del py3_env['PYTHONPATH']
     py3_env['PYENV_VERSION'] = 'py3'
     returncode = subprocess.call(
         ['pyenv', 'exec', 'sirepo', '--help'], env=py3_env
```
> I'm trying to figure out which of these options is best: modify `test.sh`, replacing `setup.py test` with `pytest tests`
That option seems best to me. We're looking to deprecate `setup.py <anything>`, including `test`.
If you're relying on the `setup_requires` support that `setup.py test` offers, that's also deprecated. Our best recommendation, if you need to install packages to run tests, is to use tox.
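Something like this minimal `tox.ini` would cover the test path (the env list and deps are assumptions about this project, not jaraco's exact recommendation):

```ini
[tox]
envlist = py27

[testenv]
deps = pytest
commands = pytest tests {posargs}
```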
Thanks for the notification @jaraco. I didn't know `setup.py <anything>` was deprecated.
I wonder if there's a plan for a command to create documentation, deployment, etc. per https://github.com/pypa/setuptools/issues/1684#issuecomment-471205138. Is the plan to rely on tox for everything?
I wouldn’t say they’re deprecated yet, only discouraged. Tox may be suitable and it’s a model I’ve used for docs with success.
I guess it’s fair to say there isn’t a plan, just intentions and exploration and experimentation.
Using the vagrant image, running `python setup.py tests` in the sirepo source dir, on latest master (3d1d4623), I get:

And then it seems to hang forever.

If I do `pytest tests -v -s --fulltrace`, and hit control-C after it's stuck, I get a ton of output that suggests it's stuck in `waitpid`?

I also tried running just the `server_tests.py` file, and it didn't freeze, but `test_get_data_file` failed. I guess this may be a second, unrelated issue:

CC @robnagler