hpc / pavilion2

Pavilion is a Python 3 (3.5+) based framework for running and analyzing tests targeting HPC systems.
https://pavilion2.readthedocs.io/

Pav Status/Result hangs, Run_Complete into BUILD_CREATED #334

Open CalvinDSeamons opened 4 years ago

CalvinDSeamons commented 4 years ago

I ran into a very strange bug while testing snow today. This happened on the Yellow front end, where there are approximately 11,000 more tests sitting in the working_dir than on Turquoise. I mention that because it seems to be the only notable difference.

After launching my tests I ran watch pav status. When the status table loaded (which only took 3-5 seconds), a few license-tests had already completed with PASS and the rest were all SCHEDULED; everything seemed fine. Twenty-ish minutes later, after everything had finished, my watch pav status (which updates every 3 seconds) showed everything as PASS. I quit out (don't ask me why) and ran plain pav status to copy the contents into the ticket. The command hung, as did pav result and any permutation of pav log build/run $series/$id, etc. I even logged into snow from a different terminal session, loaded pavilion/2.0, and still could not access the test run. Using the cat command, I got the following status file from one of the tests that I had observed passing:

2020-10-21T11:21:08.484894 CREATED Created status file.
2020-10-21T11:21:08.485955 CREATED Test directory and status file created.
2020-10-21T11:21:08.490230 BUILD_CREATED Builder created.
2020-10-21T11:21:08.493727 CREATED Test directory setup complete.
2020-10-21T11:21:16.778993 BUILD_REUSED Test 171aceb2e5e39623 run 13242 reusing build.
2020-10-21T11:21:23.182369 SCHEDULED Test slurm has job ID 3814891.
2020-10-21T11:47:35.252408 PREPPING_RUN Converting run template into run script.
2020-10-21T11:47:35.255769 RUNNING Starting the run script.
2020-10-21T11:47:35.261001 RUNNING Currently running.
2020-10-21T11:47:35.282176 RUN_DONE Test run has completed.
2020-10-21T11:47:35.289546 RESULTS Parsing 6 result types.
2020-10-21T11:47:35.292427 RESULTS Performing 0 result evaluations.
2020-10-21T11:47:35.308790 COMPLETE The test completed with result: PASS
2020-10-21T12:02:03.342975 BUILD_CREATED Builder created.

The PASS is what I observed inside watch pav status. When I exited watch pav status the test status changed to BUILD_CREATED and was unreachable from pav status.
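A status file like the one above can be checked mechanically for entries written after the COMPLETE state. The following is a minimal sketch, not Pavilion's own code: it assumes each line has the form `<timestamp> <STATE> <note>` as in the log above, and the names `parse_status_line` and `states_after_complete` are hypothetical.

```python
from datetime import datetime

# Assumption: COMPLETE is meant to be the last state a test run records.
TERMINAL_STATE = "COMPLETE"


def parse_status_line(line):
    """Split one status line into (timestamp, state, note)."""
    parts = line.strip().split(" ", 2)
    if len(parts) < 2:
        raise ValueError("unrecognized status line: {!r}".format(line))
    stamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%f")
    note = parts[2] if len(parts) > 2 else ""
    return stamp, parts[1], note


def states_after_complete(lines):
    """Return every entry recorded after the first COMPLETE entry."""
    entries = [parse_status_line(line) for line in lines if line.strip()]
    for index, (_stamp, state, _note) in enumerate(entries):
        if state == TERMINAL_STATE:
            return entries[index + 1:]
    # No COMPLETE entry at all, so nothing can follow it.
    return []
```

Calling `states_after_complete(open(path).readlines())` on a test's status file (where `path` is wherever that file lives on your system) would list any late BUILD_CREATED entries along with their timestamps.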

I thought I'd make a note of it, as @kjeverson also could not access anything through pav status. I was able to work around this by running scancel -u $user; pav cancel --all; module unload and then rerunning my test. For whoever wants to investigate this further: s377 still hangs when called and can be poked at on Yellow.

CalvinDSeamons commented 4 years ago

Update: 0013259 appears to be the test that is hanging.

kjeverson commented 4 years ago

I found out that the hang was caused by one of the test_run's configs keeping a lock file (there was a config.lockfile in its directory).
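One way to look for leftover lock files of this kind is to walk the working directory and list them by age. This is a rough sketch under assumptions: the `.lockfile` suffix is taken from the config.lockfile name mentioned above, the function name `find_lockfiles` is hypothetical, and it does not use any Pavilion API.

```python
import os
import time


def find_lockfiles(root, suffix=".lockfile"):
    """Return (path, age_in_seconds) for each lock file under root, oldest first."""
    now = time.time()
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                age = now - os.stat(path).st_mtime
            except OSError:
                # The lock was released between listing and stat; skip it.
                continue
            found.append((path, age))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

A lock file that is many minutes old while no pav command is running is a likely candidate for the one causing the hang. Whether it is safe to delete depends on what still holds it.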

Other than that, all of the test_run statuses still end up in the BUILD_CREATED state:

2020-10-21T11:21:24.852975 RESULTS Parsing 6 result types.
2020-10-21T11:21:24.853997 RESULTS Performing 0 result evaluations.
2020-10-21T11:21:26.182685 COMPLETE The test completed with result: PASS
2020-10-21T15:52:09.327888 BUILD_CREATED Builder created.
2020-10-21T15:58:54.338062 BUILD_CREATED Builder created.

CalvinDSeamons commented 4 years ago

This was another test I thought I'd throw into the issue:

2020-10-21T11:21:08.484894 CREATED Created status file.
2020-10-21T11:21:08.485955 CREATED Test directory and status file created.
2020-10-21T11:21:08.490230 BUILD_CREATED Builder created.
2020-10-21T11:21:08.493727 CREATED Test directory setup complete.
2020-10-21T11:21:16.778993 BUILD_REUSED Test 171aceb2e5e39623 run 13242 reusing build.
2020-10-21T11:21:23.182369 SCHEDULED Test slurm has job ID 3814891.
2020-10-21T11:47:35.252408 PREPPING_RUN Converting run template into run script.
2020-10-21T11:47:35.255769 RUNNING Starting the run script.
2020-10-21T11:47:35.261001 RUNNING Currently running.
2020-10-21T11:47:35.282176 RUN_DONE Test run has completed.
2020-10-21T11:47:35.289546 RESULTS Parsing 6 result types.
2020-10-21T11:47:35.292427 RESULTS Performing 0 result evaluations.
2020-10-21T11:47:35.308790 COMPLETE The test completed with result: PASS
2020-10-21T12:02:03.342975 BUILD_CREATED Builder created.
2020-10-21T15:32:14.386945 BUILD_CREATED Builder created.
2020-10-21T15:33:05.268400 BUILD_CREATED Builder created.
2020-10-21T15:34:35.869780 BUILD_CREATED Builder created.
2020-10-21T15:36:40.955206 BUILD_CREATED Builder created.
2020-10-21T15:37:49.399664 BUILD_CREATED Builder created.
2020-10-21T15:39:05.718075 BUILD_CREATED Builder created.
2020-10-21T15:40:00.666830 BUILD_CREATED Builder created.
2020-10-21T15:45:19.124162 BUILD_CREATED Builder created.
2020-10-21T15:48:21.269450 BUILD_CREATED Builder created.
2020-10-21T15:48:53.542587 BUILD_CREATED Builder created.
2020-10-21T15:49:45.782932 BUILD_CREATED Builder created.
2020-10-21T15:50:25.110590 BUILD_CREATED Builder created.
2020-10-21T15:52:09.117636 BUILD_CREATED Builder created.