Some system tests take a very long time to run

marcoskirsch commented 4 years ago

nimi-bot system tests now run in parallel, with a job for each bitness (win32 vs win64) and for each module (nidmm, niscope, nidcpower, nise, etc.)

I noticed some jobs consistently take a very long time compared to the rest. Take the system tests run as part of #1258:

system-tests/win32/nidcpower  Took 2 min 22 sec on nimibot-win-32-2
system-tests/win32/nidigital  Took 2 min 5 sec on nimibot-win-32-1
system-tests/win32/nidmm      Took 2 min 59 sec on nimibot-win-32-2
system-tests/win32/nifgen     Took 3 min 48 sec on nimibot-win-32-2
system-tests/win32/nimodinst  Took 1 min 59 sec on nimibot-win-32-1
system-tests/win32/niscope    Took 3 min 40 sec on nimibot-win-32-1
system-tests/win32/nise       Took 2 min 39 sec on nimibot-win-32-1
system-tests/win32/niswitch   Took 8 min 58 sec on nimibot-win-32-2
system-tests/win32/nitclk     Took 12 sec on nimibot-win-32-1
system-tests/win64/nidcpower  Took 2 min 31 sec on nimibot-win-64-1
system-tests/win64/nidigital  Took 3 min 1 sec on nimibot-win-64-1
system-tests/win64/nidmm      Took 2 min 47 sec on nimibot-win-64-1
system-tests/win64/nifgen     Took 4 min 24 sec on nimibot-win-64-1
system-tests/win64/nimodinst  Took 2 min 20 sec on nimibot-win-64-2
system-tests/win64/niscope    Took 3 min 37 sec on nimibot-win-64-2
system-tests/win64/nise       Took 2 min 49 sec on nimibot-win-64-2
system-tests/win64/niswitch   Took 9 min 35 sec on nimibot-win-64-2
system-tests/win64/nitclk     Took 14 sec on nimibot-win-64-1

By far the worse offender is niswitch which takes over twice as long as the rest of them. We really need to look into what's causing it and fix it.

marcoskirsch commented 4 years ago

The most likely reason I see of why niswitch and nifgen are pretty slow compared to the rest is that they open simulated sessions to devices whose driver runtime uses the DAQmx framework under the hood.

This is:

notoriously slow especially on larger systems (5 seconds not extraordinary)
because of an old bug in the DAQmx framework, must be serialized

System tests for each job runs in a separate agent, but we are running tests for 5 versions of Python in parallel. That would be a lot of multi-second tests gettting serialized!

The mechanism in which tests that use session-based simulation with DAQmx framework devices are serialized is a global lock we call daqmx_sim_db_lock. Just grep for that keyword.

This is only used in src/nifgen/system_tests/test_system_nifgen.py and src/niswitch/system_tests/test_system_niswitch.py.

So a potential way to make these tests faster are:

Simulate a model that doesn't use DAQmx framework. Those work fast and reliably and that's why we use elsewhere. May not be possible due to driver runtime bug (make sure it's filed and referrenced in test code!) or because the thing being tested only exists on specific models.
Use persistent simulation for those tests. The drawback is that now there are more requirements on the system running the tests.

marcoskirsch commented 4 years ago

I looked at the nimi-bot logs from PR #1268. For example, in job win32/job/niswitch/274/console there are some obvious offenders:

========================== slowest 5 test durations ===========================
223.97s call     src/niswitch/system_tests/test_system_niswitch.py::test_continuous_software_scanning
153.20s call     src/niswitch/system_tests/test_system_niswitch.py::test_enum_attribute
0.12s setup    src/niswitch/system_tests/test_system_niswitch.py::test_relayclose
0.08s call     src/niswitch/system_tests/test_system_niswitch.py::test_write_only_attribute
0.07s call     src/niswitch/system_tests/test_system_niswitch.py::test_error_message

As expected, those are tests that use session-based simulation of a device that uses the DAQmx driver framework. But they should really not so long. One mistake is that the global lock we introduced is only needed during Session creation and destruction / close. But we hold on to it for the duration of the test.

We should have a goal that no test takes over 1 second.

tobiasgordon commented 3 years ago

The intermittent failure due to simulated device creation race conditions is being tracked in the internal NI Bug #AB1235678. Fixing that should also improve the test times of those that use session-based simulation of devices using the DAQmx driver framework.

marcoskirsch commented 2 years ago

Following data based on latest runs using new nimibot system (AWS hosted VMs + GitHub Actions). Worst offenders are:

system_tests_win32 (nidigital) succeeded 22 hours ago in 43m 19s
system_tests_win32 (nitclk) succeeded 22 hours ago in 38m 9s

This are the ones we should focus on optimizing because they are the long poles at the moment.

ni / nimi-python

Some system tests take a very long time to run #1264