onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0
760 stars 319 forks source link

Parallelize ctest by default #669

Open NathanielMcVicar opened 3 years ago

NathanielMcVicar commented 3 years ago

Would it make sense to parallelize the ctest target by default ("make test" or "cmake --build . --target test") using a method similar to this https://cmake.org/cmake/help/latest/module/ProcessorCount.html#module:ProcessorCount? On some platforms these tests can take a long time to run serially, so it might be nice to let each test run on a different core (as opposed to making multiple cores available to each test as in the above link).

Before I submit this change, I wanted to confirm that the current serially behavior isn't preferred. Thanks!

AlexandreEichenberger commented 3 years ago

@gongsu832 Please respond to this one, not sure what is best, since we share some of these machines with others for the CIs

NathanielMcVicar commented 3 years ago

For what it's worth, the onnx backend tests are already parallel by default if pytest is installed, which is much more CPU hungry. In practice, currently, the most cores ctest will use is ~6.

gongsu832 commented 3 years ago

All tests including ctest are already parallelized in Dockerfiles. Specifically, ctest is parallelized with make ARGS=-j${NPROC} test. NPROC is set by .buildbot/jenkins-build-*.py using an algorithm that takes into account both CPU numbers and memory size.

# Set parallel jobs based on both CPU count and memory size.
# Because using CPU count alone can result in out of memory
# and get Jenkins killed. For example, we may have 64 CPUs
# (128 threads) and only 32GB memory. So spawning off 128
# cc/c++ processes is going to quickly exhaust the memory.
#
# Algorithm: NPROC = min(2, # of CPUs) if memory < 8GB, otherwise
#            NPROC = min(memory / 4, # of CPUs)
MEMORY_IN_GB                = (os.sysconf('SC_PAGE_SIZE') *
                               os.sysconf('SC_PHYS_PAGES') / (1024.**3))
NPROC                       = str(math.ceil(min(max(2, MEMORY_IN_GB/4), os.cpu_count())))
NathanielMcVicar commented 3 years ago

That makes sense, although -j will still override any defaults, so changing it won't result in a different behavior in CI that sets it manually. Regardless, if you prefer the current approach feel free to close this issue.

sstamenova commented 3 years ago

@gongsu832 : Does it make sense to add a default in the cmake files for people who don't use the dockerfiles? I would personally expect the tests to run in parallel unless I did something to make them not as opposed to the other way around.

gongsu832 commented 3 years ago

@NathanielMcVicar Not sure I follow what you are saying. Can you elaborate?

@sstamenova make test not running in parallel by default really is a make behavior. People who don't use the dockerfiles obviously need to run the command themselves and might be expecting the default behavior.

sstamenova commented 3 years ago

The issue is that other tools (such as ninja) will not run in parallel either. So ninja test does not run in parallel even though the default behavior for ninja is to run in parallel.

gongsu832 commented 3 years ago

The issue is that other tools (such as ninja) will not run in parallel either. So ninja test does not run in parallel even though the default behavior for ninja is to run in parallel.

I'm not following. Why doesn't ninja test run in parallel if the default is to run in parallel? Also I don't see how it is related to the make behavior?

NathanielMcVicar commented 3 years ago

I'd like to better understand the thought process here. In general, there is some default behavior for parallelizing (across unit test tools like ctest, pytest, etc. and build tools ninja, make, msbuild, etc.). Typically for most of these tools this is the number of cores. On platforms where that behavior isn't appropriate it can be overridden. If ONNX MLIR instead defaults to serial or small constant numbers of threads, and requires overrides for more parallel builds, that makes sense, but I would like to understand better. The current approach seems slightly inconsistent (for example, using full parallel builds by default with ninja).

As an aside, the issue with ninja vs. make is that ninja (by design) doesn't provide a way to pass arguments to targets (other than using the environment) like make does. See: https://github.com/ninja-build/ninja/issues/1387

gongsu832 commented 3 years ago

Yes $(nproc) was fine most of the time, until we ran into problem with a Jenkins CI server that kept getting killed because it had 128 CPUs (SMT) but only 32GB memory. That was why the algorithm to compute a better NPROC mentioned in https://github.com/onnx/onnx-mlir/issues/669#issuecomment-839293610 was introduced. All build processes and tests have been changed to use the computed NPROC, including building llvm-project by ninja in docker/Dockerfile.llvm-project. Outside of the Jenkins CI docker environment, you are obviously free to do whatever you prefer.

NathanielMcVicar commented 3 years ago

Just to close the loop on this, we have gone with forcing parallel test count using the environment for our own CI. It would be nice to have a consistent approach across the project, but given all the different CI systems that need support, it may not be practical as discussed here.

sstamenova commented 3 years ago

Actually, how about this - what it we added a cmake option to set the parallelism of the tests but had it default to 1? Then for anyone who is not expecting it, the behavior would not change, but for someone who wanted to add parallelism to all of the tests, they could specify it.

gongsu832 commented 3 years ago

Actually, how about this - what it we added a cmake option to set the parallelism of the tests but had it default to 1? Then for anyone who is not expecting it, the behavior would not change, but for someone who wanted to add parallelism to all of the tests, they could specify it.

I'm not sure why you need a cmake option for it. make test already behaves that way. It defaults to serial tests. If you want parallel tests, you just do make ARGS=-j$(nproc) test.

NathanielMcVicar commented 3 years ago

I think we are recovering ground from above now, but this solution doesn't work with ninja (https://github.com/ninja-build/ninja/issues/1387)

gongsu832 commented 3 years ago

So I'm not sure what we are trying to achieve here. Making parallel tests work for ninja? Isn't it a ninja problem? I don't want to start a make vs ninja war. People use whatever tool they prefer. I just don't see how this is onnx-mlir's business.

sstamenova commented 3 years ago

The goal is to have a consistent way of building and running tests that works cross-platform and cross-tool. llvm has mostly achieved this using lit to run the tests and with clever cmake defaults, so we would like to have the same experience with onnx-mlir as well.

Coming back to the tests specifically, ninja, make and Visual Studio ALL behave differently when you try to invoke a target and run the tests in parallel from the command line. At best we can achieve parity by setting environment variables such as CTEST_PARALLEL_LEVEL but that again breaks with the llvm convention of not polluting the environment with settings that one has to remember to set every time or that can impact other builds (setting CTEST_PARALLEL_LEVEL impacts all ctest runs from all projects). On the other hand, we can add a cmake variable that will control this behavior consistently across tools and platforms, will not pollute the environment , will be saved across runs and will provide parallelism for the tests (and unsurprisingly running tests in parallel is the llvm default, though there's a cmake option for lit to change that).

In an ideal world, all of the onnx-mlir tests would move to using lit (which can be used to run all sorts of tests, not just the kind of tests in check-onnx-lit) and then the settings would be controlled consistently for all of the test targets in onnx-mlir, but that is a bigger task than adding a cmake variable for parallel test execution.

gongsu832 commented 3 years ago

In an ideal world, everybody would be just using ARGS=-j$(nproc) cmake --build . --target test and be done with it. The problem is that everybody thinks they can write a better build tool so we end up with 100s of them doing pretty much the same thing.

sstamenova commented 3 years ago

Except ARGS=-j$(nproc) cmake --build . --target test does not work for ninja or Visual Studio only for make. And even if it did work for ninja or Visual Studio, it still wouldn't work on Windows because it's not valid syntax for the Windows command line. This is a very make-oriented approach while the most common way to build llvm is ninja (look at all the buildbots, for example), so I think ninja is the tool onnx-mlir should optimize for, not make which has been falling out of favor in general.

Building on all the work that has been done in llvm for build and test is going to be much more flexible and well-supported long-term.

gongsu832 commented 3 years ago

Of course it does not work for ninja, that's the whole point. On a specific system, if make works perfectly fine, why would I mess with ninja? If it doesn't work on another system (e.g., windows), that's a separate issue (I know the command doesn't work on windows. It's meant to be an illustration, not meant to be taken literally).

Sorry I'm not sure where you get the idea that make is falling out of favor in general. cmake by default still generates Makefiles on Unix-like systems including Mac OSX. And most Linux distros don't even install ninja by default.

Having said all these, if you want to reuse whatever LLVM did with ninja to do parallel tests, that's fine with me. As I mentioned in one of the posts above, outside the Jenkins CI environment, you can do pretty much whatever you want to build and test.

AlexandreEichenberger commented 3 years ago

@gongsu832 in light of your #958 PR, does it cover the intent of this issue or is there more that should be done? Tx