ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License

Performance testing questions #178

Closed by graybeal 3 years ago

graybeal commented 3 years ago

These are questions that have come up as I've been reviewing the performance tests on the pages https://docs.google.com/spreadsheets/d/1NmnzrNmWvLYMiaBjHEqjTi__O4G0lKqZmJHXvt3BHxw/edit#gid=757293230 and https://docs.google.com/spreadsheets/d/1NmnzrNmWvLYMiaBjHEqjTi__O4G0lKqZmJHXvt3BHxw/edit#gid=1388250908, and the summary page at https://docs.google.com/spreadsheets/d/1NmnzrNmWvLYMiaBjHEqjTi__O4G0lKqZmJHXvt3BHxw/edit#gid=61102189.

  1. Why are the Jenkins test setup times (see summary page) consistently longer for Allegrograph than 4store? For the API, as much as 80% longer. Is this related to the fact that the time spent on Jenkins overhead tasks changes a lot from one test run to the next? Is it possible that Allegrograph is filling more of our Appliance memory resources, forcing more memory swapping during testing?
  2. In API: How is it that there are two 'test_download_acl_only' tests, and in one of them AG is much faster, while in the other 4store is much faster? (These were the two most significant variations between the two systems!) And if you look at columns BL and BM for 4store results, you will see that the value is always either around 3-4 or around 30, but there is no consistency: each pair of values in BL and BM has one value around 3-4 and the other around 30, but which value is in which column varies.
  3. In API: Why do the test_download_ontology tests, and one of the test_download_acl_only tests, take 25-30 seconds in 4store and 60 seconds in AG? What operations take so long, and why do they take longer in AG?
  4. Linked Data: Can we take a quick look at the following tests to see why they might be so much slower in AG? (Just to see what they do and if there is perhaps a complex query that can be optimized.) test_xml_literal_serialization; test_ontology_delete; test_submission_parse; test_automaster_from_zip; and test_target_class. There are many more that are also more expensive, but I'd like to see if there is an obvious pattern in the top ones.
alexskr commented 3 years ago

We did not detect any memory pressure on the system running the Jenkins unit tests; there was no memory swapping at the system level at any time.

The 'Total Jenkins-Only Time' column name is not very clear and might be a bit misleading, since it doesn't really have anything to do with Jenkins itself. Jenkins does some setup before running the unit tests and then cleans up after itself, but that number is meaningless in this benchmarking context and is not reflected anywhere in those spreadsheets. You should see the same behavior even if you run the unit tests outside of the Jenkins environment.

Perhaps more verbose output should be added to the unit test script to indicate how long it takes to load test data into the triple store and do other prep work for individual unit tests.
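
One low-effort way to get that visibility (a sketch only, assuming the tests keep calling the existing create_ontologies_and_submissions helper shown later in this thread) would be to wrap the prep call with Ruby's stdlib Benchmark and print the elapsed time:

require 'benchmark'

# Sketch: time the data load / prep step so it shows up in the test output.
prep_time = Benchmark.realtime do
  @onts = create_ontologies_and_submissions(ont_count: 1,
                                             submission_count: 1,
                                             process_submission: true)
end
puts "test data load + prep took #{prep_time.round(2)}s"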

I vaguely recall that AG has an internal indexing mechanism that runs in the background after data is loaded, so I'm guessing that either the data load in AG is more expensive, or AG is not as performant before indexing has fully completed, or the data never gets warm enough. To test that, perhaps the initial data load process should incorporate indexing of the AG data (if there is such a thing), and/or we could run multiple unit tests against the same data without reloading it from scratch every time. If subsequent tests improve, we would have a potential explanation.

However, do we really want to do that? There is some value in it, but do we really want to treat unit tests as a benchmarking tool? Also, how would the various caching layers play into the real performance of a live system? I think more effort should perhaps be invested in developing a dedicated benchmarking/performance test utility than in trying to shoehorn unit tests into doing comprehensive benchmarking for us.
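
A quick way to probe the warm-up hypothesis (purely a sketch; it assumes the rack-test style get/last_response helpers that the API controller tests appear to use, and the endpoint is only illustrative) would be to load the data once and then time the same request several times in a row:

require 'benchmark'

# Load the test data a single time.
create_ontologies_and_submissions(ont_count: 1, submission_count: 1,
                                  process_submission: true)

# Hit the same endpoint repeatedly; if later calls are noticeably faster on AG,
# background indexing / warm-up is a plausible explanation.
5.times do |i|
  elapsed = Benchmark.realtime { get "/ontologies" }
  puts "request #{i + 1}: #{elapsed.round(3)}s (status #{last_response.status})"
end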

mdorf commented 3 years ago

The "Jenkins-Only Time" is a bit misleading. I didn't want to rename it to keep the naming consistent with the original benchmarking results. The numbers have little to do with Jenkins and much to do with the bootstrapping of the tests themselves. For example, when I execute a single test locally, the difference between the total run time and the actual test runtime becomes greater, since the bootstrapping process needs to complete regardless of how many tests are being run:

4store
------
TestOntologySubmissionsController#test_download_acl_only = 2.24 s = .
Finished tests in 17.570051s, 0.0569 tests/s, 0.1707 assertions/s.

AG
---
TestOntologySubmissionsController#test_download_acl_only = 1.62 s = .
Finished tests in 12.177920s, 0.0821 tests/s, 0.2463 assertions/s.
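
(From these runs, the bootstrapping overhead works out to roughly 17.57 - 2.24 ≈ 15.3 s for 4store and 12.18 - 1.62 ≈ 10.6 s for AG, i.e. the bulk of the wall-clock time of a single-test run.)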
mdorf commented 3 years ago

test_download_acl_only appears twice because the same name is used in both TestOntologiesController and TestOntologySubmissionsController. The report header strips the controller name, so it looks like the same test is run twice.

I've compared the code in both instances of this test, and the main difference between them is that TestOntologiesController#test_download_acl_only both creates an ontology/submission and parses it, whereas TestOntologySubmissionsController#test_download_acl_only only creates one. So it appears that whenever parsing is involved, AG performance dips:

TestOntologiesController#test_download_acl_only
-----------------------------------------------
ont = create_ontologies_and_submissions(ont_count: 1, submission_count: 1, process_submission: true)[2].first

4store
------
TestOntologiesController#test_download_acl_only = 30.93 s = .
Finished tests in 47.084130s, 0.0212 tests/s, 0.0637 assertions/s.

AG
--
TestOntologiesController#test_download_acl_only = 40.38 s = .
Finished tests in 51.964010s, 0.0192 tests/s, 0.0577 assertions/s.

TestOntologySubmissionsController#test_download_acl_only
--------------------------------------------------------
count, created_ont_acronyms, onts = create_ontologies_and_submissions(ont_count: 1, submission_count: 1, process_submission: false)

4store
------
TestOntologySubmissionsController#test_download_acl_only = 2.24 s = .
Finished tests in 17.570051s, 0.0569 tests/s, 0.1707 assertions/s.

AG
--
TestOntologySubmissionsController#test_download_acl_only = 1.62 s = .
Finished tests in 12.177920s, 0.0821 tests/s, 0.2463 assertions/s.
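
Since the only difference between the two tests is the process_submission flag, one rough way to isolate the parsing cost (a sketch only, reusing just the helper shown above; in practice the second call may need a cleanup step or a different acronym between runs) would be to time the two variants back to back on each triple store and compare the differences:

require 'benchmark'

# Create only (no parsing).
create_only = Benchmark.realtime do
  create_ontologies_and_submissions(ont_count: 1, submission_count: 1,
                                    process_submission: false)
end

# Create and parse.
create_and_parse = Benchmark.realtime do
  create_ontologies_and_submissions(ont_count: 1, submission_count: 1,
                                    process_submission: true)
end

puts "create only:        #{create_only.round(2)}s"
puts "create + parse:     #{create_and_parse.round(2)}s"
puts "approx. parse cost: #{(create_and_parse - create_only).round(2)}s"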
graybeal commented 3 years ago

Good points about the setup time being included in 'Jenkins only', and of course that setup time depends on the triple store. That's too bad; it suggests there are things about AG that can be significant time sinks.

Good news about no memory pressure, thanks.

Internal indexing could also have some interesting impacts, hadn't thought of that possibility.

Thanks for explaining the 'duplicated test'. I'm not sure we understand yet what's happening in columns BL and BM for 4store.

graybeal commented 3 years ago

To these points (and the possible impacts of caching, setup times, initial indexing, 'warmup times', actual loads in the production system, and other things on test results):

do we really want to treat unit tests as benchmarking tool? I think that perhaps more effort should be invested in developing a dedicated benchmarking/performance test utility than trying to shoehorn unit tests to do comprehensive benchmarking for us.

I don't think the meaningful choice is between comprehensive benchmarking and what we're doing. Comprehensive benchmarking would be intensely time-consuming, and would inevitably test only a narrow slice of the actual use cases, which might or might not be the use cases we end up caring about. I just don't think we could do a good job of it in less than 3-4 person-months, even if we wanted to do it, and even that wouldn't prove AG would work.

So I claim the benchmarking choices are more like these:

  (A) Do no benchmarking at all.
  (B) Do a quick benchmark using existing measurable code and take the results at face value.
  (C) Do a quick benchmark using existing measurable code, and take a quick but close look at the results of that benchmark, to see if we can identify likely issues that would either (i) make the benchmark invalid, or (ii) make the installation of AG on the production system undesirable unless those issues are fixed.

I didn't think (A) was a reasonable path, because it wouldn't give us the chance to see even major issues with our interface to AG. Similarly, I don't think (B) would give us a chance to spot the most troublesome issues the transition might face.

So I'm proposing something in the neighborhood of (C), where we do a reality check to look for odd or particularly negative outcomes. This ticket grabbed the most likely culprits that I saw after an hour or so of looking at the values.

graybeal commented 3 years ago

Interesting that AG looks 30% faster at creating the ontology, and 30% slower at parsing one.
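
(Rough arithmetic from the numbers above, treating the unprocessed test as 'create' and the processed-minus-unprocessed difference as 'parse': create is 2.24 s in 4store vs 1.62 s in AG, so AG is about 28% faster; parse is roughly 30.93 - 2.24 ≈ 28.7 s in 4store vs 40.38 - 1.62 ≈ 38.8 s in AG, so AG is about 35% slower.)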

graybeal commented 3 years ago

Based on Misha's comments in our call today, I think item 4 in the ticket may be addressed by noting that most of the tests in which AG performs particularly poorly involve parsing (and submitting?) ontologies. A question remains in that case: what is the core activity that is taking longer in AG, and can we distill it into an example for the Franz folks to take a look at?

The two things that are still open in my mind are:

(2b) If you look at columns BL and BM for 4store results, you will see that the value is always either around 3-4 or around 30, but there is no consistency: each pair of values in BL and BM has one value around 3-4 and the other around 30, but which value is in which column varies.

(3) In API: Why do the test_download_ontology tests, and one of the test_download_acl_only tests, take 25-30 seconds in 4store and 60 seconds in AG? What operations take so long, and why do they take longer in AG?