googleapis / gapic-generator

Tools for generating API client libraries from API Service Configuration descriptions.

Move discovery baseline files into a separate repo #201

Closed · garrettjonesgoogle closed this 7 years ago

garrettjonesgoogle commented 8 years ago

Leave only some basic test files in toolkit. Having the complete set of files in toolkit makes reviewing a PR an enormous pain, because it is so slow.

vchudnov-g commented 8 years ago

Then we should look at other ways of handling code reviews. I have a to-do item to follow up on this. I think it makes sense to have the baseline files here.

garrettjonesgoogle commented 8 years ago

For GAPIC, we handle code reviews by inspecting the baseline files for library.proto, which is a special testing API. This exercises the cases we care about. We don't put the entirety of our generated API × language surfaces into testdata/ for testing.

bjwatson commented 8 years ago

I noticed that yesterday. It seems like a lot of overhead to keep all those baseline files up to date whenever something changes in discovery sample gen.

tcoffee-google commented 8 years ago

Because we don't control the discovery-based client libraries the way we do for GAPIC, we don't know how many cases there are that we might care about, so we cannot reasonably construct a comprehensive testing API. New APIs continue to turn up new bugs, so we baseline-test them all.

garrettjonesgoogle commented 8 years ago

@tcoffee-google sure, that's definitely something that will happen. But think about this scenario in other tools: do the tests for gcc include all of the C and C++ code that exists in the world, to ensure that the compiler works for all cases? I haven't checked, but I'm reasonably confident that they don't. For GAPIC, we also don't include all APIs in our baselines, so we will discover broken cases outside of toolkit. The purpose of baseline tests is not to ensure that every single case in the world is correct; it is to ensure that known use cases are correct. Comprehensive testing should be done outside of toolkit, both for GAPIC (true already) and Discovery (not true now). Otherwise, changes are burdensome.

tcoffee-google commented 8 years ago

@garrettjonesgoogle I think there's a fundamental difference, in that C/C++ programs and GAPIC-generated code comply with a known specification, whereas discovery-based client libraries do not. The sample-gen code we have for discovery was engineered by iterating against tests across all APIs until we had no further errors; we do not presently have any more explicit specification of what the "known use cases" are. And had we made the effort to create one, it would already have been made obsolete by the most recent discovery API update we saw (for the Logging API). This effort will probably become worthwhile at some point, likely after we expand our API coverage, but for now I think it will be easier to fix our reviewing tools, which will have many other benefits.

garrettjonesgoogle commented 8 years ago

@tcoffee-google, making toolkit responsible for comprehensive Discovery coverage conflates the notions of a toolkit and a consumer of toolkit. It was fine when we were starting out, because we could iterate quickly, but the discovery stuff is a lot of noise and wasted time for non-discovery cases (e.g. running discovery tests on every build). I appreciate the difficulties you are facing in trying to cover discovery use cases, but it should not be in the scope of toolkit; it should be toolkit plus a separate repo for comprehensive discovery tests.

tl;dr: as things increase in scope and size, we need to separate major divisions into their own repos.

tcoffee-google commented 8 years ago

@garrettjonesgoogle I appreciate the overhead this is causing for GAPIC. Assuming we try to refactor test coverage as soon as we reasonably can, I would prefer to add a separate build task for comprehensive testing rather than create a new repo. (We are already having issues keeping repos in sync, so I generally advocate for fewer rather than more.) Does that seem reasonable to you?
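A minimal sketch of what that separate task might look like in the toolkit build.gradle (the `'**/discovery/**'` pattern and the task name are assumptions about the layout, not the actual configuration):

```groovy
// Sketch: exclude the slow discovery baseline tests from the default
// `test` task so that every build stays fast...
test {
    exclude '**/discovery/**'
}

// ...and expose them as an opt-in task for comprehensive runs.
task discoveryBaselineTest(type: Test) {
    description = 'Runs the comprehensive discovery baseline tests.'
    testClassesDirs = sourceSets.test.output.classesDirs
    classpath = sourceSets.test.runtimeClasspath
    include '**/discovery/**'
}
```

With something like this, `gradle build` skips the baselines, and `gradle discoveryBaselineTest` (or a dedicated CI job) runs them on demand.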

garrettjonesgoogle commented 8 years ago

Yes, that would help a lot in the short term.

vchudnov-g commented 8 years ago

On the testing side: Ideally, our checkin tests should have some sort of logical trigger, so that if you change something that only affects Discovery, only those tests should be run, and if you change something that only affects GAPIC, only those tests should be run. So having a separate test task for the baseline tests is a step towards that. OTOH, as long as all our tests are reasonably fast, it's simpler to just run them all.
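As one possible shape for that trigger, a CI script could diff the changed paths and decide which test tasks to run; a rough Groovy sketch (the path prefixes and the `discoveryBaselineTest` task are assumptions carried over from the sketch above):

```groovy
// Sketch: run the discovery baseline tests only when the diff touches
// discovery code or its baselines. Path prefixes are assumptions about
// the repo layout.
def changed = 'git diff --name-only origin/master...HEAD'.execute().text.readLines()
def touchesDiscovery = changed.any {
    it.startsWith('src/main/java/com/google/api/codegen/discovery') ||
            it.contains('/discovery/testdata/')
}
if (touchesDiscovery) {
    println 'Discovery files changed; running baseline tests'
    // Hypothetical: invoke the separate Gradle task sketched earlier.
    // './gradlew discoveryBaselineTest'.execute().waitFor()
}
```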

On the repo side: I've been thinking a lot about this, and I think there are two related issues. One is making sure that discoGen works, which calls for a minimalist test suite that is functionally comprehensive (i.e. each distinct case is covered O(1) times, not combinatorially, except for combinations that we know a priori are special). These tests verify that the tool itself works, and they should live where the tool source lives.

The second issue is making sure that there are no surprises in the APIs that we generate, which I think was the original purpose of the baseline tests. However, do we still need these tests if we have the functional tests I suggest in the previous paragraph? It seems to me the only value we're getting out of these baseline tests is seeing how a change in the templates affects the output for every single API we bothered to include. It would be more compact to simply capture the output of a few (possibly synthetic) APIs in the functional tests: we would still see how the output changes in each of the patterns in each language. It seems like this is the only actionable information the baseline tests currently give us, and in a very repetitive form at that. Is there additional value in the baseline tests currently?

A potential third type of test is periodically running discoGen against the currently published Discovery files, just to make sure somebody hasn't started exercising a Discovery feature that we didn't know about. The purpose of these tests would be to ensure that discoGen succeeds, but not to capture or compare the output. This type of test should not be a checkin test (the other tests catch our code breaking something; this type would instead check whether unexpected input breaks us). It should run periodically in the background, maybe once daily. I think having this test live in this repo would be fine, though I'm open to arguments to house it elsewhere.
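Such a check could be as small as a script that walks the public Discovery directory and attempts generation on each document; a rough Groovy sketch (the generation call itself is hypothetical and left commented out, since it depends on the real discoGen entry point):

```groovy
import groovy.json.JsonSlurper

// Fetch the public directory of all published Discovery documents.
def directory = new JsonSlurper().parse(
        new URL('https://www.googleapis.com/discovery/v1/apis'))

def failures = []
directory.items.each { api ->
    try {
        println "Generating from ${api.name} ${api.version}"
        // Hypothetical call; substitute the real discoGen entry point.
        // generateFragments(api.discoveryRestUrl)
    } catch (Exception e) {
        // Record the failure but keep going, so one bad API doesn't
        // mask problems in the rest.
        failures << "${api.name} ${api.version}: ${e.message}"
    }
}

if (failures) {
    throw new RuntimeException('discoGen failed for:\n' + failures.join('\n'))
}
```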

Thoughts?

garrettjonesgoogle commented 8 years ago

I agree with your statements for the most part as I understand them:

- I don't know that we need a trigger to narrow down build-time tests; we should just make sure that they scale with the code (not with the number of APIs), and that they're all fast.

vchudnov-g commented 8 years ago

With your second summary point, I would clarify that the periodic tests against current Discovery files should be run automatically (so we get alerted reasonably soon after a problem arises), though with a long period (perhaps once nightly).

@tcoffee-google : What is your opinion on the additional value of the current baseline tests?

saicheems commented 7 years ago

Closing this under the assumption that the referenced issue is the solution we'll move towards.