More bigquery Dataset extensions for dedupping

gfr10598 commented 6 years ago

Adds DestinationQuery() and Dedup()

This was previously reviewed as https://github.com/m-lab/etl/pull/360, but based on feedback decided to move it to the m-lab/go repo.

This change is

coveralls commented 6 years ago

Coverage decreased (-48.2%) to 32.867% when pulling a8c41967c062c75eeae81aab47cbff3664a40098 on dedup into 61205efb336e9b88efa0e13382954f6e39575cb4 on master.

coveralls commented 6 years ago

Coverage decreased (-48.2%) to 32.867% when pulling de037343246f4a7a6b66b647b5e9e23da0905052 on dedup into 61205efb336e9b88efa0e13382954f6e39575cb4 on master.

coveralls commented 6 years ago

Coverage decreased (-19.4%) to 32.867% when pulling b3ff35f65e04b11427c0047c5b422e33e335e247 on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

coveralls commented 6 years ago

Coverage decreased (-19.4%) to 32.867% when pulling 49552b34017197eda3132a83565e7c5060932d9e on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

coveralls commented 6 years ago

Coverage decreased (-19.4%) to 32.867% when pulling 4c6b873f0c3c593652b906634042479c44183b8e on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

coveralls commented 6 years ago

Coverage decreased (-5.4%) to 46.853% when pulling a7bc47412d2dd45028f16773a7e6299218857b2e on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

stephen-soltesz commented 6 years ago

I'm wondering if Dedup is generic enough to warrant including in the m-lab/go packages?

Reviewed 1 of 1 files at r4. Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions.

bqext/dataset.go, line 166 at r5 (raw file):

///////////////////////////////////////////////////////////////////

// TODO - really should take the one that was parsed last, instead

If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup?

bqext/dataset.go, line 192 at r5 (raw file):

  q := dsExt.DestinationQuery(queryString, dest.Table(table))
  if overwrite {
      q.QueryConfig.WriteDisposition = bigquery.WriteTruncate

Could write disposition be a parameter to DestinationQuery? The thinking being to possibly encapsulate more of the details of QueryConfig.

bqext/dataset_integration_test.go, line 87 at r5 (raw file):

// Note that a similar struct is defined in dataset.go, but this
// one is used for testing the QueryAndParse method.
type PartitionInfo struct {

Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized?

Or could we use bqext.PartitionInfo safely in the tests instead?

bqext/dataset_integration_test.go, line 140 at r5 (raw file):

}

func ClientOpts() []option.ClientOption {

clientOpts?

bqext/dataset_integration_test.go, line 143 at r5 (raw file):

  opts := []option.ClientOption{}
  if os.Getenv("TRAVIS") != "" {
      authOpt := option.WithCredentialsFile("../travis-testing.key")

This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here?

e.g. TRAVIS_CREDENTIALS_FILE?

bqext/dataset_integration_test.go, line 158 at r5 (raw file):


  // First check that source table has expected number of rows.
  // TestDedupSrc should have 6 rows, of which 4 should be unique.

This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves.

Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name.

bqext/dataset_test.go, line 136 at r5 (raw file):

func TestDestinationQuery(t *testing.T) {
  // Create a dummy client.
  client := cloudtest.NewChannelClient(make(chan *http.Response, 10))

Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses.

Comments from Reviewable

gfr10598 commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions.

bqext/dataset.go, line 166 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup?

It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it.

It definitely could be added as a parameter, so I'll add that now. Do we want to make it even more general by using some kind of opts... variadic parameter?

bqext/dataset.go, line 192 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Could write disposition be a parameter to `DestinationQuery`? The thinking being to possibly encapsulate more of the details of QueryConfig.

Hmm - let me think about that some more.

bqext/dataset_integration_test.go, line 87 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized? Or could we use `bqext.PartitionInfo` safely in the tests instead?

I'm not terribly concerned since this package is bqext_test. But I'll make it private.

bqext/dataset_integration_test.go, line 140 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

`clientOpts`?

Done.

bqext/dataset_integration_test.go, line 143 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here? e.g. `TRAVIS_CREDENTIALS_FILE`?

I'll add an issue. Probably pretty easy to do. https://github.com/m-lab/go/issues/11

bqext/dataset_integration_test.go, line 158 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves. Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name.

Agree. Already created https://github.com/m-lab/go/issues/8 Added comment.

bqext/dataset_test.go, line 136 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses.

Done.

Comments from Reviewable

coveralls commented 6 years ago

Coverage decreased (-5.4%) to 46.853% when pulling ad6366374909e40fe9fc98781961f664f8beedd6 on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

stephen-soltesz commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

bqext/dataset.go, line 166 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it. It definitely could be added as a parameter, so I'll add that now. Do we want to make it even more general by using some kind of opts... variadic parameter?

By ordering I meant as-in "parsed last".

When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier.

Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup?

If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like Filter or Record or something -- taking data from a query and writing it to a new table, "dedup" being one application of it.

Comments from Reviewable

gfr10598 commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

bqext/dataset.go, line 166 at r5 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

By ordering I meant as-in "parsed last". When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier. Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup? If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like `Filter` or `Record` or something -- taking data from a query and writing it to a new table, "dedup" being one application of it.

I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-(

Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies.

I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think.

I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think.

With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package.

Comments from Reviewable

gfr10598 commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions.

bqext/dataset.go, line 192 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

Hmm - let me think about that some more.

I think it should be and will add that.

Comments from Reviewable

coveralls commented 6 years ago

Coverage decreased (-5.04%) to 47.183% when pulling bb80dce5156ec47b3d654694bd98468e9bcc8283 on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

coveralls commented 6 years ago

Coverage decreased (-4.3%) to 47.887% when pulling dce14580d937311f90d9cd5a997d3d82313f0b1e on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

coveralls commented 6 years ago

Coverage decreased (-4.3%) to 47.887% when pulling e9b7a3c717264bf91e28d51fabb6cc5198e52b25 on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

stephen-soltesz commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 5 unresolved discussions.

bqext/dataset.go, line 166 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…

I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-( Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies. I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think. I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think. With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package.

I like the separation between "exec query" and "dedup" now.

bqext/dataset.go, line 165 at r9 (raw file):


// ExecDestQuery constructs a destination query, executes it, and returns status or error.
func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {

nit: I would advocate for symmetry across names like DestinationQuery and ExecDestQuery. i.e. either both use the full name or both use the short names.

bqext/dataset.go, line 165 at r9 (raw file):


// ExecDestQuery constructs a destination query, executes it, and returns status or error.
func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {

destTable is a much better parameter than the three project/dataset/table parameters before. :+1:

bqext/dataset.go, line 165 at r9 (raw file):


// ExecDestQuery constructs a destination query, executes it, and returns status or error.
func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {

ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern.

Comments from Reviewable

gfr10598 commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 4 unresolved discussions.

bqext/dataset.go, line 165 at r9 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

nit: I would advocate for symmetry across names like `DestinationQuery` and `ExecDestQuery`. i.e. either both use the full name or both use the short names.

Done.

bqext/dataset.go, line 165 at r9 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern.

Done.

Comments from Reviewable

gfr10598 commented 6 years ago

PTAL.

Comments from Reviewable

stephen-soltesz commented 6 years ago

Review status: 1 of 4 files reviewed at latest revision, 1 unresolved discussion.

bqext/dataset.go, line 144 at r10 (raw file):

}

// DestinationQuery constructs a query with common Config settings for

s/DestinationQuery/DestQuery/

Comments from Reviewable

coveralls commented 6 years ago

Coverage decreased (-6.3%) to 45.946% when pulling a8f36049b3dd900d1b336783641bbf048fc252f5 on dedup into 7ac696d6c9c9e759b395ed277bf974c39b47ab81 on master.

m-lab / go

More bigquery Dataset extensions for dedupping #10