Closed gfr10598 closed 6 years ago
I'm wondering if Dedup is generic enough to warrant including in the m-lab/go packages?
Reviewed 1 of 1 files at r4. Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions.
bqext/dataset.go, line 166 at r5 (raw file):
/////////////////////////////////////////////////////////////////// // TODO - really should take the one that was parsed last, instead
If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup?
bqext/dataset.go, line 192 at r5 (raw file):
q := dsExt.DestinationQuery(queryString, dest.Table(table)) if overwrite { q.QueryConfig.WriteDisposition = bigquery.WriteTruncate
Could write disposition be a parameter to DestinationQuery
? The thinking being to possibly encapsulate more of the details of QueryConfig.
bqext/dataset_integration_test.go, line 87 at r5 (raw file):
// Note that a similar struct is defined in dataset.go, but this // one is used for testing the QueryAndParse method. type PartitionInfo struct {
Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized?
Or could we use bqext.PartitionInfo
safely in the tests instead?
bqext/dataset_integration_test.go, line 140 at r5 (raw file):
} func ClientOpts() []option.ClientOption {
clientOpts
?
bqext/dataset_integration_test.go, line 143 at r5 (raw file):
opts := []option.ClientOption{} if os.Getenv("TRAVIS") != "" { authOpt := option.WithCredentialsFile("../travis-testing.key")
This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here?
e.g. TRAVIS_CREDENTIALS_FILE
?
bqext/dataset_integration_test.go, line 158 at r5 (raw file):
// First check that source table has expected number of rows. // TestDedupSrc should have 6 rows, of which 4 should be unique.
This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves.
Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name.
bqext/dataset_test.go, line 136 at r5 (raw file):
func TestDestinationQuery(t *testing.T) { // Create a dummy client. client := cloudtest.NewChannelClient(make(chan *http.Response, 10))
Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 7 unresolved discussions.
bqext/dataset.go, line 166 at r5 (raw file):
If de-duplication depends on ordering of specific column names, such as "parse_time" does this belong in a generic package? Or, could this template be a parameter to Dedup?
It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it.
It definitely could be added as a parameter, so I'll add that now. Do we want to make it even more general by using some kind of opts... variadic parameter?
bqext/dataset.go, line 192 at r5 (raw file):
Could write disposition be a parameter to `DestinationQuery`? The thinking being to possibly encapsulate more of the details of QueryConfig.
Hmm - let me think about that some more.
bqext/dataset_integration_test.go, line 87 at r5 (raw file):
Since this is only a local test dependency, can we call this type something else to prevent confusion? Should the type name be uncapitalized? Or could we use `bqext.PartitionInfo` safely in the tests instead?
I'm not terribly concerned since this package is bqext_test. But I'll make it private.
bqext/dataset_integration_test.go, line 140 at r5 (raw file):
`clientOpts`?
Done.
bqext/dataset_integration_test.go, line 143 at r5 (raw file):
This relative path could become fragile. At run time could we set an env variable to the location and check for that variable and use the location here? e.g. `TRAVIS_CREDENTIALS_FILE`?
I'll add an issue. Probably pretty easy to do. https://github.com/m-lab/go/issues/11
bqext/dataset_integration_test.go, line 158 at r5 (raw file):
This test is very dependent on the state of an external project. This could discourage third-parties from considering adopting this for themselves. Since this is a live test anyway, what if this test constructed the 6-row table that would be du-duped by the rest of the test? This would make the setup self-contained in code. And, it would be possible to re-target the test with a single line change. I could even imagine reading an environment variable for the project name.
Agree. Already created https://github.com/m-lab/go/issues/8 Added comment.
bqext/dataset_test.go, line 136 at r5 (raw file):
Does this need a channel client? Can this use an 'ok client'? I don't see any pre-defined responses.
Done.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.
bqext/dataset.go, line 166 at r5 (raw file):
It does not depend on column ordering. It only will depend on things like parse_time or task_filename in order to perform some sanity checks. I'd prefer to leave it until later to generalize it. It definitely could be added as a parameter, so I'll add that now. Do we want to make it even more general by using some kind of opts... variadic parameter?
By ordering I meant as-in "parsed last".
When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier.
Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup?
If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like Filter
or Record
or something -- taking data from a query and writing it to a new table, "dedup" being one application of it.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.
bqext/dataset.go, line 166 at r5 (raw file):
By ordering I meant as-in "parsed last". When I was reading this earlier, I think I was trying to imagine whether Dedup would be plausibly useful to a third-party. But, that may be too broad a criteria. If I imagine the goal for m-lab/go being to add generally useful routines that we may use across our own repos, then that's easier. Something still feels funny. If I ran Dedup twice on the same source, would I get the same result? Are there any other implicit dependencies between the input table and the implementation of Dedup? If dedupTemplate was a parameter, perhaps the functionality of Dedup becomes immediately more generic. The operation becomes something like `Filter` or `Record` or something -- taking data from a query and writing it to a new table, "dedup" being one application of it.
I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-(
Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies.
I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think.
I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think.
With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions.
bqext/dataset.go, line 192 at r5 (raw file):
Hmm - let me think about that some more.
I think it should be and will add that.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 5 unresolved discussions.
bqext/dataset.go, line 166 at r5 (raw file):
I'm definitely using the "useful to other repos in m-lab" criteria. Irene might use this, except she isn't using Go. 8-( Yes if you run Dedup twice you should get the same result. I do not believe there are any other implicit dependencies. I kinda like the idea of making this a Filter method, but I don't like the idea of injecting an arbitrary query. Then it starts to look like DestinationQuery, I think. I could, however, extract the bottom bit of the code into a generic ExecQuery or something like that, for queries with not output expected. Let me try that out and you can see what you think. With this extraction, I'd also be content moving the Dedup method itself back to the etl cmd/dedup package.
I like the separation between "exec query" and "dedup" now.
bqext/dataset.go, line 165 at r9 (raw file):
// ExecDestQuery constructs a destination query, executes it, and returns status or error. func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {
nit: I would advocate for symmetry across names like DestinationQuery
and ExecDestQuery
. i.e. either both use the full name or both use the short names.
bqext/dataset.go, line 165 at r9 (raw file):
// ExecDestQuery constructs a destination query, executes it, and returns status or error. func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {
destTable
is a much better parameter than the three project/dataset/table parameters before. :+1:
bqext/dataset.go, line 165 at r9 (raw file):
// ExecDestQuery constructs a destination query, executes it, and returns status or error. func (dsExt *Dataset) ExecDestQuery(query string, disposition bigquery.TableWriteDisposition, destTable *bigquery.Table) (*bigquery.JobStatus, error) {
ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 4 unresolved discussions.
bqext/dataset.go, line 165 at r9 (raw file):
nit: I would advocate for symmetry across names like `DestinationQuery` and `ExecDestQuery`. i.e. either both use the full name or both use the short names.
Done.
bqext/dataset.go, line 165 at r9 (raw file):
ExecDestQuery sounds like it executes a value returned by DestinationQuery. Is that possible? Running and Waiting for a query seems like a generally useful pattern.
Done.
Comments from Reviewable
Review status: 1 of 4 files reviewed at latest revision, 1 unresolved discussion.
bqext/dataset.go, line 144 at r10 (raw file):
} // DestinationQuery constructs a query with common Config settings for
s/DestinationQuery/DestQuery/
Comments from Reviewable
Adds DestinationQuery() and Dedup()
This was previously reviewed as https://github.com/m-lab/etl/pull/360, but based on feedback decided to move it to the m-lab/go repo.
This change is