m-lab / etl-embargo

Apache License 2.0
0 stars 1 forks source link

Unembargo PR #7

Closed yachang closed 7 years ago

yachang commented 7 years ago

This change is Reviewable

stephen-soltesz commented 7 years ago

Reviewed 2 of 4 files at r1, 1 of 1 files at r2. Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions.


unembargo.go, line 44 at r3 (raw file):


import (
  "fmt"

Please use log for reporting errors (or Greg found glog recently https://github.com/golang/glog which looks even better).


unembargo.go, line 61 at r3 (raw file):

  current_time := time.Now().UTC().Format("2006-01-02")
  current_year, _ := strconv.Atoi(current_time[0:4])
  cutoff_date, _ := strconv.Atoi(strconv.Itoa(int(current_year-1)) + current_time[5:7] + current_time[8:10])

Since we want an int in the end, what do you think of using the time.Year, time.Month, and time.Day functions directly rather than converting to / from ascii strings?


unembargo.go, line 69 at r3 (raw file):


// The date is used as prefixFileName in format sidestream/yyyy/mm/dd
func UnEmbargoOneDayLegacyFiles(sourceBucket string, destBucket string, prefixFileName string) bool {

Please return actual error values. Returning an actual err value gives the caller more flexibility. If all they want is success / fail, then the caller can check err != nil. And, if the caller wants to know more they have the option of inspecting the error further.


unembargo.go, line 81 at r3 (raw file):

  for {
      destinationFiles := unembargoService.Objects.List(destBucket)
      if destPageToken != "" {

Is this condition necessary?


unembargo.go, line 101 at r3 (raw file):

  // Copy files.
  pageToken := ""
  for {

This function is pretty long. Is there a way to divide it into parts?


unembargo.go, line 118 at r3 (raw file):

              result := unembargoService.Objects.Delete(destBucket, oneItem.Name).Do()
              if result != nil {
                  fmt.Printf("Objects deletion from public bucket failed: %v\n", err)

What err value does this belong to?


unembargo.go, line 125 at r3 (raw file):

              // Insert the object into destination bucket.
              object := &storage.Object{Name: oneItem.Name}
              _, err := unembargoService.Objects.Insert(destBucket, object).Media(fileContent.Body).Do()

Is ObjectsService.Copy an option rather than Download & Insert? https://godoc.org/google.golang.org/api/storage/v1#ObjectsService.Copy


unembargo.go, line 148 at r3 (raw file):


// The input date is integer in format yyyymmdd
func Unembargo(date int) bool {

What will happen if I pass this function a value of zero?


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions.


unembargo.go, line 44 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Please use `log` for reporting errors (or Greg found `glog` recently https://github.com/golang/glog which looks even better).

Done.


unembargo.go, line 61 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Since we want an `int` in the end, what do you think of using the `time.Year`, `time.Month`, and `time.Day` functions directly rather than converting to / from ascii strings?

It is possible, but won't make code more simply:

time.Year 10000 + time.Month 100 + time.Day


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions.


unembargo.go, line 69 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Please return actual `error` values. Returning an actual `err` value gives the caller more flexibility. If all they want is success / fail, then the caller can check `err != nil`. And, if the caller wants to know more they have the option of inspecting the error further.

Done.


unembargo.go, line 81 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Is this condition necessary?

Given there are < 1000 files for sidestream everyday, we do not need this.


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions.


unembargo.go, line 118 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
What `err` value does this belong to?

Done.


unembargo.go, line 148 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
What will happen if I pass this function a value of zero?

I will add a TODO to it.


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 5 unresolved discussions.


unembargo.go, line 101 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
This function is pretty long. Is there a way to divide it into parts?

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 5 unresolved discussions.


unembargo.go, line 125 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Is `ObjectsService.Copy` an option rather than `Download` & `Insert`? https://godoc.org/google.golang.org/api/storage/v1#ObjectsService.Copy

I tried:

if oneItem.Name != "" { object := &storage.Object{Name: oneItem.Name} _, err := unembargoService.Objects.Copy(sourceBucket, oneItem.Name, destBucket, oneItem.Name, object).Do() if err != nil { log.Printf("Objects copy failed: %v\n", err) return err } }

But keep get errors like:

Objects copy failed: googleapi: Error 400: Required, required

Not sure the exact reason, but I will stick with Download & Insert ways.


Comments from Reviewable

stephen-soltesz commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


unembargo.go, line 61 at r3 (raw file):

Previously, yachang wrote…
It is possible, but won't make code more simply: time.Year * 10000 + time.Month * 100 + time.Day

The time.Year * 10000 + time.Month * 100 + time.Day expression is a single line and does not require converting from int to string back to int. Please use ints to calculate the cutoff_date.


unembargo.go, line 81 at r3 (raw file):

Previously, yachang wrote…
Given there are < 1000 files for sidestream everyday, we do not need this.

Setting the PageToken when the initial value is empty should be a no-op. So, I believe this check is not necessary and removing the extra condition will simplify the logic. Please remove the condition unless this breaks functionality.


unembargo.go, line 125 at r3 (raw file):

Previously, yachang wrote…
I tried: if oneItem.Name != "" { object := &storage.Object{Name: oneItem.Name} _, err := unembargoService.Objects.Copy(sourceBucket, oneItem.Name, destBucket, oneItem.Name, object).Do() if err != nil { log.Printf("Objects copy failed: %v\n", err) return err } } But keep get errors like: Objects copy failed: googleapi: Error 400: Required, required Not sure the exact reason, but I will stick with Download & Insert ways.

Hmm, that's too bad. The 400 error gives hope that it may be an issue that we can figure out. A Copy seems much more preferable to a Download and Upload, for performance and IO reasons.

Is there time to investigate? If not, please create a github issue to investigate this. (We will be running this code continuously in the future right?)


unembargo.go, line 148 at r3 (raw file):

Previously, yachang wrote…
I will add a TODO to it.

Please associate a name with the TODO.


unembargo_test.go, line 31 at r4 (raw file):

  UploadFile(privateBucket, "testdata/20160102T000000Z-mlab3-sin01-sidestream-0000.tgz", "sidestream/2016/01/02/")
  DeleteFiles(publicBucket, "")
  if Unembargo(20160102) == nil {

Rather than a positive condition here, use a negative condition. This way if it fails, the corresponding error message is right next to the condition that failed. And, the logic below does not need to be indented as far.


unembargo_test.go, line 33 at r4 (raw file):

  if Unembargo(20160102) == nil {
      // Check the privateBucket does not have that file any more
      fileNames := GetFileNamesFromBucket(privateBucket)

Rather than fileNames and fileNames2, what about privateNames and publicNames?


unembargo_test.go, line 34 at r4 (raw file):

      // Check the privateBucket does not have that file any more
      fileNames := GetFileNamesFromBucket(privateBucket)
      for _, fileName := range fileNames {

At this point, the private bucket should contain no files. And, at most could contain only one file.

It would be more concise to check the fileNames map for the explicit name expected. e.g.

if _, ok := fileNames["sidestream/2016/01/02/20160102T000000Z-mlab3-sin01-sidestream-0000.tgz"]; !ok {
    t.Errorf("The embargoed copy should be deleted after the process.\n")
}

Or, simply asserting that the length of len(fileNames) != 0 should be sufficient.


unembargo_test.go, line 41 at r4 (raw file):

      // Check the publicBucket has that file
      fileNames2 := GetFileNamesFromBucket(publicBucket)
      for _, fileName2 := range fileNames2 {

Here too; there should be exactly one file, and we know exactly what it's name should be.

Also, I suggest using a negative condition, so the error is right next to the condition.

It's unusual to return from a test function before the end of the function. A successful test should fall through to the end.


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


unembargo.go, line 61 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
The `time.Year * 10000 + time.Month * 100 + time.Day` expression is a single line and does not require converting from int to string back to int. Please use `int`s to calculate the `cutoff_date`.

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


unembargo.go, line 81 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Setting the `PageToken` when the initial value is empty should be a no-op. So, I believe this check is not necessary and removing the extra condition will simplify the logic. Please remove the condition unless this breaks functionality.

Done.


unembargo.go, line 125 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Hmm, that's too bad. The 400 error gives hope that it may be an issue that we can figure out. A `Copy` seems much more preferable to a Download and Upload, for performance and IO reasons. Is there time to investigate? If not, please create a github issue to investigate this. (We will be running this code continuously in the future right?)

Add a TODO for it.


Comments from Reviewable

yachang commented 7 years ago

Review status: 3 of 5 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


unembargo_test.go, line 31 at r4 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Rather than a positive condition here, use a negative condition. This way if it fails, the corresponding error message is right next to the condition that failed. And, the logic below does not need to be indented as far.

Done.


unembargo_test.go, line 33 at r4 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Rather than `fileNames` and `fileNames2`, what about `privateNames` and `publicNames`?

Done.


unembargo_test.go, line 34 at r4 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
At this point, the private bucket should contain no files. And, at most could contain only one file. It would be more concise to check the `fileNames` map for the explicit name expected. e.g. ``` if _, ok := fileNames["sidestream/2016/01/02/20160102T000000Z-mlab3-sin01-sidestream-0000.tgz"]; !ok { t.Errorf("The embargoed copy should be deleted after the process.\n") } ``` Or, simply asserting that the length of `len(fileNames) != 0` should be sufficient.

Done.


unembargo_test.go, line 41 at r4 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Here too; there should be exactly one file, and we know exactly what it's name should be. Also, I suggest using a negative condition, so the error is right next to the condition. It's unusual to return from a test function before the end of the function. A successful test should fall through to the end.

Done.


Comments from Reviewable

stephen-soltesz commented 7 years ago

Hi, Ya,

I see that there are new files in testdata, but none of the current tests use them. Is that expected?

Best, Stephen


Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


unembargo.go, line 81 at r3 (raw file):

Previously, yachang wrote…
Done.

I think there has been a misunderstanding.

Using NextPageToken to iterate through all files is necessary anytime we call List, since we may have days with more than 1000 files, or for special test cases, or maybe GCS changes how many entries they return per request, right?

The "condition" I meant was, rather than:

if destPageToken != "" {
    destinationFiles.PageToken(destPageToken)
}

It should be sufficient to say (without the if):

destinationFiles.PageToken(destPageToken)

unembargo.go, line 125 at r3 (raw file):

Previously, yachang wrote…
Add a TODO for it.

Who is assigned to the TODO?

Will we be running this code continuously in the future? The efficiency of a Copy will be much higher than a Download and Upload.


unembargo.go, line 148 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Please associate a name with the TODO.

Who is assigned to the TODO?


unembargo.go, line 55 at r6 (raw file):

var (
  privateBucket = ""
  publicBucket  = ""

To make the unembargo interface more "object oriented" and rather than using global variables to pass parameters to the Unembargo function, what do you think of defining a simple struct to hold these values and defining Unembargo as a receiver on that struct type?

Client code might look like:

   config := embargo.NewConfig(privateBucketName, publicBucketName)
   config.Unembargo(date)

unembargo.go, line 62 at r6 (raw file):

func CheckWhetherUnembargo(date int) bool {
  current_time := time.Now()
  cutoff_date := (int(current_time.Year())-1)*10000 + int(current_time.Month())*100 + int(current_time.Day())

Please only use type casts when necessary. Like this: https://play.golang.org/p/W1dPo1IBvX

Also, I'm sorry to have missed it before; please use use mixed case variable names rather than underscores. This is the convention in Go https://golang.org/doc/effective_go.html#mixed-caps


unembargo.go, line 108 at r6 (raw file):

      sourceFiles := unembargoService.Objects.List(sourceBucket)
      sourceFiles.Prefix(prefixFileName)
      if pageToken != "" {

Here too, if pageToken != "" { should not be needed.


unembargo_test.go, line 24 at r6 (raw file):

// This end to end test require anthentication.
func TestUnembargoLegacy(t *testing.T) {
  privateBucket = "mlab-embargoed-data"

Since these are real buckets, I think it would be helpful and maybe safer if they included strings that made it clear that they're for "unit test" data and not production data.

These names look pretty generic and I can easily imagine them being confused for buckets for real data.


Comments from Reviewable

yachang commented 7 years ago

Which file? I see all test files used at least once.


Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


Comments from Reviewable

yachang commented 7 years ago

Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


unembargo.go, line 148 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Who is assigned to the TODO?

Done.


unembargo_test.go, line 24 at r6 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Since these are real buckets, I think it would be helpful and maybe safer if they included strings that made it clear that they're for "unit test" data and not production data. These names look pretty generic and I can easily imagine them being confused for buckets for real data.

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


unembargo.go, line 108 at r6 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Here too, `if pageToken != "" {` should not be needed.

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


unembargo.go, line 81 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
I think there has been a misunderstanding. Using `NextPageToken` to iterate through all files is necessary anytime we call `List`, since we may have days with more than 1000 files, or for special test cases, or maybe GCS changes how many entries they return per request, right? The "condition" I meant was, rather than: ``` if destPageToken != "" { destinationFiles.PageToken(destPageToken) } ``` It should be sufficient to say (without the `if`): ``` destinationFiles.PageToken(destPageToken) ```

Done.


unembargo.go, line 62 at r6 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Please only use type casts when necessary. Like this: https://play.golang.org/p/W1dPo1IBvX Also, I'm sorry to have missed it before; please use use mixed case variable names rather than underscores. This is the convention in Go https://golang.org/doc/effective_go.html#mixed-caps

If I do not use type casts, there will be errors like:

mismatched types int and time.Month


Comments from Reviewable

yachang commented 7 years ago

This PR has been messed up by rebase. I will address all comments in a new branch. Then close this one.


Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


Comments from Reviewable

yachang commented 7 years ago

Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


unembargo.go, line 55 at r6 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
To make the unembargo interface more "object oriented" and rather than using global variables to pass parameters to the `Unembargo` function, what do you think of defining a simple struct to hold these values and defining `Unembargo` as a receiver on that struct type? Client code might look like: ``` config := embargo.NewConfig(privateBucketName, publicBucketName) config.Unembargo(date) ```

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 1 of 13 files reviewed at latest revision, 7 unresolved discussions.


unembargo.go, line 125 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…
Who is assigned to the TODO? Will we be running this code continuously in the future? The efficiency of a Copy will be much higher than a Download and Upload.

Done.


Comments from Reviewable