m-lab / etl-embargo

Apache License 2.0
0 stars 1 forks source link

Split #6

Closed yachang closed 7 years ago

yachang commented 7 years ago

Embargo cls, the third: do the tar file processing, split into 2 and save at different places.


This change is Reviewable

gfr10598 commented 7 years ago

Review status: 0 of 8 files reviewed at latest revision, 13 unresolved discussions.


embargo.go, line 44 at r4 (raw file):


// Split one tar files into 2 buffers.
func SplitFile(content io.Reader) (bool, bytes.Buffer, bytes.Buffer) {

Return an Error instead of a bool, and make it the last return value, consistent with other Go functions?


embargo.go, line 45 at r4 (raw file):

// Split one tar files into 2 buffers.
func SplitFile(content io.Reader) (bool, bytes.Buffer, bytes.Buffer) {
  var privateBuf bytes.Buffer

Maybe call this "embargoBuf"?


embargo.go, line 54 at r4 (raw file):

  }
  defer zipReader.Close()
  unzippedImage, err := ioutil.ReadAll(zipReader)

Maybe unzippedBytes ?


embargo.go, line 67 at r4 (raw file):

  publicTw := tar.NewWriter(publicGzw)

  // Handle one tar file

This comment is confusing. Is it handling one inner file from the tar file, or a top level tar file, or something else?


embargo.go, line 88 at r4 (raw file):

          // put this file to a private buffer
          if err := privateTw.WriteHeader(hdr); err != nil {
              fmt.Println(err)

This prints to the console. Is that what you want here? Or should it go to a log?


embargo.go, line 109 at r4 (raw file):

  if err := publicTw.Close(); err != nil {
      fmt.Println(err)
      return false, privateBuf, publicBuf

In general, I think it is much better to return the error, and decide what to do with it in the caller. This makes testing easier and more robust, and gives the caller more control.


embargo.go, line 140 at r4 (raw file):


// Embargo do embargo ckecking to all files in the sourceBucket.
func Embargo() bool {

I think we need a TODO here. If the embargo job crashes half way through, what should happen when it starts up again?


embargo_test.go, line 5 at r4 (raw file):

import (
  "bytes"
        "fmt"

Run gofmt?


embargo_test.go, line 12 at r4 (raw file):


/*
// End to end test, requires authentication.

Please add a clear TODO() here.


embargo_test.go, line 25 at r4 (raw file):

*/

func TestSplitTarFile(t *testing.T) {

I can figure out from context what is in the test files, but it would be better to state it explicitly in some combination of comment describing the test, and inline comments. If you define const strings for the filenames, the comments will be easy to read. E.g. "combined_file contains ... and ... . The test verifies that SplitFile correctly separates this into ... and ..., which ..."

An alternative (I think cleaner) way to test this would be to create a tar file on the fly, with lists of inner files, call SplitFile on it, then verify that the pub and private buffers contain the correct filenames.


embargo_test.go, line 39 at r4 (raw file):

      t.Error("Did not perform embargo ocrrectly.\n")
  }
  publicGolden, err := os.Open("testdata/20170315T000000Z-mlab3-sea03-sidestream-0000-p.tgz")

Does this need


embargo_test.go, line 44 at r4 (raw file):

  }
  defer publicGolden.Close()
  publicContent, err := ioutil.ReadAll(publicGolden)

Either check err value, or use _


embargo_test.go, line 56 at r4 (raw file):

  }
  defer privateGolden.Close()
  privateContent, err := ioutil.ReadAll(privateGolden)

Check err or use _


Comments from Reviewable

gfr10598 commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 21 unresolved discussions.


embargo.go, line 45 at r4 (raw file):

embargoBuf Oops. I meant s/privateBuf/embargoBuf/

However, the misunderstanding raises another question - which functions need to be exported, and which can be local to this file?


embargo.go, line 67 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
This comment is confusing. Is it handling one inner file from the tar file, or a top level tar file, or something else?

Still need a comment, it just should be clearer. 8-)


embargo.go, line 109 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
In general, I think it is much better to return the error, and decide what to do with it in the caller. This makes testing easier and more robust, and gives the caller more control.

Reminder - please "Done" each comment that you have completed.


embargo.go, line 10 at r5 (raw file):

  "io"
  "io/ioutil"
  "log"

FYI, I just discovered glog, which we will probably want to move to in the very near future, but this is fine for now.


embargo.go, line 27 at r5 (raw file):


// Write results to GCS.
func WriteResults(tarfileName string, service *storage.Service, privateBuf, publicBuf bytes.Buffer) bool {

Here and elsewhere, we should return err instead of bool to indicate whether something was successful.


embargo.go, line 30 at r5 (raw file):

  privateTarfileName := strings.Replace(tarfileName, ".tgz", "-e.tgz", -1)
  publicObject := &storage.Object{Name: tarfileName}
  privateObject := &storage.Object{Name: privateTarfileName}

s/privateObject/embargoObject/


embargo.go, line 62 at r5 (raw file):

  tarReader := tar.NewReader(unzippedReader)

  privateGzw := gzip.NewWriter(&privateBuf)

embargoGzw ?


embargo.go, line 64 at r5 (raw file):

  privateGzw := gzip.NewWriter(&privateBuf)
  publicGzw := gzip.NewWriter(&publicBuf)
  privateTw := tar.NewWriter(privateGzw)

embargoTw


embargo.go, line 125 at r5 (raw file):

}

// EmbargoOneTar process one tar file, split it to 2 files, the embargoed files

s/process/processes/ s/split/splits


embargo.go, line 127 at r5 (raw file):

// EmbargoOneTar process one tar file, split it to 2 files, the embargoed files
// will be saved in a private dir, and the unembargoed part will be save in a
// public dir.

s/dir/bucket/ ???


embargo.go, line 142 at r5 (raw file):

// TODO: handle midway crash. Since the source bucket is unchanged, if it failed
// in the middle, we just rerun it for that specific day.
func EmbargoOneDayData(date string) bool {

Prefer to return an error, not a bool.


embargo_test.go, line 12 at r5 (raw file):


// End to end test, requires authentication.
// TODO: enbale it on Travis.

typo.


gcs_operations.go, line 49 at r5 (raw file):

// Create a new bucket. Return true if it already exsits or is created successfully.
func CreateBucket(projectID string, bucketName string) bool {
  service := CreateService()

It is highly likely that creating a client and storage.New() are both heavyweight and perform requests to GCS backends. Unless you know otherwise, you should create an object that caches the client and the service, and reuses them for all GCS requests. Once.Do might be useful. https://golang.org/pkg/sync/#Once

I would cache the client and service in a Singleton object, then use these to get the buckets you need in

BTW, if you have some Go embargo code already running (e.g. to transform the old embargo data), you could pretty easily profile it to see how time is being spent. I would expect these to show up as places where the code is blocked waiting for the remote requests to complete. But I would just pro-actively change this to cache the client and service.

Since (IIUC) you are just calling CreateService once from embargo.go for each call to EmbargoOneDayData, I think that is fine for now. But you should add TODOs to each of these go_operations functions that is calling CreateService, as they will (likely) be quite inefficient.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 21 unresolved discussions.


embargo.go, line 10 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
FYI, I just discovered glog, which we will probably want to move to in the very near future, but this is fine for now.

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 20 unresolved discussions.


embargo.go, line 27 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Here and elsewhere, we should return err instead of bool to indicate whether something was successful.

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 20 unresolved discussions.


embargo.go, line 44 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Return an Error instead of a bool, and make it the last return value, consistent with other Go functions?

Done.


embargo.go, line 45 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
> embargoBuf Oops. I meant s/privateBuf/embargoBuf/ However, the misunderstanding raises another question - which functions need to be exported, and which can be local to this file?

I am not sure. Some use the gcs service defined in other files in this package.


embargo.go, line 30 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
s/privateObject/embargoObject/

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 21 unresolved discussions.


embargo.go, line 54 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Maybe unzippedBytes ?

Done.


embargo.go, line 67 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Still need a comment, it just should be clearer. 8-)

Done.


embargo.go, line 88 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
This prints to the console. Is that what you want here? Or should it go to a log?

Done.


embargo.go, line 109 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Reminder - please "Done" each comment that you have completed.

Done.


embargo.go, line 140 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
I think we need a TODO here. If the embargo job crashes half way through, what should happen when it starts up again?

Done.


embargo.go, line 62 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
embargoGzw ?

Done.


embargo.go, line 64 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
embargoTw

Done.


embargo.go, line 125 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
s/process/processes/ s/split/splits

Done.


embargo.go, line 127 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
s/dir/bucket/ ???

Done.


embargo.go, line 142 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Prefer to return an error, not a bool.

Done.


embargo_test.go, line 5 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Run gofmt?

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 20 unresolved discussions.


embargo_test.go, line 12 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Please add a clear TODO() here.

Done.


embargo_test.go, line 25 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
I can figure out from context what is in the test files, but it would be better to state it explicitly in some combination of comment describing the test, and inline comments. If you define const strings for the filenames, the comments will be easy to read. E.g. "combined_file contains ... and ... . The test verifies that SplitFile correctly separates this into ... and ..., which ..." An alternative (I think cleaner) way to test this would be to create a tar file on the fly, with lists of inner files, call SplitFile on it, then verify that the pub and private buffers contain the correct filenames.

Comments added. Add a TODO for the better way.


embargo_test.go, line 44 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Either check err value, or use _

Done.


embargo_test.go, line 56 at r4 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Check err or use _

Done.


embargo_test.go, line 12 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
typo.

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 20 unresolved discussions.


gcs_operations.go, line 49 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
It is highly likely that creating a client and storage.New() are both heavyweight and perform requests to GCS backends. Unless you know otherwise, you should create an object that caches the client and the service, and reuses them for all GCS requests. Once.Do might be useful. https://golang.org/pkg/sync/#Once I would cache the client and service in a Singleton object, then use these to get the buckets you need in BTW, if you have some Go embargo code already running (e.g. to transform the old embargo data), you could pretty easily profile it to see how time is being spent. I would expect these to show up as places where the code is blocked waiting for the remote requests to complete. But I would just pro-actively change this to cache the client and service. Since (IIUC) you are just calling CreateService once from embargo.go for each call to EmbargoOneDayData, I think that is fine for now. But you should add TODOs to each of these go_operations functions that is calling CreateService, as they will (likely) be quite inefficient.

Done.


Comments from Reviewable

gfr10598 commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 10 unresolved discussions.


embargo.go, line 46 at r6 (raw file):


// Split one tar files into 2 buffers.
func embargoBuf(content io.Reader) (bytes.Buffer, bytes.Buffer, error) {

SplitFile() was fine. Please change back?


embargo.go, line 133 at r6 (raw file):

// The private file will have a different name, so it can be copied to public
// bucket directly when it becomes one year old.
func EmbargoOneTar(content io.Reader, tarfileName string, service *storage.Service) bool {

return error ?


embargo.go, line 134 at r6 (raw file):

// bucket directly when it becomes one year old.
func EmbargoOneTar(content io.Reader, tarfileName string, service *storage.Service) bool {
  embargoBuf, publicBuf, err := embargoBuf(content)

Use the standard golang error idiom here:

if embargoBuf, publicBuf, err := embargoBuf(content); err != nil { return err } if WriteResults(...) != nil { return err }

return nil


embargo.go, line 155 at r6 (raw file):

  log.SetOutput(f)

  // TODO: Create service in a Singleton object, and reuses them for all GCS requests.

s/resuses/reuse/


embargo_check.go, line 98 at r6 (raw file):

  }
  fn := FileName{name: fileName}
  localIP := fn.GetLocalIP()

Consider having GetLocalIP return an error if the IP address is missing or unparseable.


embargo_test.go, line 26 at r6 (raw file):

}

// This test verified that func embargoBuf() correctly split the input tar

s/verified/verifies/

Make all test comments active - describing what the test does anytime you run it. Past tense implies that you ran the test once but might never run it again.


gcs_operations.go, line 49 at r5 (raw file):

Previously, yachang wrote…
Done.

Maybe some misunderstanding... embargo.go is fine now, but you should also add the same TODO for every function here that does service := CreateService().


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 10 unresolved discussions.


embargo.go, line 46 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
SplitFile() was fine. Please change back?

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 10 unresolved discussions.


embargo.go, line 133 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
return error ?

Done.


embargo.go, line 134 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Use the standard golang error idiom here: if embargoBuf, publicBuf, err := embargoBuf(content); err != nil { return err } if WriteResults(...) != nil { return err } return nil

If this way, embargoBuf, publicBuf will be local var, and cannot be used later.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 10 unresolved discussions.


embargo.go, line 155 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
s/resuses/reuse/

Done.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 10 unresolved discussions.


embargo_check.go, line 98 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Consider having GetLocalIP return an error if the IP address is missing or unparseable.

Currently it return empty string when something is wrong.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 10 unresolved discussions.


embargo_test.go, line 26 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
s/verified/verifies/ Make all test comments active - describing what the test does anytime you run it. Past tense implies that you ran the test once but might never run it again.

Done.


gcs_operations.go, line 49 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Maybe some misunderstanding... embargo.go is fine now, but you should also add the same TODO for every function here that does service := CreateService().

Done.


Comments from Reviewable

gfr10598 commented 7 years ago

embargo.go, line 134 at r6 (raw file):

Previously, yachang wrote…
If this way, embargoBuf, publicBuf will be local var, and cannot be used later.

Ah. In that case:

embargoBuf, publicBuf, err := embargoBuf(content) if err != nil { return err } if err = WriteResults(...); err != nil { return err }

return nil

The main thing is to prefer early return, on error, and reaching the end of the function should indicate success. https://blog.golang.org/error-handling-and-go


Comments from Reviewable

gfr10598 commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 4 unresolved discussions.


embargo.go, line 155 at r6 (raw file):

Previously, yachang wrote…
Done.

Still see "reuses"


embargo_check.go, line 98 at r6 (raw file):

Previously, yachang wrote…
Currently it return empty string when something is wrong.

Yes. An error is just more explicit and clear. Ok to leave as is though.


gcs_operations.go, line 49 at r5 (raw file):

Previously, yachang wrote…
Done.

Hmm. Not showing up in the PR. Did you commit it?


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 3 unresolved discussions.


embargo.go, line 134 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Ah. In that case: embargoBuf, publicBuf, err := embargoBuf(content) if err != nil { return err } if err = WriteResults(...); err != nil { return err } return nil The main thing is to prefer early return, on error, and reaching the end of the function should indicate success. https://blog.golang.org/error-handling-and-go

Done.


embargo.go, line 155 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Still see "reuses"

Done.


embargo_check.go, line 98 at r6 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Yes. An error is just more explicit and clear. Ok to leave as is though.

For old filename format w/o IP address, it is not an error, just we cannot get anything.


gcs_operations.go, line 49 at r5 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Hmm. Not showing up in the PR. Did you commit it?

Yes. In gcs_operations.go and embargo_check.go


Comments from Reviewable

gfr10598 commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion.


embargo.go, line 185 at r7 (raw file):

          return err
      }
      result := EmbargoOneTar(fileContent.Body, oneItem.Name, embargoService)

s/result/err/


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 9 files reviewed at latest revision, 1 unresolved discussion.


embargo.go, line 185 at r7 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
s/result/err/

Done.


Comments from Reviewable

gfr10598 commented 7 years ago

LGTM once you resolve new comments.

A high level comment for future - think about whether the functions you are writing would fit logically into one or two "classes" - i.e. structs to hold the state with functions operating on the structs. I find it very helpful to start by writing all the function signatures, then group them, and see if there is some implicit or explicit state that would form a coherence class. Then fill in the function bodies after you see how everything fits together.


Review status: 0 of 9 files reviewed at latest revision, 3 unresolved discussions.


embargo.go, line 71 at r8 (raw file):

  // Handle the small files inside one tar file.
  for {
      header, err := tarReader.Next()

Probably should be checking header.Typeflag to ensure it is a normal file.


embargo.go, line 80 at r8 (raw file):

      }
      basename := filepath.Base(header.Name)
      info := header.FileInfo()

Could some of these assignments be done through tar.FileInfoHeader()? Or through struct assignment?


Comments from Reviewable

gfr10598 commented 7 years ago

One more comment - please add a TODO in the travis file to set up coverage, and add badges to the readme file.

yachang commented 7 years ago

travis TODO added, and README updated.


Review status: 0 of 9 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


embargo.go, line 71 at r8 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Probably should be checking header.Typeflag to ensure it is a normal file.

Done.


embargo.go, line 80 at r8 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Could some of these assignments be done through tar.FileInfoHeader()? Or through struct assignment?

Yes, we can construct hdr in fly during using like this:

&tar.Header{ Name: header.Name, Mode:int64(info.Mode()), ModTime: info.ModTime(), Size: info.Size(), }

, but hdr will be used twice below, so I would prefer to construct it earlier to avoid duplication.


Comments from Reviewable

gfr10598 commented 7 years ago

embargo.go, line 87 at r8 (raw file):

      hdr.ModTime = info.ModTime()
      output, err := ioutil.ReadAll(tarReader)
      if strings.Contains(basename, "web100") && embargoCheck.ShouldEmbargo(basename) {

Incorporate the "web100" check within the ShouldEmbargo function.


Comments from Reviewable

gfr10598 commented 7 years ago

embargo.go, line 86 at r8 (raw file):

      hdr.Mode = int64(info.Mode())
      hdr.ModTime = info.ModTime()
      output, err := ioutil.ReadAll(tarReader)

This read causes a memory allocation and copy, which is wasteful if you then don't use it. So I would suggest you move this after the ShouldEmbargo check.


Comments from Reviewable

gfr10598 commented 7 years ago

embargo.go, line 86 at r8 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
This read causes a memory allocation and copy, which is wasteful if you then don't use it. So I would suggest you move this after the ShouldEmbargo check.

D'oh. Never mind.


Comments from Reviewable

yachang commented 7 years ago

Review status: 0 of 10 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


embargo.go, line 87 at r8 (raw file):

Previously, gfr10598 (Gregory Russell) wrote…
Incorporate the "web100" check within the ShouldEmbargo function.

Done.


Comments from Reviewable

gfr10598 commented 7 years ago

embargo.go, line 80 at r8 (raw file):

Previously, yachang wrote…
Yes, we can construct hdr in fly during using like this: &tar.Header{ Name: header.Name, Mode:int64(info.Mode()), ModTime: info.ModTime(), Size: info.Size(), } , but hdr will be used twice below, so I would prefer to construct it earlier to avoid duplication.

Discussed offline.


Comments from Reviewable

gfr10598 commented 7 years ago
:lgtm:

Comments from Reviewable