googledatalab / pydatalab

Google Datalab Library

Simplify loading DataFrames from GCS #471

Open nikhilk opened 7 years ago

nikhilk commented 7 years ago

Current (as a 2-step process):

%storage read --object "gs://bucket/path/to/csv" --variable temp_str
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(temp_str))

Proposed (one step, and it avoids creating a copy of the data in a temporary string):

%storage read --object "gs://bucket/path/to/csv" --dataframe df
parthea commented 7 years ago

Would the file format of the GCS object also be required, for example CSV, or would it be inferred from the object's file extension? Could you also specify a delimiter as an option?

brandondutra commented 7 years ago

This can already be done in one step, without a temporary string and in one cell, and it's faster than the two-step way.

The TF way:

%%time
with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df2 = pd.read_csv(f)
CPU times: user 5.41 s, sys: 789 ms, total: 6.2 s
Wall time: 7.06 s
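For completeness, the file_io snippets in this comment assume imports along these lines (file_io ships inside TensorFlow's python.lib.io package in TF 1.x):

from tensorflow.python.lib.io import file_io
import pandas as pd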

The current 2-step way:

%%time
%storage read --object "gs://bucket/file.csv" --variable temp_str
CPU times: user 563 ms, sys: 600 ms, total: 1.16 s
Wall time: 2.68 s
%%time
df1 = pd.read_csv(StringIO(temp_str))
CPU times: user 4.49 s, sys: 404 ms, total: 4.89 s
Wall time: 5.1 s

The two wall times (2.68 s + 5.1 s = 7.78 s) add up to more than the TF way's wall time (7.06 s). The file was made with

with file_io.FileIO('gs://bucket/file.csv', 'w') as f:
  for i in range(10**7):
    f.write('hello,world,abcdefghijklm,nopqrstuv\n')

and totals 343.32 MiB (10**7 rows of 36 bytes each). So maybe this is a small-to-medium-sized file.

Instead of adding more complicated things to %storage, what about documenting TF + gsutil usage? Yes, it is a little more complicated, but it also works outside of Datalab.
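As a rough sketch of what the gsutil route could look like from a notebook (the bucket and paths are placeholders, and gsutil is assumed to be installed and authenticated), copy the object locally and read it with pandas:

!gsutil cp gs://bucket/file.csv /tmp/file.csv
df = pd.read_csv('/tmp/file.csv')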

nikhilk commented 7 years ago

@parthea yes, there would presumably be some options required (hopefully everything can have defaults and/or be convention-based, such as inferring the format from the file extension) that could be specified as YAML config in the cell body.
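Purely as an illustration (the flag and key names here are hypothetical, not a settled design), such a cell might look like:

%%storage read --object "gs://bucket/path/to/data.csv" --dataframe df
format: csv
delimiter: ','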

@brandondutra Yes, it makes sense that reading from a stream will be faster, and your experiment suggests we should do better. However, I am personally not sure having someone trying to use BigQuery + csv + pandas should also have to learn TensorFlow (also, those particular file APIs are barely even documented). We should add streaming APIs in pydatalab so as to cover the programmatic use-cases as well.