googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

question about 'Evaluation and Batch Prediction' notebook in census tutorial #1191

Closed jiajiayb closed 7 years ago

jiajiayb commented 7 years ago

Hello! When I tried the 'Evaluation and Batch Prediction' notebook in the census tutorial, I ran into an error and have no idea why it is happening. The script I am using is exactly the same as the one in the tutorial:

```python
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.analysis as analysis
import google.cloud.ml.io as io
import json
import os

def extract_values((example, prediction)):
    import tensorflow as tf
    tf_example = tf.train.Example()
    tf_example.ParseFromString(example.values()[0])
    feature_map = tf_example.features.feature
    values = {'target': feature_map['target'].float_list.value[0]}
    values.update(prediction)
    return values

OUTPUT_DIR = '/content/datalab/tmp/ml/census/evaluate'
pipeline = beam.Pipeline('DirectPipelineRunner')

eval_features = (pipeline | 'ReadEval' >> io.LoadFeatures('/content/datalab/tmp/ml/census/preprocessed/features_eval*'))
trained_model = pipeline | 'LoadModel' >> io.LoadModel('/content/datalab/tmp/ml/census/model/model')
evaluations = (eval_features | 'Evaluate' >> ml.Evaluate(trained_model) |
               beam.Map('ExtractEvaluationResults', extract_values))
eval_data_sink = beam.io.TextFileSink(os.path.join(OUTPUT_DIR, 'eval'), shard_name_template='')
evaluations | beam.io.textio.WriteToText(os.path.join(OUTPUT_DIR, 'eval'), shard_name_template='')

pipeline.run()
```

The error I get is as follows:

```
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
     36 # evaluation
     37 
---> 38 eval_features = (pipeline | 'ReadEval' >> io.LoadFeatures('/content/datalab/tmp/ml/census/preprocessed/features_eval*'))
     39 trained_model = pipeline | 'LoadModel' >> io.LoadModel('/content/datalab/tmp/ml/census/model/model')
     40 evaluations = (eval_features | 'Evaluate' >> ml.Evaluate(trained_model) |

/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/ptransform.pyc in __ror__(self, pvalueish)
    727 
    728   def __ror__(self, pvalueish):
--> 729     return self.transform.__ror__(pvalueish, self.label)
    730 
    731   def apply(self, pvalue):

/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/ptransform.pyc in __ror__(self, left, label)
    435     pvalueish = _SetInputPValues().visit(pvalueish, replacements)
    436     self.pipeline = p
--> 437     result = p.apply(self, pvalueish, label)
    438     if deferred:
    439       return result

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    207       try:
    208         old_label, transform.label = transform.label, label
--> 209         return self.apply(transform, pvalueish)
    210       finally:
    211         transform.label = old_label

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    243       transform.type_check_inputs(pvalueish)
    244 
--> 245     pvalueish_result = self.runner.apply(transform, pvalueish)
    246 
    247     if type_options is not None and type_options.pipeline_type_check:

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply(self, transform, input)
    145     m = getattr(self, 'apply_%s' % cls.__name__, None)
    146     if m:
--> 147       return m(transform, input)
    148     raise NotImplementedError(
    149         'Execution of [%s] not implemented in runner %s.' % (transform, self))

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply_PTransform(self, transform, input)
    151   def apply_PTransform(self, transform, input):
    152     # The base case of apply is to call the transform's apply.
--> 153     return transform.apply(input)
    154 
    155   def run_transform(self, transform_node):

/usr/local/lib/python2.7/dist-packages/google/cloud/ml/io/transforms.pyc in apply(self, pvalue)
    148         file_pattern=self._file_pattern,
    149         coder=mlcoders.ExampleProtoCoder(),
--> 150         compression_type=self._compression_type))
    151 
    152 

/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/ptransform.pyc in __ror__(self, left, label)
    435     pvalueish = _SetInputPValues().visit(pvalueish, replacements)
    436     self.pipeline = p
--> 437     result = p.apply(self, pvalueish, label)
    438     if deferred:
    439       return result

/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.pyc in apply(self, transform, pvalueish, label)
    243       transform.type_check_inputs(pvalueish)
    244 
--> 245     pvalueish_result = self.runner.apply(transform, pvalueish)
    246 
    247     if type_options is not None and type_options.pipeline_type_check:

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply(self, transform, input)
    145     m = getattr(self, 'apply_%s' % cls.__name__, None)
    146     if m:
--> 147       return m(transform, input)
    148     raise NotImplementedError(
    149         'Execution of [%s] not implemented in runner %s.' % (transform, self))

/usr/local/lib/python2.7/dist-packages/apache_beam/runners/runner.pyc in apply_PTransform(self, transform, input)
    151   def apply_PTransform(self, transform, input):
    152     # The base case of apply is to call the transform's apply.
--> 153     return transform.apply(input)
    154 
    155   def run_transform(self, transform_node):

/usr/local/lib/python2.7/dist-packages/google/cloud/ml/dataflow/io/tfrecordio.pyc in apply(self, pvalue)
    164 
    165   def apply(self, pvalue):
--> 166     return pvalue.pipeline | beam.Read(_TFRecordSource(*self._args))
    167 
    168 

/usr/local/lib/python2.7/dist-packages/google/cloud/ml/dataflow/io/tfrecordio.pyc in __init__(self, file_pattern, coder, compression_type)
    115         file_pattern=file_pattern,
    116         compression_type=compression_type,
--> 117         splittable=False)
    118     self._coder = coder
    119 

/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.pyc in __init__(self, file_pattern, min_bundle_size, compression_type, splittable)
     74 
     75     if compression_type == fileio.CompressionTypes.AUTO:
---> 76       raise ValueError('FileBasedSource currently does not support '
     77                        'CompressionTypes.AUTO. Please explicitly specify the '
     78                        'compression type or use '

ValueError: FileBasedSource currently does not support CompressionTypes.AUTO. Please explicitly specify the compression type or use CompressionTypes.UNCOMPRESSED if file is uncompressed.
```

Could you please help me take a look at what might be causing this error? Thanks a lot!
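
For what it's worth, the traceback suggests LoadFeatures forwards a compression type into FileBasedSource, so if the transform accepts such an argument (this is only a guess from the traceback, not verified against the SDK), passing an explicit value might sidestep the AUTO default:

```python
# Hypothetical workaround (unverified): pass an explicit compression type
# instead of relying on the default, which appears to be CompressionTypes.AUTO.
from apache_beam.io import fileio
import google.cloud.ml.io as io

eval_features = (pipeline | 'ReadEval' >> io.LoadFeatures(
    '/content/datalab/tmp/ml/census/preprocessed/features_eval*',
    # 'compression_type' is an assumed parameter name, inferred from the
    # traceback (transforms.pyc passes self._compression_type through).
    compression_type=fileio.CompressionTypes.UNCOMPRESSED))
```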

jiajiayb commented 7 years ago

I suspect this 'ValueError' is caused by the older version of Apache Beam that ships with Datalab. Any suggestion on how to check the apache_beam module version in Datalab? And is there any guide on how to upgrade apache_beam in Datalab? I would really appreciate any comment on how to debug this error. Thanks!

jiajiayb commented 7 years ago

When I run 'pip freeze' in a Datalab notebook, it indicates the google-cloud-dataflow version is 0.4.2, which I believe is an older version of Apache Beam that can cause this 'ValueError'. Any suggestion on how to upgrade google-cloud-dataflow without disturbing the other Python package settings in Datalab? Or please point out if you think my guess about this 'ValueError' is wrong. Thanks a lot!
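
For reference, this is how I checked the installed version from a notebook cell:

```python
# Shell command in a notebook cell: list the Dataflow/Beam packages.
!pip freeze | grep -i -E 'dataflow|beam'

# Or inspect the module directly (if this build exposes __version__).
import apache_beam
print(apache_beam.__version__)
```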

jiajiayb commented 7 years ago

I tried '!pip install google-cloud-dataflow --upgrade' and '!pip install --upgrade --force-reinstall https://storage.googleapis.com/cloud-ml/sdk/cloudml.latest.tar.gz' in the currently running VM, but the same error pops up when I run the same code. Any suggestion on how to resolve this? I am pretty new to Datalab and Apache Beam, so please bear with me if this is a very basic question. Thanks a lot!

jiajiayb commented 7 years ago

Hi @qimingj, I am sorry to bother you, but I realized the machine learning tutorials were removed from the most recent build. May I ask whether an updated machine learning tutorial will be online soon? Thanks a lot!

qimingj commented 7 years ago

Sorry for the late notice! We removed the previous machine learning notebooks because new ones are coming with new features. The old notebooks no longer work with the latest TensorFlow version.

jiajiayb commented 7 years ago

@qimingj Thank you so much for your response. I think the issue in this thread was caused by running the old notebook on the new build. May I ask when we can expect the new release of the machine learning tutorials? Thanks a lot!

qimingj commented 7 years ago

It will be very soon. :)

jiajiayb commented 7 years ago

Great! Thanks a lot! Have a good one :)

chmeyers commented 7 years ago

The new notebooks have been released now as part of the Datalab GA release.

jiajiayb commented 7 years ago

@chmeyers Great! Thank you so much for letting me know! I will have a try. Have a nice weekend!

jiajiayb commented 7 years ago

@chmeyers Hi, I am sorry to bother you. I was exploring the ML Toolbox and TensorFlow folders in the most recent Datalab GA release today. I noticed there is a time-series tutorial implemented with TensorFlow, but it is a little hard for me to apply it to my own problem. I wonder whether there is any chance of getting tutorial documents that apply TensorFlow to the census or iris data using the service (not only running locally) in the near future. Thanks a lot!

qimingj commented 7 years ago

Census and Iris samples are under the "samples/ML Toolbox" directory. They both use the "structured data" solution package, which is implemented with TensorFlow. See https://github.com/googledatalab/pydatalab/tree/master/solutionbox/structured_data.

jiajiayb commented 7 years ago

@qimingj Thanks a lot for your response. I see the census and iris examples under the "samples/ML Toolbox" directory only have 'local end to end' notebooks. I wonder whether I can find a 'service end to end' tutorial for the iris and census data. Thank you for your help :)

qimingj commented 7 years ago

For the census data we do have service ones: https://github.com/googledatalab/notebooks/tree/master/samples/ML%20Toolbox/Regression/Census. The service runs are split into 4 notebooks, one for each step. We don't have an Iris service notebook, but it should not be difficult to figure one out based on the local run notebook.

jiajiayb commented 7 years ago

@qimingj Great! I am trying the local iris one. One quick question: where can I find the documentation or manual describing the modules? For example, I would like to explore the parameters of 'mltoolbox.classification.dnn'. Thanks!

qimingj commented 7 years ago

You probably want to check the docstrings of the functions (preprocess, train, predict, batch_predict). Just type the function name followed by two question marks (such as dnn.train??) and execute it. Datalab should show you the help in the right pane, where you can find all the docstrings.
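
For example, in a notebook cell (assuming the module path you mentioned; adjust the import if your build differs):

```python
# Import the DNN classification module from the ML Toolbox.
import mltoolbox.classification.dnn as dnn

# Execute in a cell: Datalab shows the docstring (and source) in the help pane.
dnn.train??
```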

yebrahim commented 7 years ago

We should add the documentation for these modules under http://googledatalab.github.io/pydatalab/. I'll work on preparing this.

jiajiayb commented 7 years ago

@qimingj @yebrahim Great! Thank you so much for your response.

yebrahim commented 7 years ago

Updated. Please check out the new ML Toolbox section and let us know if any improvements are needed.

jiajiayb commented 7 years ago

@yebrahim This is very helpful! Thanks a lot!

qimingj commented 7 years ago

Thank you @yebrahim!

jiajiayb commented 7 years ago

@yebrahim @qimingj Sorry to bother you with another question. I am planning to use a convolutional neural network for a classification problem. I did not find a CNN being used in the iris example, or described in mltoolbox.classification.dnn. Could you please point me to a tutorial on using a CNN for a classification problem on a dataset such as iris in the Datalab environment?

qimingj commented 7 years ago

I am not sure a CNN is useful for the iris example. The structured data solution provided in Datalab does not include convolutional networks, but you can build one with TensorFlow. Check out the TensorFlow example:

https://www.tensorflow.org/tutorials/deep_cnn
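
For a rough idea of the shape of such a model, here is a minimal sketch with the TF 1.x layers API (illustrative only, not the linked tutorial's code):

```python
# A minimal convolutional classifier sketch (TF 1.x layers API).
import tensorflow as tf

def cnn_logits(images, num_classes):
    # images: a [batch, height, width, channels] float tensor.
    net = tf.layers.conv2d(images, filters=32, kernel_size=3, activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
    net = tf.layers.conv2d(net, filters=64, kernel_size=3, activation=tf.nn.relu)
    net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
    net = tf.layers.flatten(net)
    net = tf.layers.dense(net, units=128, activation=tf.nn.relu)
    return tf.layers.dense(net, units=num_classes)  # unscaled logits

# Placeholder shapes are illustrative (e.g., 28x28 grayscale images, 10 classes).
images = tf.placeholder(tf.float32, [None, 28, 28, 1])
labels = tf.placeholder(tf.int64, [None])
logits = cnn_logits(images, num_classes=10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```

Note that iris is tabular (four features per row), which is why a convolutional model, built for spatially structured input such as images, does not buy you much there.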