googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

google.cloud.bigquery version when running on local runner / dataflow runner #2041

Closed: OrielResearchCure closed this issue 6 years ago

OrielResearchCure commented 6 years ago

Hi all,

I am running a Python pipeline that uses the google.cloud.bigquery library. On the local runner everything runs great; bigquery.version is 0.28.0.

On the Dataflow runner, bigquery.version is 0.23.0, and there are many API changes between these versions.

What is the best way to change the installed version on the workers? I was assuming that the workers have all of the master machine's libraries installed when the execution is done from Datalab; is that true? I am not generating any requirements.txt, and the execution is done through the run button in the Datalab UI.

Please help me solve this issue. Thanks, eilalan

yebrahim commented 6 years ago

Workers do not have all of the master's libraries. You'll need to specify the dependencies explicitly for the workers. See: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/.

Basically, requirements.txt is one way. The other two ways are setup.py and specifying extra packages in the pipeline options (such as https://github.com/googledatalab/pydatalab/blob/master/solutionbox/image_classification/mltoolbox/image/classification/_cloud.py#L72).
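To illustrate the requirements.txt route from a notebook cell, here is a minimal sketch; the file name, its location next to the notebook, and its contents are assumptions for the example, not something from this thread:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# Stage the packages listed in the file to the workers. Assumes a local
# requirements.txt, e.g. containing a pinned google-cloud-bigquery line.
options.view_as(SetupOptions).requirements_file = './requirements.txt'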

OrielResearchCure commented 6 years ago

Hi yebrahim,

Thank you for the quick response. My pipeline looks something like the following. Where should the call to setup.py be added? When running from the command line, --setup_file /path/to/setup.py is used; where does the call to setup go when running from a Datalab cell?

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'projectName'
google_cloud_options.job_name = 'label'
google_cloud_options.staging_location = 'gs://staging'
google_cloud_options.temp_location = 'gs://temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

p = beam.Pipeline(options=options)
(p
 | "Extract the rows from dataframe" >> beam.io.Read(beam.io.BigQuerySource('table'))
 | "create more columns" >> beam.ParDo(CreateColForSampleFn())
 | 'writing to TSV files' >> beam.io.WriteToText('outputPath', file_name_suffix='.tsv'))

p.run().wait_until_finish()

Many thanks, eilalan

OrielResearchCure commented 6 years ago

Sorry, just saw the options way. Exactly what I was looking for. Many thanks!!!!

yebrahim commented 6 years ago

Sure thing. Closing this issue.

OrielResearchCure commented 6 years ago

Just a few more clarifications: should it work with the Google Cloud options? The following code fired an attribute error:

google_cloud_options.extra_package = 'google-cloud-bigquery'

The error was: 'GoogleCloudOptions' object has no attribute 'extra_package'. It seems that the attribute extra_package does not exist there. Is there another attribute to work with?

Or should I switch the code to the following, as in the example? And is google-cloud-bigquery enough, or should I provide any additional path information?

options = {
    'staging_location': os.path.join(output_dir, 'tmp', 'staging'),
    'temp_location': os.path.join(output_dir, 'tmp'),
    'job_name': job_name,
    'project': _util.default_project(),
    'extra_packages': local_packages,
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True
}
if pipeline_option is not None:
    options.update(pipeline_option)

qimingj commented 6 years ago

Try casting it to SetupOptions:

from apache_beam.options.pipeline_options import SetupOptions

options = PipelineOptions()
setup_options = options.view_as(SetupOptions)
setup_options.extra_packages = ['gs://bucket/1.tar.gz']

But in your case you want a standard package available from PyPI, so you should use setup.py:

setup_options.setup_file = 'your_setup_py_local_or_gcs_path'

And in your setup.py you can list google-cloud-bigquery as a dependency with its version.
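For example, a minimal sketch of such a setup.py; the package name and version below are placeholders, with only the google-cloud-bigquery pin taken from this thread:

from setuptools import setup, find_packages

setup(
    name='pipeline_deps',  # hypothetical name for the workers' dependency package
    version='0.0.1',
    packages=find_packages(),
    install_requires=[
        # Pin the version the pipeline code was written against.
        'google-cloud-bigquery==0.28.0',
    ],
)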

OrielResearchCure commented 6 years ago

Thank you. I am getting an error, and I don't understand how it is related to the addition of the setup.py. I have updated the pipeline options to the following:

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = 'label--1'
google_cloud_options.staging_location = 'gs://dir/staging'
google_cloud_options.temp_location = 'gs://dir/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'  # 'DataflowRunner' or 'DirectRunner'
options.view_as(SetupOptions).setup_file = "./setup.py"

When I run it with DirectRunner, everything is fine. When I run it with DataflowRunner, the following error is fired in the worker startup log in Stackdriver:

D Debug: validating 0:workflow.tar.gz [#0]
F Failed to install packages: failed to install workflow: exit status 1

The setup.py code is a copy of the apache-beam setup with the following ending:

REQUIRED_PACKAGES = [
    'google-cloud-bigquery==0.28.0',
]

setuptools.setup(
    name='orielresearch',
    version='0.0.1',
    description='oriel research set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        # Command class instantiated and run during pip install scenarios.
        'build': build,
        'CustomCommands': CustomCommands,
    }
)

What am I missing???

Thank you, eilalan

qimingj commented 6 years ago

The following setup.py works for me:

from setuptools import setup, find_packages

setup(
  name='trainer',
  version='1.0.0',
  packages=find_packages(),
  keywords=[
  ],
  license="Apache Software License",
  install_requires=[
    'tensorflow==1.7.0',
  ],
  package_data={
  },
  data_files=[],
)

OrielResearchCure commented 6 years ago

I like this one, much simpler. I got the same installation-failure error at worker startup. I am trying to add more package installations in case there are any version conflicts:

install_requires=[
    'google-cloud==0.32.0',
    'google-cloud-bigquery==0.28.0',
    'google-cloud-core==0.28.1',
    'google-cloud-dataflow==2.0.0',
],

I will update if it works. Please let me know if you have any other ideas.

Thanks, Eila

OrielResearchCure commented 6 years ago

I was able to upgrade to 'google-cloud-bigquery==0.25.0', so the setup.py is fine. Thank you for the help! There is still an issue with upgrading to 'google-cloud-bigquery==0.28.0': it has a different API than the older ones, and that is the one I would like to use. Let me know if you have an idea. In any case, I will update.
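For context, a sketch of one of the breaking changes between those releases as I understand it; the query string is a placeholder, and each half assumes the library version named in its comment is the one installed:

import uuid
from google.cloud import bigquery

client = bigquery.Client()

# google-cloud-bigquery 0.25.x: create the job explicitly, then start it.
query_job = client.run_async_query(str(uuid.uuid4()), 'SELECT 1')  # removed in 0.28
query_job.begin()
rows = query_job.result()

# google-cloud-bigquery 0.28.0: client.query() creates and starts the job.
query_job = client.query('SELECT 1')
rows = query_job.result()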