googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

Help with adding python package dependencies to apache beam workers #2035

Closed: OrielResearchCure closed this issue 6 years ago

OrielResearchCure commented 6 years ago

Hello all, I am using Python code to run my pipeline, similar to the following:

from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions)

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'my-project-id'
google_cloud_options.job_name = 'myjob'
google_cloud_options.staging_location = 'gs://your-bucket-name-here/staging'
google_cloud_options.temp_location = 'gs://your-bucket-name-here/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

I would like to add the pandas-gbq package to my workers.

So I changed the first code line to the following:

options = PipelineOptions(flags=["--requirements_file", "/content/datalab/requirements.txt"])

The requirements.txt file was generated by: pip freeze > requirements.txt

But it fires the following error: CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/dataflow-requirements-cache', '-r', 'requirements.txt', '--no-binary', ':all:']' returned non-zero exit status 1

Any advice on what I am doing wrong? I don't have any issue running the pipeline without that addition, except that the worker gets stuck on the line that runs code using pandas-gbq.
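For reference, a minimal sketch of the requirements_file route (listing only pandas-gbq in a trimmed requirements.txt, instead of the full pip freeze output, is just an idea on my side and not something I have verified):

from apache_beam.options.pipeline_options import PipelineOptions

# Ask Dataflow to install the packages listed in requirements.txt on each worker.
# The error above shows Dataflow fetching them with pip's --no-binary :all: flag,
# so every listed package must have a source distribution available; a full
# pip freeze dump often includes packages that do not.
options = PipelineOptions(flags=['--requirements_file', '/content/datalab/requirements.txt'])

# /content/datalab/requirements.txt could contain just the extra package:
# pandas-gbq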

Many thanks for any advice.

eilalan

OrielResearchCure commented 6 years ago

The way it works is the following:

from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, SetupOptions, StandardOptions)

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
options.view_as(StandardOptions).runner = 'DataflowRunner'  # 'DataflowRunner' or 'DirectRunner'
options.view_as(SetupOptions).setup_file = "./setup.py"  # has to have this name and live on the local machine; a gs:// path did not work (file not found)

setup.py follows the format published in the Apache Beam GitHub repo and documentation; you only need to change the name of the library you want installed on the worker machines (a minimal sketch is below). setup.py is relevant for Python (pip) installations. For other kinds of dependencies there is the requirements.txt option, which follows a similar process via a different option property (I didn't try it).
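For reference, here is a minimal sketch of the kind of setup.py described above; the project name, version, and package list are illustrative, not the exact file I used:

# setup.py (minimal sketch) - staged to the workers via SetupOptions.setup_file.
import setuptools

setuptools.setup(
    name='dataflow-worker-deps',   # illustrative project name
    version='0.0.1',
    install_requires=[
        'pandas-gbq',              # the library to install on each worker
    ],
    packages=setuptools.find_packages(),
)

Dataflow builds a source distribution from this file and installs it on every worker at startup, so install_requires is where the extra libraries go.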

Dataflow will generate a workflow.tar to install on the worker machines during their startup phase (you will be able to see it in the logs).

Not all versions are supported by Apache Beam. Right now, BigQuery client API 0.25.0 is supported. There is documentation listing the latest supported library versions.

I hope that this will be helpful. eilalan