Closed OrielResearchCure closed 6 years ago
The way it works is the following:

```python
options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
options.view_as(StandardOptions).runner = 'DataflowRunner'  # or 'DirectRunner'
options.view_as(SetupOptions).setup_file = "./setup.py"  # must have this name and live on the local machine; a gs:// path didn't work (file not found)
```

setup.py follows the format published in the Apache Beam GitHub repo and documentation; you only need to change the name of the library you want installed on the worker machines. setup.py handles Python (pip) installation. For other kinds of dependencies there is an option called requirements.txt, a file with a similar process set through a different option property (I didn't try it).
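The setup.py format referred to above can be sketched roughly as follows. This is a minimal sketch, not the exact file from the Beam repo; the package name, version, and dependency list are placeholders:

```python
# setup.py -- minimal sketch following the layout in the Apache Beam
# docs/examples; the names and versions below are placeholders.
import setuptools

# Typically the only line you need to change: the libraries to install
# on each Dataflow worker at startup.
REQUIRED_PACKAGES = ['pandas-gbq']

if __name__ == '__main__':
    setuptools.setup(
        name='dataflow-worker-deps',   # placeholder package name
        version='0.0.1',
        install_requires=REQUIRED_PACKAGES,
        packages=setuptools.find_packages(),
    )
```

Dataflow stages this file and runs it on each worker, which is what produces the workflow.tar mentioned below.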
Dataflow will generate a workflow.tar and install it on the worker machines during their startup phase (you will be able to see it in the logs).
Not all versions are supported by Apache Beam. Right now, BigQuery client API 0.25.0 is supported. The documentation lists the latest supported library versions.
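If the workers need the BigQuery client, pinning it to the supported release in the dependency list keeps pip from resolving a newer, unsupported version. A sketch, assuming `google-cloud-bigquery` is the pip package name for the client cited above:

```python
# Pin the BigQuery client to the release noted above as supported by Beam,
# so workers don't pull a newer, unsupported version at startup.
REQUIRED_PACKAGES = [
    'google-cloud-bigquery==0.25.0',  # version cited above as supported
]
```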
I hope this is helpful.
eilalan
Hello all, I am running my pipeline with Python code similar to the snippet above.
I would like to add pandas-gbq package installation to my workers.
So I changed the first code line to the following:
```python
options = PipelineOptions(flags=["--requirements_file", "/content/datalab/requirements.txt"])
```
The requirements.txt file was generated with `pip freeze > requirements.txt`.
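A full `pip freeze` dump often lists far more packages than the workers need. One alternative worth trying is a minimal requirements file containing only the extra package the pipeline imports; the temp path and package list here are assumptions for illustration:

```python
import os
import tempfile

# Write a minimal requirements file with only the package the pipeline
# actually needs on the workers (assumption: pandas-gbq).
reqs_path = os.path.join(tempfile.mkdtemp(), 'requirements.txt')
with open(reqs_path, 'w') as f:
    f.write('pandas-gbq\n')

# Pass it to the pipeline the same way as in the snippet above:
flags = ['--requirements_file', reqs_path]
# options = PipelineOptions(flags=flags)  # needs apache_beam; shown for context
```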
But it throws the following error:

```
CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/dataflow-requirements-cache', '-r', 'requirements.txt', '--no-binary', ':all:']' returned non-zero exit status 1
```
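One way to narrow this down is to run the same pip command locally that the error reports: `--no-binary :all:` forces pip to fetch source distributions, so any package in the freeze output that ships only wheels will fail at this step. A sketch reconstructing that command, with the actual call left commented out so it can be run deliberately:

```python
import subprocess
import sys

# Reconstruct the command from the error message above; running it locally
# prints the real failure for whichever package can't be fetched as source.
cmd = [
    sys.executable, '-m', 'pip', 'install',
    '--download', '/tmp/dataflow-requirements-cache',
    '-r', 'requirements.txt',
    '--no-binary', ':all:',   # forbids wheels; this is what trips wheel-only packages
]
# subprocess.check_call(cmd)  # uncomment to reproduce the staging step locally
```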
Any advice on what I am doing wrong? I don't have any issue running the pipeline without that addition, except that the worker gets stuck on the line that runs code using pandas-gbq.
Many thanks for any advice.
eilalan