googledatalab / datalab

Interactive tools and developer experiences for Big Data on Google Cloud Platform.
Apache License 2.0

Running Apache beam pipeline on dataflow fires an error (DirectRunner running with no issue) #2066

Open OrielResearchCure opened 6 years ago

OrielResearchCure commented 6 years ago

Hi all,

A pipeline that was running perfectly now fires an error when using Dataflow, so I tried a simple pipeline and got the same error. Please let me know if there is anything I need to change or update in my environment, or if you have any other advice.

Many thanks, Eila

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions)

BUCKET_URL = 'gs://archs4'  # bucket used below for staging/temp files

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'PROJECT-ID'
google_cloud_options.job_name = 'try-debug'
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL  # 'gs://archs4/staging'
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL  # 'gs://archs4/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'

p1 = beam.Pipeline(options=options)

(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://bucket/test.txt', num_shards=1)
 )

p1.run().wait_until_finish()

will fire the following error:

CalledProcessErrorTraceback (most recent call last)
<ipython-input-17-b4be63f7802f> in <module>()
      5  )
      6 
----> 7 p1.run().wait_until_finish()

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    174       finally:
    175         shutil.rmtree(tmpdir)
--> 176     return self.runner.run(self)
    177 
    178   def __enter__(self):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in run(self, pipeline)
    250     # Create the job
    251     result = DataflowPipelineResult(
--> 252         self.dataflow_client.create_job(self.job), self)
    253 
    254     self._metrics = DataflowMetrics(self.dataflow_client, result, self.job)

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/retry.pyc in wrapper(*args, **kwargs)
    166       while True:
    167         try:
--> 168           return fun(*args, **kwargs)
    169         except Exception as exn:  # pylint: disable=broad-except
    170           if not retry_filter(exn):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job(self, job)
    423   def create_job(self, job):
    424     """Creates job description. May stage and/or submit for remote execution."""
--> 425     self.create_job_description(job)
    426 
    427     # Stage and submit the job when necessary

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job_description(self, job)
    446     """Creates a job described by the workflow proto."""
    447     resources = dependency.stage_job_resources(
--> 448         job.options, file_copy=self._gcs_file_copy)
    449     job.proto.environment = Environment(
    450         packages=resources, options=job.options,

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in stage_job_resources(options, file_copy, build_setup_args, temp_dir, populate_requirements_cache)
    377       else:
    378         sdk_remote_location = setup_options.sdk_location
--> 379       _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    380       resources.append(names.DATAFLOW_SDK_TARBALL_FILE)
    381     else:

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    462   elif sdk_remote_location == 'pypi':
    463     logging.info('Staging the SDK tarball from PyPI to %s', staged_path)
--> 464     _dependency_file_copy(_download_pypi_sdk_package(temp_dir), staged_path)
    465   else:
    466     raise RuntimeError(

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _download_pypi_sdk_package(temp_dir)
    525       '--no-binary', ':all:', '--no-deps']
    526   logging.info('Executing command: %s', cmd_args)
--> 527   processes.check_call(cmd_args)
    528   zip_expected = os.path.join(
    529       temp_dir, '%s-%s.zip' % (package_name, version))

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/processes.pyc in check_call(*args, **kwargs)
     42   if force_shell:
     43     kwargs['shell'] = True
---> 44   return subprocess.check_call(*args, **kwargs)
     45 
     46 

/usr/local/envs/py2env/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    188         if cmd is None:
    189             cmd = popenargs[0]
--> 190         raise CalledProcessError(retcode, cmd)
    191     return 0
    192 

CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/tmpyyiizo', 'google-cloud-dataflow==2.0.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 2

What am I missing? Has something changed?

Thanks, Eila
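
For comparison, a sketch of the one-line change that runs the same pipeline locally, which per the issue title works without issue:

# Hypothetical local-only variant: everything above stays the same,
# only the runner differs.
options.view_as(StandardOptions).runner = 'DirectRunner'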

OrielResearchCure commented 6 years ago

The issue was with the pip version: --download was deprecated, and I don't know where this needs to be mentioned / fixed. Running:

pip install pip==9.0.3

solved the issue. thanks, eila
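
A minimal sketch of applying this fix in a Datalab code cell (assuming the py2env environment shown in the traceback); restart the kernel afterwards:

%%bash
# Downgrade pip inside py2env so the deprecated `pip install --download`
# flag used by Beam's dependency stager works again.
source activate py2env
pip install pip==9.0.3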

ojarjur commented 6 years ago

@OrielResearchCure It looks like the Apache Beam Dataflow runner is trying to install its dependencies under the hood, relying on a very old version of pip to do so.

That simply will not work in Datalab as we use a newer version of pip, and not all of our packages are installed via pip anyway: most are installed via Conda.

I would classify it as a bug in the Apache Beam library, but it looks like you can work around it.

Specifically, it seems like you can circumvent that bug by manually installing the dependencies yourself.

Run the following in a Code cell, and then restart your notebook's kernel:

%%bash
source activate py2env
conda install pytz==2018.4
pip install apache-beam google-cloud-dataflow
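
After restarting the kernel, a quick sanity check (illustrative, not part of the original comment) is that the packages import and report the expected versions:

import apache_beam as beam
import pytz
# If the workaround above took effect, both imports succeed.
print(beam.__version__, pytz.__version__)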
ex-code commented 6 years ago

I had the same problem (error) with DataflowRunner (DirectRunner worked normally). pip install pip==9.0.3 solved the problem for me as well!

psyyip commented 4 years ago

I have the same error.

CalledProcessError: Command '['/usr/bin/python3', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', 'requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1

This is my environment:

apache-beam==2.16.0
tensorflow==2.1.0
tensorflow-metadata==0.15.2
tensorflow-transform==0.15.0
Python 2.7.13
pip 20.0.2

I think my pip is already updated. What am I missing?
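
One way to surface the underlying failure (a debugging sketch built from the command in the error message above) is to re-run the staging command by hand and read pip's full output:

%%bash
# Re-run the exact command Beam's dependency stager executed;
# pip's real error message is otherwise hidden behind CalledProcessError.
/usr/bin/python3 -m pip download --dest /tmp/dataflow-requirements-cache \
    -r requirements.txt --exists-action i --no-binary :all: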

Ark-kun commented 4 years ago

--no-binary might be causing problems.

psyyip commented 4 years ago

--no-binary might be causing problems.

@Ark-kun, what is your suggestion to fix it?

OrielResearchCure commented 4 years ago

I have moved to Python 3 and would like to share the installation steps to save others the time:

!pip install --upgrade --force-reinstall pip==9.0.3
!pip install --upgrade virtualenv --disable-pip-version-check
!pip install apache-beam --disable-pip-version-check
!pip install apache-beam[gcp] --disable-pip-version-check
!pip install apache-beam[test] --disable-pip-version-check
!pip install apache-beam[interactive] --disable-pip-version-check
!pip install --upgrade pip
!pip install tensorflow=='2.0.0b1'  # added tensorflow for machine learning and other nice utility methods
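
A quick check (illustrative, not from the original post) that the installation took effect:

import apache_beam as beam
import tensorflow as tf
from apache_beam.io.gcp import gcsio  # import fails if the [gcp] extras are missing
print(beam.__version__, tf.__version__)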

I hope that this is helpful,

Best, eilalan


pa-nguyen commented 3 years ago

pip install pip==9.0.3

@OrielResearchCure Where do you put this?