kubeflow / examples

A repository to host extended examples and tutorials
Apache License 2.0

[code_search] Dataflow Job for Pre-Processing doesn't create dataset on BigQuery #297

Closed: connected-bsamadi closed this issue 5 years ago

connected-bsamadi commented 5 years ago

The code_search.dataflow.cli.preprocess_github_dataset command finishes successfully, but it doesn't create a dataset on BigQuery. This is the output of the command:

running sdist
running egg_info
writing requirements to code_search.egg-info/requires.txt
writing code_search.egg-info/PKG-INFO
writing top-level names to code_search.egg-info/top_level.txt
writing dependency_links to code_search.egg-info/dependency_links.txt
reading manifest file 'code_search.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'code_search.egg-info/SOURCES.txt'
running check
creating code-search-0.1.dev0
creating code-search-0.1.dev0/code_search
creating code-search-0.1.dev0/code_search.egg-info
creating code-search-0.1.dev0/code_search/dataflow
creating code-search-0.1.dev0/code_search/dataflow/cli
creating code-search-0.1.dev0/code_search/dataflow/do_fns
creating code-search-0.1.dev0/code_search/dataflow/transforms
creating code-search-0.1.dev0/code_search/nmslib
creating code-search-0.1.dev0/code_search/nmslib/cli
creating code-search-0.1.dev0/code_search/t2t
copying files to code-search-0.1.dev0...
copying MANIFEST.in -> code-search-0.1.dev0
copying requirements.txt -> code-search-0.1.dev0
copying setup.py -> code-search-0.1.dev0
copying code_search/__init__.py -> code-search-0.1.dev0/code_search
copying code_search.egg-info/PKG-INFO -> code-search-0.1.dev0/code_search.egg-info
copying code_search.egg-info/SOURCES.txt -> code-search-0.1.dev0/code_search.egg-info
copying code_search.egg-info/dependency_links.txt -> code-search-0.1.dev0/code_search.egg-info
copying code_search.egg-info/requires.txt -> code-search-0.1.dev0/code_search.egg-info
copying code_search.egg-info/top_level.txt -> code-search-0.1.dev0/code_search.egg-info
copying code_search/dataflow/__init__.py -> code-search-0.1.dev0/code_search/dataflow
copying code_search/dataflow/utils.py -> code-search-0.1.dev0/code_search/dataflow
copying code_search/dataflow/cli/__init__.py -> code-search-0.1.dev0/code_search/dataflow/cli
copying code_search/dataflow/cli/arguments.py -> code-search-0.1.dev0/code_search/dataflow/cli
copying code_search/dataflow/cli/create_function_embeddings.py -> code-search-0.1.dev0/code_search/dataflow/cli
copying code_search/dataflow/cli/preprocess_github_dataset.py -> code-search-0.1.dev0/code_search/dataflow/cli
copying code_search/dataflow/do_fns/__init__.py -> code-search-0.1.dev0/code_search/dataflow/do_fns
copying code_search/dataflow/do_fns/dict_to_csv.py -> code-search-0.1.dev0/code_search/dataflow/do_fns
copying code_search/dataflow/do_fns/function_embeddings.py -> code-search-0.1.dev0/code_search/dataflow/do_fns
copying code_search/dataflow/do_fns/github_dataset.py -> code-search-0.1.dev0/code_search/dataflow/do_fns
copying code_search/dataflow/do_fns/prediction_do_fn.py -> code-search-0.1.dev0/code_search/dataflow/do_fns
copying code_search/dataflow/transforms/__init__.py -> code-search-0.1.dev0/code_search/dataflow/transforms
copying code_search/dataflow/transforms/bigquery.py -> code-search-0.1.dev0/code_search/dataflow/transforms
copying code_search/dataflow/transforms/function_embeddings.py -> code-search-0.1.dev0/code_search/dataflow/transforms
copying code_search/dataflow/transforms/github_bigquery.py -> code-search-0.1.dev0/code_search/dataflow/transforms
copying code_search/dataflow/transforms/github_dataset.py -> code-search-0.1.dev0/code_search/dataflow/transforms
copying code_search/nmslib/__init__.py -> code-search-0.1.dev0/code_search/nmslib
copying code_search/nmslib/search_engine.py -> code-search-0.1.dev0/code_search/nmslib
copying code_search/nmslib/search_server.py -> code-search-0.1.dev0/code_search/nmslib
copying code_search/nmslib/cli/__init__.py -> code-search-0.1.dev0/code_search/nmslib/cli
copying code_search/nmslib/cli/arguments.py -> code-search-0.1.dev0/code_search/nmslib/cli
copying code_search/nmslib/cli/create_search_index.py -> code-search-0.1.dev0/code_search/nmslib/cli
copying code_search/nmslib/cli/start_search_server.py -> code-search-0.1.dev0/code_search/nmslib/cli
copying code_search/t2t/__init__.py -> code-search-0.1.dev0/code_search/t2t
copying code_search/t2t/function_docstring.py -> code-search-0.1.dev0/code_search/t2t
copying code_search/t2t/function_docstring_extended.py -> code-search-0.1.dev0/code_search/t2t
copying code_search/t2t/query.py -> code-search-0.1.dev0/code_search/t2t
copying code_search/t2t/similarity_transformer.py -> code-search-0.1.dev0/code_search/t2t
Writing code-search-0.1.dev0/setup.cfg
Creating tar archive
removing 'code-search-0.1.dev0' (and everything under it)
Collecting apache-beam==2.5.0
  Using cached https://files.pythonhosted.org/packages/c6/96/56469c57cb043f36bfdd3786c463fbaeade1e8fcf0593ec7bc7f99e56d38/apache-beam-2.5.0.zip
  Saved /tmp/tmpe4wC_a/apache-beam-2.5.0.zip
Successfully downloaded apache-beam
Collecting apache-beam==2.5.0
  Using cached https://files.pythonhosted.org/packages/ff/10/a59ba412f71fb65412ec7a322de6331e19ec8e75ca45eba7a0708daae31a/apache_beam-2.5.0-cp27-cp27mu-manylinux1_x86_64.whl
  Saved /tmp/tmpe4wC_a/apache_beam-2.5.0-cp27-cp27mu-manylinux1_x86_64.whl
Successfully downloaded apache-beam
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md
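
For reference, one way to double-check whether the pipeline wrote anything is to list the BigQuery datasets in the target project. A minimal sketch, assuming the google-cloud-bigquery client library is available and using a placeholder project ID:

```python
# Minimal check, assuming the google-cloud-bigquery client library is installed.
# "my-gcp-project" is a placeholder; use the project the pipeline targets.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# The pre-processing output dataset should show up here once the Dataflow job
# has actually written its results.
for dataset in client.list_datasets():
    print(dataset.dataset_id)
```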
jlewi commented 5 years ago

The command above should submit a Dataflow job, which does the actual processing, so you should check whether that Dataflow job completed successfully.

Can you check your Dataflow job in the UI? Does it show a WriteToBigQuery step if you expand it? Did your Dataflow job actually complete?

Here's what I see in the GCP Cloud Console:

[Screenshot: dataflow_graph]
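
For context, the WriteToBigQuery step in that graph corresponds to a Beam transform along the lines of the sketch below; the table, dataset, project, and schema names here are placeholders, not the actual values used by the code_search pipeline:

```python
import apache_beam as beam

def write_rows(pipeline_options, rows):
    """Write a list of row dicts to a BigQuery table via a Beam pipeline."""
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'CreateRows' >> beam.Create(rows)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             table='my_table',          # placeholder table name
             dataset='my_dataset',      # placeholder dataset name
             project='my-gcp-project',  # placeholder project ID
             schema='nwo:STRING,path:STRING,content:STRING',
             # CREATE_IF_NEEDED creates the destination table; per this thread,
             # the enclosing dataset is not created automatically (see #301).
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```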

connected-bsamadi commented 5 years ago

I just checked it. I see three failed jobs. One of them lasted for a day and 13 hours and I was charged $162 for it!

jlewi commented 5 years ago

@connected-bsamadi sorry to hear that. There was a bug in the Dataflow job that was fixed by #302; it prevented the job from running efficiently. It should now run in an elapsed time of about 20 minutes and use ~110 CPU hours.

connected-bsamadi commented 5 years ago

Thanks @jlewi. We are talking to Google Billing Support to see if they can help us. It was a bit odd that every step of the job had failed, yet it still ran for 91 CPU days.

jlewi commented 5 years ago

#301 tracks creating the dataset if it doesn't already exist.
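
Until that lands, one workaround is to create the target dataset up front before submitting the pre-processing job. A minimal sketch, assuming the google-cloud-bigquery client library and placeholder project/dataset names:

```python
# Workaround sketch: create the target dataset up front if it is missing.
# Project and dataset names are placeholders; use the values passed to the job.
from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
dataset_ref = client.dataset("my_dataset")

try:
    client.get_dataset(dataset_ref)  # Raises NotFound if the dataset is absent.
except NotFound:
    client.create_dataset(bigquery.Dataset(dataset_ref))
```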