tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

[bigquery-io] Performance issue when number of columns is relatively large #1066

Open RuhuaJiang opened 4 years ago

RuhuaJiang commented 4 years ago

@vlasenkoalexey

Based on some of our internal benchmarks at Twitter, we saw that BigQuery reader performance is noticeably slower compared to TFExample. To give a sense:

In that benchmark we have 180 features (BigQuery columns; internal data, so I can't share the table link here). They are all primitive types (BOOLEAN, FLOAT, INTEGER) without repeated fields.

The unit is "examples per second", measured on a 32-core machine with the TF 2.2 tf2-2-2-cpu image (https://cloud.google.com/ai-platform/deep-learning-vm/docs/images).

| Batch size | BigQuery Reader | Gzipped TFExample decoding |
| --- | --- | --- |
| 32 | 1,279 | 6,317 |
| 256 | 10,223 | 26,646 |
| 1024 | 16,092 | 31,724 |

I understand there are many factors at play, such as batch_size and requested_streams for BigQuery, as well as reader_num_threads and parser_num_threads for TFExample (we are using https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_batched_features_dataset for TFExample decoding in this benchmark).
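
For context, the TFExample side of such a benchmark roughly looks like the sketch below. This is not the internal Twitter pipeline; the file pattern, feature spec, and thread counts are placeholders.

# Rough sketch of a gzipped-TFExample baseline (hypothetical paths and feature
# spec, not the internal Twitter setup).
import tensorflow as tf

# Placeholder feature spec; the real benchmark has 180 primitive features.
feature_spec = {
    "feature_%d" % i: tf.io.FixedLenFeature([], tf.float32) for i in range(180)
}

dataset = tf.data.experimental.make_batched_features_dataset(
    file_pattern="gs://my-bucket/tfexamples/*.tfrecord.gz",  # hypothetical path
    batch_size=1024,
    features=feature_spec,
    reader=lambda f: tf.data.TFRecordDataset(f, compression_type="GZIP"),
    reader_num_threads=16,   # tuned per machine in practice
    parser_num_threads=16,
    shuffle=False,
    num_epochs=1,
)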

I could potentially make a shareable and repeatable benchmark using a BigQuery public dataset, but it would take some amount of work (converting the data to TFExample, writing the benchmark code, etc.). So before that, I wanted to know: are you aware of this performance issue, and if so, is there a plan on your side to address it?

At this point, my gut feeling is that it is very likely caused by the BigQuery Storage API itself rather than by the Avro -> Tensor decoding part; do you know whether that is the case?
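
One way to test that hypothesis is to time the raw Storage API read without any Avro/Arrow -> Tensor decoding. Below is a rough sketch assuming the google-cloud-bigquery-storage v2 client; the billing project and selected columns are placeholders.

# Sketch: measure raw BigQuery Storage API throughput, skipping tensor decoding.
# Assumes google-cloud-bigquery-storage >= 2.x; project/columns are placeholders.
import time

from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()
requested_session = types.ReadSession(
    table="projects/bigquery-public-data/datasets/samples/tables/wikipedia",
    data_format=types.DataFormat.AVRO,
    read_options={"selected_fields": ["id", "num_characters", "timestamp"]},
)
session = client.create_read_session(
    parent="projects/my-project",  # billing project, placeholder
    read_session=requested_session,
    max_stream_count=1,
)

start = time.time()
rows = 0
for response in client.read_rows(session.streams[0].name):
    rows += response.row_count  # rows per message, no Avro deserialization
print("%.0f raw rows/s from the Storage API" % (rows / (time.time() - start)))

If the raw rows/s measured this way is already close to the numbers above, the bottleneck would be the Storage API / network rather than the decoding ops.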

I also tried switching from Avro to Arrow as the data format, but because the Arrow integration has several limitations, only a limited number of features could be streamed. There is no big difference in terms of examples per second so far (it might be because the feature type here is just integer and there are only a few features, so the difference between Avro and Arrow is not that large).

Here is the BigQuery Benchmark code:

# Copyright 2016 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Tests for BigQuery Ops."""

import os
import time

import tensorflow as tf  # pylint: disable=wrong-import-order

from tensorflow.python.framework import dtypes  # pylint: disable=wrong-import-order
from tensorflow.python.framework import ops  # pylint: disable=wrong-import-order
from tensorflow_io.bigquery import BigQueryClient  # pylint: disable=wrong-import-order
from tensorflow_io.bigquery import BigQueryReadSession  # pylint: disable=wrong-import-order

os.environ['GOOGLE_APPLICATION_CREDENTIALS']='my credentials'
GCP_PROJECT_ID = 'my-project'
DATASET_GCP_PROJECT_ID = "bigquery-public-data"
DATASET_ID = "samples"
TABLE_ID = "wikipedia"

def get_dataset_examples_per_second(dataset, num_iterations=5, num_batches=1000, batch_size=32):
    examples_per_second = []
    for _ in range(num_iterations):
      n = 0
      start_t = time.time()
      for data_batch in dataset.take(num_batches):
        n += batch_size
      delta_t = time.time() - start_t
      examples_per_second.append((num_batches * batch_size) / delta_t)
      print('Processed %d entries in %f seconds. [%.2f] examples/s' % (
            n, delta_t, examples_per_second[-1]))
    print('Average [%.2f] examples/s using TensorFlow v[%s]' % (
          sum(examples_per_second) * 1.0 / len(examples_per_second),
          tf.__version__))

def main():
  ops.enable_eager_execution()
  client = BigQueryClient()  
  selected_fields = ["id",
       "num_characters",
       "timestamp",
       "wp_namespace",
       #"is_redirect",
       "revision_id"
      ]

  output_types = [dtypes.int64,
       dtypes.int64,
       dtypes.int64,
       dtypes.int64,
       #dtypes.bool,
       dtypes.int64]

  read_session = client.read_session(
      "projects/" + GCP_PROJECT_ID,
      DATASET_GCP_PROJECT_ID, TABLE_ID, DATASET_ID,
      selected_fields,
      output_types,
      requested_streams=2,  # adjust when benchmarking
      row_restriction="num_characters > 1000",
      data_format=BigQueryClient.DataFormat.ARROW)
  dataset = read_session.parallel_read_rows()
  get_dataset_examples_per_second(dataset)

if __name__ == '__main__':
  main()
yongtang commented 4 years ago

@RuhuaJiang Does the performance disparity happen only when the number of feature columns is very large (e.g., 180 as mentioned), or does it happen even when the number of feature columns is small (like 1-5)?

vlasenkoalexey commented 4 years ago

Thanks for reporting this issue. I ran BQ reader benchmarks and tried to optimize the reader before the first release, using the Wiki dataset with a few columns, and compared it to GCS. According to my tests, BQ performance was about the same as GCS: BQ tended to be faster on powerful machines with multiple CPUs and slower on low-power VMs.

| Machine | Prefetch | Num streams | Sloppy | BQ | GCS |
| --- | --- | --- | --- | --- | --- |
| Local TF2.0 | N | 10 | False | 135014 | 76682 |
| Local TF2.0 | N | 1 | False | 51449 | 120388 |
| Local TF2.0 | N | 10 | True | 196110 | 79824 |
| Local TF2.0 | Y | 10 | False | 192840 | 79719 |
| Local TF2.0 | Y | 1 | False | 60328 | 140254 |
| Local TF2.0 | Y | 10 | True | 209776 | 104283 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | False | 32954 | 78432 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 1 | False | 34477 | 84363 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | True | 33629 | 73500 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | False | 24904 | 76749 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 1 | False | 31163 | 86520 |
| DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | True | 24660 | 72929 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | False | 56173 | 63212 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 1 | False | 32932 | 84269 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | True | 64520 | 62389 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | False | 69508 | 68000 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 1 | False | 31703 | 101243 |
| GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | True | 78304 | 79070 |

Arrow was a bit faster, though I haven't done a deep analysis yet.

| Machine | Num Streams | Sloppy | Avro | Arrow |
| --- | --- | --- | --- | --- |
| Local TF2.1 | 1 | False | 125176 | 185062 |
| Local TF2.1 | 10 | False | 88777 | 100657 |
| Local TF2.1 | 1 | True | 130562 | 186283 |
| Local TF2.1 | 10 | True | 87581 | 104446 |

Here is the benchmark I used: https://github.com/vlasenkoalexey/bigquery_perftest. Can you give it a shot?

I plan to spend some time debugging it once I'm done with my current project. One change which should help with throughput is creating multiple gRPC streams.

RuhuaJiang commented 4 years ago

@yongtang Good question. I had another, smaller dataset with 22 features (22 BigQuery columns vs. a TFExample with 22 features); the performance of BigQuery is slower there too. I haven't tried even smaller numbers like 1-5, though.

| Batch size | BigQuery | Gzipped TFExample |
| --- | --- | --- |
| 32 | 22,500 | 13,000 |
| 256 | 22,700 | 47,800 |
| 1024 | 22,700 | 52,300 |

@vlasenkoalexey Thanks for the info, very informative. Let me try https://github.com/vlasenkoalexey/bigquery_perftest as well (with both small and large numbers of columns).

vlasenkoalexey commented 4 years ago

I got a repro, will see what I can do to make it better.

kvignesh1420 commented 3 years ago

@vlasenkoalexey @RuhuaJiang any updates on this issue?

vlasenkoalexey commented 3 years ago

Sorry, I forgot to provide an update here. After profiling the reader on a benchmark with 100+ columns (see https://github.com/vlasenkoalexey/bigquery_perftest/blob/master/bq_perftest_mult_columns.py), I realized that the bottleneck is the batch step:

streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
    read_rows,
    cycle_length=streams_count64,
    num_parallel_calls=streams_count64,
    deterministic=not sloppy)
dataset = dataset.batch(batch_size)

If you move batching onto the same thread as the read, performance is going to be much better.

Here is the updated sample:

  def _read_rows(stream):
    dataset = read_session.read_rows(stream)
    dataset = dataset.batch(batch_size)
    return dataset

  streams_ds = tf.data.Dataset.from_tensor_slices(streams)
  dataset = streams_ds.interleave(
      _read_rows,
      cycle_length=streams_count64,
      num_parallel_calls=streams_count64,
      deterministic=not(sloppy))

Originally I planned to update the API to use this approach all the time, but realized that it is not always desirable. I'll update the BQ README page to make this clear and close this bug.
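
For reference, the pattern can be wrapped into a small helper like the sketch below. It assumes a get_streams() method on the read session (the source of the streams variable in the snippets above) and adds a trailing prefetch, which is not part of the original sample.

def make_batched_dataset(read_session, batch_size, streams_count64, sloppy):
  """Sketch: batch inside each stream, then interleave the batched streams."""
  def _read_rows(stream):
    # Batching here keeps the batch op on the same thread as the gRPC read.
    return read_session.read_rows(stream).batch(batch_size)

  streams = read_session.get_streams()  # assumed, as in the snippets above
  streams_ds = tf.data.Dataset.from_tensor_slices(streams)
  dataset = streams_ds.interleave(
      _read_rows,
      cycle_length=streams_count64,
      num_parallel_calls=streams_count64,
      deterministic=not sloppy)
  # prefetch is an addition: it overlaps reading/batching with the consumer.
  return dataset.prefetch(tf.data.experimental.AUTOTUNE)

Note that each batch produced this way comes from a single stream, which changes how rows are mixed compared to batching after the interleave; that is presumably part of why it is not always desirable.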

stevenzhang-support commented 2 years ago

Can I reopen this issue? I found that even when the number of columns is not large, the performance of the BigQuery reader is significantly lower than that of the GCS reader. The tensorflow-io version used here is tensorflow_io==0.17.0. Could tensorflow_io > 0.17.0, for instance 0.22.0, deliver something different? I see the performance chart, but is it a general rule of thumb that a GCS bucket always has better performance than BigQuery, and why?

vlasenkoalexey commented 2 years ago

Did you have a chance to try the approach suggested in https://github.com/tensorflow/io/issues/1066#issuecomment-757074730? Also, please confirm that your data is stored in a location close to where you are reading it from. It is known that BQ is slightly slower than GCS, but not by that much.
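
As a quick way to check the dataset location (a sketch using the google-cloud-bigquery client; the project and dataset reference are placeholders):

# Sketch: verify where the BigQuery dataset lives, so reads are not cross-region.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder billing project
dataset = client.get_dataset("bigquery-public-data.samples")  # placeholder dataset
print(dataset.location)  # e.g. "US"; read from a VM in or near that region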