Open RuhuaJiang opened 4 years ago
@RuhuaJiang Is the performance disparity happens only when feature columns number very large (e.g., 180 as mentioned), or it happens even when the feature columns number small (like 1-5)?
Thanks for reporting this issue. I did BQ reader benchmarks and tried to optimize it before first release for Wiki dataset with few columns and compared it to GCS. According to my tests BQ performance was about same as GCS. BQ tended to be faster for powerful machines with multiple CPUs and slower for low power VMs.
Machine | Prefetch | Num streams | Sloppy | BQ | GCS |
---|---|---|---|---|---|
Local TF2.0 | N | 10 | False | 135014 | 76682 |
Local TF2.0 | N | 1 | False | 51449 | 120388 |
Local TF2.0 | N | 10 | True | 196110 | 79824 |
Local TF2.0 | Y | 10 | False | 192840 | 79719 |
Local TF2.0 | Y | 1 | False | 60328 | 140254 |
Local TF2.0 | Y | 10 | True | 209776 | 104283 |
DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | False | 32954 | 78432 |
DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 1 | False | 34477 | 84363 |
DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | N | 10 | True | 33629 | 73500 |
DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | False | 24904 | 76749 |
DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 1 | False | 31163 | 86520 |
DLVM n1-standard-1 (1 vCPU, 3.75 GB memory) US-central TF1.15 | Y | 10 | True | 24660 | 72929 |
GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | False | 56173 | 63212 |
GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 1 | False | 32932 | 84269 |
GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | N | 10 | True | 64520 | 62389 |
GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | False | 69508 | 68000 |
GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 1 | False | 31703 | 101243 |
GCE n1-standard-8 (8 vCPUs, 30 GB memory) us-central1-a TF2.0 | Y | 10 | True | 78304 | 79070 |
Arrow was a bit faster, also I didn't do deep analysis yet.
Num Streams | Sloppy | Avro | Arrow | |
---|---|---|---|---|
Local TF2.1 | 1 | False | 125176 | 185062 |
Local TF2.1 | 10 | False | 88777 | 100657 |
Local TF2.1 | 1 | True | 130562 | 186283 |
Local TF2.1 | 10 | True | 87581 | 104446 |
Here is a benchmark I used https://github.com/vlasenkoalexey/bigquery_perftest, can you give it a shot?
I plan to spend some time debugging it once I'm done with my current project. One change which should help with throughput is creating multiple gRPC streams.
@yongtang good question. I had another smaller dataset that has 22 features (BigQuery 22 columns vs TFExample contains 22 features). the perf of BigQuery is slower too. Haven't tried even smaller numbers like 1...5 though.
Batch size | BigQuery | Gzipped TFExample |
---|---|---|
32 | 22,500 | 13,000 |
256 | 22,700 | 47,800 |
1024 | 22,700 | 52,300 |
@vlasenkoalexey thanks for the info, very informative. Let me try https://github.com/vlasenkoalexey/bigquery_perftest as well (with small & larger number of columns)
I got a repro, will see what I can do to make it better.
@vlasenkoalexey @RuhuaJiang any updates on this issue?
Sorry, forgot to provide an update here. After profiling reader on a benchmark with 100+ columns (see https://github.com/vlasenkoalexey/bigquery_perftest/blob/master/bq_perftest_mult_columns.py) I realized that the bottleneck is batch step:
streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
read_rows,
cycle_length=streams_count64,
num_parallel_calls=streams_count64,
deterministic=not(sloppy))
dataset = dataset.batch(batch_size)
If you move batching to the same thread as read, performance is going to be much better.
Here is updated sample:
def _read_rows(stream):
dataset = read_session.read_rows(stream)
dataset = dataset.batch(batch_size)
return dataset
streams_ds = tf.data.Dataset.from_tensor_slices(streams)
dataset = streams_ds.interleave(
_read_rows,
cycle_length=streams_count64,
num_parallel_calls=streams_count64,
deterministic=not(sloppy))
Originally I planned to update API to use this approach all the time, but realized that it is not always desirable. I'll update BQ readme page to make it clear and close this bug.
Can I reopen this issue as I found when number of columns is not large the performance from using BigQuery reader is significantly lower than GCS reader. The tensorflow io used here is tensorflow_io==0.17.0. Could tensorflow_io > 0.17.0 for instance 0.22.0 could deliver something difference. I see the performance chart but is it general rule of thumb that GCS bucket has always better performance than BigQuery and why?
Did you have a chance to try approach suggested in https://github.com/tensorflow/io/issues/1066#issuecomment-757074730 ? And also please confirm that your data is stored in the location close to where you are reading it from. It is known that BQ is slightly slower than GCS, but not that much.
@vlasenkoalexey
Based on some of our internal benchmark within Twitter we saw BigQuery reader performance is relatively slower comparing to TFExample. Just give a sense:
With that benchmark we have 180 features (BigQuery columns, internal data so I can't share table link here), they are all primitive types (BOOLEAN, FLOAT, INTEGER) without repeated fields.
Unit is "examples per second" on a 32 cores machine with TF2.2 tf2-2-2-cpu image (https://cloud.google.com/ai-platform/deep-learning-vm/docs/images)
I understand there are many factors out there such as parameters such as batch_size, requested_streams for BigQuery as well as reader_num_threads, parser_num_threads for TFExample (we are using https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_batched_features_dataset for TFExample decoding when doing this benchmark).
I could potentially make a sharable and repeatable benchmark using BigQuery public dataset. But it would take some amount of work (convert that to TFExample, make that benchmark code etc). So before that, I wanted to know are you aware of the performance issue, if yes is there some plan on your side addressing these?
At this point, my gut feeling is it is very likely caused by BigQuery Storage API itself rather than the Avro -> Tensor decoding part, wondering do you know about that or not.
I also tried to switch using Arrow instead of Avro as dataformat, but because Arrow integration has several limitations
With a limited amount of features been streamed. There is no big difference in terms of examples per second so far (it might because the feature type here is just integer and only few features, so the difference of Avro vs Arrow are not that much).
Here is the BigQuery Benchmark code: