googleapis / python-bigtable

BigTable: Async client for high throughput mutate_rows #9

Open mayurjain0312 opened 6 years ago

mayurjain0312 commented 6 years ago

OS: Ubuntu 16.04
Python version and virtual environment information: python --version → Python 2.7.12

With a batch size of 300 and a total of 3 nodes in the instance, the write throughput to BigTable is not good when using the mutate_rows API call.

import json

from google.cloud import bigtable

# `table`, `column_family_id`, `file_path`, and `create_sensor_data_id`
# are defined elsewhere in the script.
BULK_WRITE_BATCH_SIZE = 300

with open(file_path) as sensor_data_input_file:
    list_direct_row_obj = []
    for line in sensor_data_input_file:
        if not line.strip():
            continue

        sensor_json_data = json.loads(line)
        row_key = create_sensor_data_id(sensor_json_data)
        value = line

        direct_row_obj = bigtable.row.DirectRow(row_key, table)

        column_id = 'column_id_data'.encode('utf-8')
        direct_row_obj.set_cell(column_family_id, column_id, value.encode('utf-8'))

        list_direct_row_obj.append(direct_row_obj)

        # Send a batch once BULK_WRITE_BATCH_SIZE rows are buffered.
        if len(list_direct_row_obj) == BULK_WRITE_BATCH_SIZE:
            table.mutate_rows(list_direct_row_obj)
            list_direct_row_obj[:] = []

    # Write any remaining rows.
    if list_direct_row_obj:
        table.mutate_rows(list_direct_row_obj)

sduskis commented 6 years ago

Can you please describe some of the symptoms you're experiencing? table.mutate_rows is a synchronous operation, so you're limited to one call at a time. How many rows per second are you seeing on the cloud console?

mayurjain0312 commented 6 years ago

Is this the only "bulk write" option that BigTable offers? From the console, I can see around 800 rows written per second. (Production instance with 3 nodes)

mayurjain0312 commented 6 years ago

We are currently using MongoDB as our NoSQL database and we would prefer to move to BigTable. As far as write speed is concerned, MongoDB far exceeds BigTable (unless I am not using BigTable in the right manner).

mayurjain0312 commented 6 years ago

@sduskis: any updates?

sduskis commented 6 years ago

A 3-node Cloud Bigtable cluster can handle writing at least 30K rows per second, and each additional node adds another 10K rows per second. We have tested Cloud Bigtable with up to 3,500 nodes and seen a linear increase in performance.

The 800 rows per second is a limitation of the Python client and of whatever VM you're running on. The process sends 300 rows and then waits for the response before sending the next batch. You can run multiple processes that each read a different file to improve the total throughput against your cluster.
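
For illustration, here is a minimal sketch of that multi-process fan-out using the standard library's multiprocessing module. The write_file helper is hypothetical, not part of the client; it is assumed to open its own client and run the batched mutate_rows loop from the snippet above for a single input file.

from multiprocessing import Pool

def write_file(path):
    # Hypothetical helper: open a Bigtable client, look up the table,
    # and run the batched mutate_rows loop above for this one file.
    ...

if __name__ == '__main__':
    # Each worker process handles one input file, so several
    # mutate_rows calls can be in flight against the cluster at once.
    input_files = ['sensors-0.json', 'sensors-1.json',
                   'sensors-2.json', 'sensors-3.json']
    pool = Pool(processes=4)
    pool.map(write_file, input_files)
    pool.close()
    pool.join()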

The Java client performs far better in these situations, since it has a robust async implementation. We have not ported that functionality to Python yet. Here is an example with the Java version.

(FWIW, I'm the primary developer on the Java client)

mayurjain0312 commented 6 years ago

@sduskis: Thanks for responding. Do you know when this async functionality for Python will be ready?

sduskis commented 6 years ago

We don't have an ETA for this functionality in Python.

basphilippus commented 6 years ago

Hi,

Is there any progress on this feature? I would really like to scale my writes to Bigtable with Python code.

sduskis commented 6 years ago

We still do not have concrete plans to implement this feature.

mholiv commented 5 years ago

Hello! Are there any updates on plans to implement this feature?

Async is quickly becoming the norm in modern Python development. :)

BigTable IO is consistently the only blocking IO I encounter on a regular basis.

DynamoDB has an async wrapper in the form of aioboto3.

cwbeitel commented 5 years ago

@sduskis I can help implement this if you can provide a little guidance. Can you provide a rough sketch?

sduskis commented 5 years ago

@cwbeitel There's a lot that goes into this. There's a Java implementation that would be similar, but I'm not sure how much it would help (here).

Here are some constraints that I think are important:

  1. Allow up to n (probably 4?) concurrent RPCs in flight; after that, block.
  2. Each RPC batch should be capped by whichever of these maximums is hit first:
    1. 125 rows
    2. 10,000 total "entries" / column-qualifier updates across all rows
    3. an RPC size of no more than 5 MB
  3. The interface should have the following methods:
    1. b = table.createBatcher() - creates a new multi-RPC batcher
    2. b.add(mutation) - blocks, if necessary; adds the row to an in-memory buffer; sends an RPC if the buffer is full
    3. b.flush() - sends out the in-memory buffer and waits for all RPCs to complete
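
A minimal sketch of such a batcher on top of the current synchronous client might look as follows. The class and constant names are illustrative only, not part of the google-cloud-bigtable API, and only the row-count limit is tracked to keep the sketch short; the entry-count and request-size limits would need similar bookkeeping.

import threading
from concurrent.futures import ThreadPoolExecutor, wait

MAX_CONCURRENT_RPCS = 4  # constraint 1
MAX_ROWS_PER_RPC = 125   # constraint 2.1 (2.2 and 2.3 omitted here)

class MutationBatcher(object):
    def __init__(self, table):
        self._table = table
        self._rows = []
        # Caps the number of in-flight mutate_rows RPCs.
        self._semaphore = threading.Semaphore(MAX_CONCURRENT_RPCS)
        self._executor = ThreadPoolExecutor(max_workers=MAX_CONCURRENT_RPCS)
        self._futures = []

    def add(self, row):
        # Buffer a DirectRow; send an RPC once the buffer is full.
        self._rows.append(row)
        if len(self._rows) >= MAX_ROWS_PER_RPC:
            self._send()

    def _send(self):
        rows, self._rows = self._rows, []
        self._semaphore.acquire()  # blocks while MAX_CONCURRENT_RPCS are in flight

        def rpc():
            try:
                return self._table.mutate_rows(rows)
            finally:
                self._semaphore.release()

        self._futures.append(self._executor.submit(rpc))

    def flush(self):
        # Send whatever is still buffered and wait for all outstanding RPCs.
        if self._rows:
            self._send()
        wait(self._futures)
        self._futures = []

Usage would mirror the snippet at the top of the issue: replace the manual list with batcher.add(direct_row_obj) and call batcher.flush() after the input file is exhausted.
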
cwbeitel commented 5 years ago

@sduskis Yeah, that's cool, I hear you on those points. Thanks for sharing the Java implementation, that's helpful. It looks like the Pub/Sub Python client code, e.g. the published-message batcher, would also be helpful to emulate.

Here's a gist contextualizing this in the case of streaming deep learning training examples to a cbt table.

Having had a look at this, I'm probably going to try to accomplish the same thing with the Golang client first, given how involved it would be to do this with the Python client.

peschue commented 5 years ago

Any updates since last year?

tkaymak commented 3 years ago

I would like to express my wishes to make this a P1 priority feature request for 2021 ❤️

shadiramadan commented 2 years ago

Any updates?

JoshFerge commented 2 years ago

Hello, any updates? 🙃

iostreamdoth commented 1 year ago

Please show some love for the BigTable async client?

iprotsyuk commented 7 months ago

It seems BigtableDataClientAsync was released (docs) for general use a week ago 😅 🚀
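
For anyone landing here later, here is a minimal sketch of bulk writes with the new async data client. The class and method names are taken from the BigtableDataClientAsync docs, but double-check them against the release you install; the project, instance, table, and column family IDs below are placeholders.

import asyncio

from google.cloud.bigtable.data import (
    BigtableDataClientAsync,
    RowMutationEntry,
    SetCell,
)

async def main():
    client = BigtableDataClientAsync(project="my-project")
    table = client.get_table("my-instance", "my-table")
    try:
        # The batcher buffers entries and flushes them in concurrent
        # batches, which is essentially what this issue asked for.
        async with table.mutations_batcher() as batcher:
            for i in range(1000):
                row_key = "sensor#{:06d}".format(i).encode("utf-8")
                mutation = SetCell("column_family_id", "column_id_data", b"payload")
                await batcher.append(RowMutationEntry(row_key, [mutation]))
        # Exiting the context flushes any remaining buffered entries.
    finally:
        await client.close()

asyncio.run(main())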