MatinF closed this issue 1 year ago.
Hi @MatinF,
thanks for using our client.
How many rows does your DataFrame have in the write_influx function?
As a first suggestion, you can use a with statement in the write_influx function, which leaves closing the _write_client to Python:
```python
def write_influx(self, name, df, tag_columns):
    """Helper function to write signal dataframes to InfluxDB"""
    from influxdb_client import WriteOptions

    if self.test == 0:
        print("Please check your InfluxDB credentials")
        return

    with self.client.write_api(
        write_options=WriteOptions(
            batch_size=5000,
            flush_interval=1_000,
            jitter_interval=2_000,
            retry_interval=5_000,
        )
    ) as _write_client:
        _write_client.write(
            self.influx_bucket,
            record=df,
            data_frame_measurement_name=name,
            data_frame_tag_columns=tag_columns,
        )

    if self.verbose:
        print(f"- SUCCESS: {len(df.index)} records of {name} written to InfluxDB\n\n")
```
Regards
Thanks a lot, I'll try that tip!
Our users generally write large DataFrames, typically between 500 and 500,000 rows. A user may have e.g. 50 devices, each generating 1-5 GB of data that needs to go into InfluxDB per month.
As such, I have also been wondering if we should do some of the following:
1) Change the WriteOptions somehow to better accommodate these data sizes. The current settings we use may be suboptimal, but I am unsure what the ideal settings would be.
2) Currently, our script writes 'each signal' separately as a new DataFrame. For example, a log file may contain 50+ signals. We currently split the DataFrame so that we 'group it by signals', writing for each signal the timestamp and signal values as per the above script. I wonder if it would be substantially better to skip this 'group by signals' step and instead feed one DataFrame with a single timestamp column and e.g. 50+ signal columns (all resampled to the same timestamps)? I am not sure whether this is better performance-wise, though, and it is a bit unclear how the DataFrame should be structured in such a case.
3) I also looked at some of the newer additions for large-batch processing - would it make sense for our type of use case to consider these instead of our current implementation, or would the performance impact be insignificant?
https://github.com/influxdata/influxdb-client-python#how-to-efficiently-import-large-dataset
https://github.com/influxdata/influxdb-client-python/blob/master/examples/import_data_set_multiprocessing.py
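To illustrate what option 2 could look like, here is a minimal pandas-only sketch (the column and signal names are hypothetical) of reshaping per-signal 'long' data into one wide DataFrame with a shared timestamp index, which could then be written in a single call instead of one call per signal:

```python
import pandas as pd

# Hypothetical long-format data: one row per (timestamp, signal) pair,
# as produced when each signal is decoded and written separately.
long_df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-01-01 00:00:00", "2023-01-01 00:00:00",
             "2023-01-01 00:00:01", "2023-01-01 00:00:01"]
        ),
        "signal": ["Speed", "RPM", "Speed", "RPM"],
        "value": [42.0, 1500.0, 43.5, 1520.0],
    }
)

# Pivot to wide format: one DatetimeIndex, one column per signal.
wide_df = long_df.pivot(index="timestamp", columns="signal", values="value")

# wide_df now has one column per signal, so a single
# write_api.write(bucket, record=wide_df, data_frame_measurement_name=...)
# call could send all signals at once (assuming shared/resampled timestamps).
print(wide_df.shape)  # (2, 2): 2 timestamps x 2 signals
```

Note that this only works cleanly once all signals share the same resampled timestamps; otherwise the pivot produces NaN gaps that would need handling before the write.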
> Change the WriteOptions somehow to better accommodate these types of sizes. The current settings we use may be suboptimal, but I am unsure what the ideal settings would be.
You can remove the jitter interval (jitter_interval=0). InfluxDB Cloud is able to handle a lot of parallel writes.
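For example (a sketch only; the remaining values are simply the settings from the function above with the jitter removed, not tuned recommendations):

```python
from influxdb_client import WriteOptions

write_options = WriteOptions(
    batch_size=5000,       # points per batch
    flush_interval=1_000,  # ms before a partial batch is flushed
    jitter_interval=0,     # no random delay added before writes
    retry_interval=5_000,  # ms before retrying a failed write
)
```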
> I also looked at some of the newer additions for large-batch processing - would it make sense for our type of use case to consider these instead of our current implementation, or would the performance impact be insignificant?
It depends on your architecture. If your code is CPU-intensive, Python cannot switch to another thread to ingest data, because of the GIL - https://tenthousandmeters.com/blog/python-behind-the-scenes-13-the-gil-and-its-effects-on-python-multithreading/
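The linked multiprocessing example sidesteps the GIL by splitting the work across processes. A minimal sketch of that idea (all names here are hypothetical, and write_chunk is a placeholder for the real InfluxDB write; each process would need to create its own client, since clients are not picklable):

```python
import multiprocessing as mp
import pandas as pd

def write_chunk(chunk):
    """Placeholder worker: in a real setup this would create its own
    InfluxDB client and write the chunk of rows it received."""
    return len(chunk.index)

def write_in_parallel(df, chunk_size=5000, processes=2):
    # Split the DataFrame into row chunks of at most chunk_size rows.
    chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df.index), chunk_size)]
    # Fan the chunks out across worker processes, bypassing the GIL
    # for any CPU-bound per-chunk work.
    with mp.Pool(processes) as pool:
        written = pool.map(write_chunk, chunks)
    return sum(written)

if __name__ == "__main__":
    df = pd.DataFrame({"value": range(12_345)})
    print(write_in_parallel(df))  # total rows handled across all chunks
```

Whether this helps depends on where the time actually goes: if the bottleneck is network I/O to InfluxDB Cloud rather than CPU-bound decoding, the batching write_api with threads should already overlap the waits.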
This issue has been closed because it has not had recent activity. Please reopen if this issue is still important to you and you have additional information.
We develop IoT devices that upload large amounts of data. As part of this, we provide an InfluxDB script that enables users to process the data and write it to InfluxDB. The functions related to InfluxDB are in the file below:
https://github.com/CSS-Electronics/canedge-influxdb-writer/blob/master/utils_db.py
We are, however, having an issue where the script runs slowly in some instances - and for some users it is difficult to upload large amounts of data into e.g. their paid InfluxDB 2.0 Cloud instance.
The problem is probably something very basic in our implementation - e.g. not keeping the connection alive, or the fact that we write 'each signal' separately instead of writing a 'block' of data with one timestamp column and one column per signal. However, we are unsure what exactly the issue is.
Any feedback would be very helpful.
Martin