graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0
5.89k stars, 1.26k forks

[Q] Question regarding Efficiently Loading a Large Time Series Dataset into Graphite #2717

Closed AKheli closed 3 years ago

AKheli commented 3 years ago

I am trying to load 100 billion multi-dimensional time series datapoints into Graphite from a CSV file with the following format:

timestamp value_1 value_2 .... value_n
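For context, Graphite's plain-text (line) protocol expects one datapoint per line in the form `metric.path value unix_timestamp`. A minimal sketch of mapping one parsed CSV row onto those lines, mirroring the `master.<format>.dim<j>` naming used in the code below (the helper name is hypothetical):

```python
# Hypothetical helper: turn one parsed CSV row into Graphite line-protocol
# text, one "path value timestamp" line per dimension.
def row_to_lines(prefix, timestamp, values):
    return "".join(
        "{}.dim{} {} {}\n".format(prefix, j, v, timestamp)
        for j, v in enumerate(values)
    )

print(row_to_lines("master.csv", 1609459200, ["1.5", "2.5"]), end="")
# master.csv.dim0 1.5 1609459200
# master.csv.dim1 2.5 1609459200
```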

I tried to find a fast loading method on the official documentation and here's how I am currently doing the insertion (my codebase is in Python):

import socket
from datetime import datetime

from tqdm import tqdm

# connect to carbon's plain-text (line protocol) listener; 2003 is the
# default port, and the host comes from my command-line arguments
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((args.host, 2003))

f = open(args.file, "r")
bucket_size = int(5000 / columns)  # rows per batch, ~5000 datapoints each
insert_value = ""
for i in tqdm(range(rows)):
    value = f.readline().rstrip("\n").split(" ")
    currentTime, args.format = get_datetime(value[0])
    # convert to an integer Unix epoch timestamp, as the protocol expects
    currentTime = int((currentTime - datetime(1970, 1, 1)).total_seconds())

    for j in range(columns):
        insert_value += "master." + args.format + ".dim" + str(j) + " " + value[j + 1] + " " + str(currentTime) + "\n"
    if (i + 1) % bucket_size == 0 or i == rows - 1:
        sock.sendall(bytes(insert_value, "UTF-8"))
        insert_value = ""
sock.close()
f.close()

As the code above shows, it reads the dataset CSV file, builds batches of roughly 5000 datapoints, and then sends each batch with sock.sendall.

However, this method is not very efficient. I am trying to load 100 billion datapoints, and it is taking far longer than expected: loading just 5 million rows of 1500 columns each has already been running for 40 hours, with an estimated 15 hours remaining:

[screenshot of the loading progress output]

Is there a better way to load the dataset into Graphite?

deniszh commented 3 years ago

Hi @AbdelouahabKhelifati

The line protocol is probably the fastest way to insert data into Graphite, though implementations differ, of course. Maybe increasing parallelism can help. Or maybe Graphite is not the right tool for your task - you may need a more advanced TSDB, such as QuestDB, TimescaleDB or InfluxDB. My choice for heavy analytics queries nowadays would be QuestDB.
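For what it's worth, a minimal sketch of what "increasing parallelism" could look like here: shard the pre-built line-protocol batches across a few worker threads, each with its own socket to carbon's plain-text listener. The address, worker count, and function names are assumptions for illustration, not Graphite API:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Assumed carbon plain-text listener address; adjust to your deployment.
CARBON_ADDR = ("127.0.0.1", 2003)

def send_batches(batches, addr=CARBON_ADDR):
    # One connection per worker; stream every batch over it.
    sock = socket.create_connection(addr)
    try:
        for batch in batches:
            sock.sendall(batch.encode("utf-8"))
    finally:
        sock.close()

def parallel_load(batches, workers=4, addr=CARBON_ADDR):
    # Round-robin shard the batches so each worker sends an even share
    # over its own socket.
    shards = [batches[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda s: send_batches(s, addr), shards))
```

Whether this helps depends on where the bottleneck is; if carbon's cache is the limiting factor, more sockets won't change much.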

AKheli commented 3 years ago

Thanks deniszh.

Would you have a source for the claim that line protocol is the fastest way to load data into Graphite?

deniszh commented 3 years ago

My only source is my experience: I remember people trying both the line and pickle protocols, and line was faster. And those are the only two protocols that exist. Direct insertion into the whisper files is maybe theoretically fastest, but it would require much more code and data transformation.
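For reference, carbon's pickle listener (port 2004 by default) takes a length-prefixed pickled list of `(path, (timestamp, value))` tuples. A minimal sketch of the framing (the helper name is hypothetical):

```python
import pickle
import struct

def send_pickle(sock, datapoints):
    # datapoints: list of (metric_path, (unix_timestamp, value)) tuples,
    # serialized with pickle and prefixed by a 4-byte big-endian length,
    # as carbon's pickle receiver expects.
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))
    sock.sendall(header + payload)
```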

piotr1212 commented 3 years ago

You'll probably have to use the whisper library directly (https://github.com/graphite-project/whisper); if that is not fast enough, there is also a golang implementation.

AKheli commented 3 years ago

Thank you deniszh and piotr1212!