helium / blockchain-etl

Blockchain follower that follows and stores the Helium blockchain
Apache License 2.0
64 stars 37 forks source link

ETL Optimization - Using COPY instead of INSERT #293

Closed anthonyra closed 11 months ago

anthonyra commented 2 years ago

https://github.com/helium/blockchain-etl/blob/4fb696f1ff203689eb6e076383f6efe651b495ef/src/be_db_gateway_status.erl#L90-L106

Upon looking into some performance issues I'm having right now with my ETL I noticed that the above INSERT is eating up a lot of time while being committed. (I'm not 100% sure if it's the only culprit but the one that stands out the most right now).

Looking into optimizations and fixes I stumbled upon some suggestions about populating a DB; https://www.postgresql.org/docs/11/populate.html

With some benchmarks: https://www.depesz.com/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/

It appears that these INSERTS should be batched and even more so converted into COPY if at all possible. Upon reviewing the code I noticed that an ETS is being used to store the requests prior to them being sent to Postgres. I'm not that well versed yet when it comes to Erlang and ETS but could that table be converted to either CSV or preferably the Binary format needed for COPY?

I looked into epgsql to ensure that copy is supported and it appears that in 2020 it was added; https://github.com/epgsql/epgsql/blob/devel/test/epgsql_copy_SUITE.erl

anthonyra commented 2 years ago

Also an additional thought to the above;

This query updates the gateway_status table, I wonder if it'd also make sense to have a quick_sync_mode that you can set that will skip this status update until the height you specify? That's dependent on if the data used for this INSERT is simply built from the current ledger state found on the ETL?