launchflow / buildflow

BuildFlow is an open source framework for building large scale systems using Python. All you need to do is describe where your input is coming from and where your output should be written, and BuildFlow handles the rest. No configuration outside of the code is required.
https://docs.launchflow.com/buildflow
Apache License 2.0
193 stars 7 forks

add sink for clickhouse #209

Closed boetro closed 10 months ago

roopeshsn commented 1 year ago

Hi, @boetro! I would like to work on this issue! I would start by adding IO for ClickHouse and figure out the rest from there, if that makes sense.

boetro commented 1 year ago

SG! I went ahead and assigned it to you. Feel free to ask questions in here or on Discord!

roopeshsn commented 11 months ago

Hi, @boetro! There are two options to proceed with:

  1. Unlike DuckDB, ClickHouse isn't an in-memory solution, meaning it requires a server. The ClickHouse Python client doesn't create a server instance when the get_client method is called, so the user would have to set up the ClickHouse server manually from the downloads page.
  2. Create a server instance programmatically, either through Docker or with shell scripts that download and install the ClickHouse server from the downloads page if it is not already present.

Which option should I proceed with? I would like to hear your thoughts!

boetro commented 11 months ago

Great question!

We typically handle provisioning new resources by using Pulumi: https://www.pulumi.com

The BigQuery sink is a good example of this: https://github.com/launchflow/buildflow/blob/main/buildflow/io/gcp/bigquery_table.py. It will actually provision a new BigQuery table when the user runs buildflow apply.

I think for now I would probably punt on this aspect and focus more on connecting to an existing ClickHouse database. Once that is done we can revisit what would be the best way to provision a new database.

roopeshsn commented 11 months ago

Alright, I'll make the changes ASAP!

roopeshsn commented 11 months ago

Hi @boetro! Apart from creating a table, I am done with the rest of the implementation. I am not able to create a table whose schema is derived dynamically from a dictionary or a DataFrame through the ClickHouse Python client. The equivalent query I found in the DuckDB strategies is this: con.execute(f'CREATE TABLE "{self.table}" AS SELECT * FROM df').

I reached out to the ClickHouse community (Slack) with my query but didn't get help. Do you know of any workaround?
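
One possible workaround, sketched here under assumptions rather than taken from the BuildFlow code base: build the CREATE TABLE statement yourself from a sample row (a dict). The type mapping, the helper names clickhouse_type and create_table_statement, and the choice of a MergeTree engine with ORDER BY tuple() are all illustrative assumptions, not part of BuildFlow or the ClickHouse client.

```python
from datetime import date, datetime

# Assumed mapping from Python types to ClickHouse column types.
# This is illustrative only; a real sink would likely need more cases.
_PY_TO_CLICKHOUSE = {
    bool: "Bool",
    int: "Int64",
    float: "Float64",
    str: "String",
    datetime: "DateTime",
    date: "Date",
}


def clickhouse_type(value):
    """Return the ClickHouse type assumed for a sample Python value."""
    try:
        # Exact type lookup, so True maps to Bool rather than Int64.
        return _PY_TO_CLICKHOUSE[type(value)]
    except KeyError:
        raise TypeError(
            f"no ClickHouse type mapping for {type(value).__name__}"
        ) from None


def create_table_statement(table, sample_row):
    """Build a CREATE TABLE statement from one sample row (a dict)."""
    columns = ", ".join(
        f"`{name}` {clickhouse_type(value)}"
        for name, value in sample_row.items()
    )
    return (
        f"CREATE TABLE IF NOT EXISTS `{table}` ({columns}) "
        "ENGINE = MergeTree ORDER BY tuple()"
    )
```

If the client in use is clickhouse-connect, the resulting string could then be passed to client.command(...). Table and column names are interpolated directly, so this sketch is only suitable for trusted names.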

boetro commented 11 months ago

I think it's fine if the table isn't dynamically created and the sink assumes the user has already created it.

DuckDB is kind of the exception here. Our other data-warehouse-like sinks (Snowflake and BigQuery) assume the user has created the table (but do provide the Pulumi option for creating it).
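
A minimal sketch of what a sink that assumes an existing table might look like. This is not the BuildFlow implementation: the class name ClickHouseSinkSketch, the connect helper, the batching behaviour, and the default connection values are hypothetical, and the sketch assumes the clickhouse-connect driver, whose client exposes insert(table, data, column_names=...).

```python
class ClickHouseSinkSketch:
    """Buffers rows (dicts) and writes them to an existing table in batches."""

    def __init__(self, client, table, batch_size=1000):
        # `client` is any object with an insert(table, data, column_names=...)
        # method, for example a clickhouse-connect client.
        self.client = client
        self.table = table
        self.batch_size = batch_size
        self._buffer = []

    def push(self, row):
        """Queue one row and flush once the batch is full."""
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all buffered rows; the table is assumed to exist already."""
        if not self._buffer:
            return
        # Column order is taken from the first row (an assumption that all
        # rows share the same keys).
        column_names = list(self._buffer[0].keys())
        data = [[row[name] for name in column_names] for row in self._buffer]
        self.client.insert(self.table, data, column_names=column_names)
        self._buffer = []


def connect(host="localhost", port=8123, username="default", password=""):
    """Connect to an already-running ClickHouse server (hypothetical helper)."""
    # Imported lazily so the sink class can be used without the driver.
    import clickhouse_connect

    return clickhouse_connect.get_client(
        host=host, port=port, username=username, password=password
    )
```

Taking the client as a constructor argument keeps table creation and server provisioning out of the sink, which matches the approach described above of leaving both to the user for now.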