projectnessie / nessie

Nessie: Transactional Catalog for Data Lakes with Git-like semantics
https://projectnessie.org
Apache License 2.0

Nessie and data streaming #2649

Open valdo404 opened 2 years ago

valdo404 commented 2 years ago

Hi everyone,

I cannot find an example of a data stream that replaces an ETL job and works with Nessie. As I am interested in using Nessie as the base for my next data platform, I would like to know more about this.

Expected: an explanation in the documentation that makes clear whether Nessie works with streaming or not. Current state of affairs: the documentation gives no indication either way.

Best!

ajantha-bhat commented 2 years ago

Hi @valdo404, Nessie has been integrated with both the Spark and Flink engines (which support stream data processing), using the Iceberg and (WIP) Delta Lake table formats. We have demos for batch processing here. The configuration needed to use Nessie should be the same for streaming.

For stream data processing, Nessie relies on the capabilities of the underlying table format. For Nessie, a merge/upsert/insert from a streaming source is just another commit from the underlying table format on a particular reference. So, it should work.
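To illustrate the point that each streaming write is just another commit on a reference, here is a toy in-memory model in Python. It is only a sketch of the idea: `ToyCatalog`, `Commit`, and `stream_into` are hypothetical names made up for this example and are not part of the Nessie, Iceberg, Spark, or Flink APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """One table-format commit (hypothetical stand-in, not a Nessie type)."""
    message: str
    rows: tuple
    parent: Optional["Commit"] = None


@dataclass
class ToyCatalog:
    """Maps reference names (e.g. branches) to their latest commit."""
    refs: Dict[str, Optional[Commit]] = field(
        default_factory=lambda: {"main": None}
    )

    def commit(self, ref: str, message: str, rows: List[int]) -> Commit:
        # Create a new commit whose parent is the current head of `ref`,
        # then move the reference to point at it.
        new_head = Commit(message, tuple(rows), self.refs[ref])
        self.refs[ref] = new_head
        return new_head

    def log(self, ref: str) -> List[str]:
        # Walk the parent chain from the head, newest commit first.
        messages = []
        node = self.refs[ref]
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


def stream_into(catalog: ToyCatalog, ref: str, micro_batches: List[List[int]]) -> None:
    # Each micro-batch from the (simulated) stream becomes one commit.
    for i, batch in enumerate(micro_batches):
        catalog.commit(ref, f"micro-batch {i}", batch)


catalog = ToyCatalog()
stream_into(catalog, "main", [[1, 2], [3], [4, 5, 6]])
print(catalog.log("main"))
# prints ['micro-batch 2', 'micro-batch 1', 'micro-batch 0']
```

From the catalog's point of view in this model, a streaming writer looks no different from a batch writer that commits often; the real behavior depends on what the table format and engine support.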

valdo404 commented 2 years ago

Thank you for the reply, Ajantha. I will test it ASAP and let you know.

ajantha-bhat commented 2 years ago

@valdo404: No problem. You can contact us if you have any doubts or run into problems.

Also, don't try this with the latest Nessie release; it doesn't work with the latest Iceberg (we have a pending PR at Iceberg).

So, as mentioned in the demo, use Nessie 0.9.2 with Iceberg.