Duplication of pk in table data_integration_system_statistics

mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

MIT License

Duplication of pk in table data_integration_system_statistics #22

Closed ierosodin closed 4 years ago

ierosodin commented 4 years ago

Hi, I'm using mara to create a pipeline and run the workflow in parallel with multiprocessing. In my setup, several pipelines may run at the same time, which results in duplicate timestamp keys in data_integration_system_statistics. Would it be possible to use another column (such as node_output_id or something else) as the primary key? Or maybe use a global lock to avoid writing rows at the same time?

EDIT: For now, I have added another column named 'index' as the primary key and index column to work around this problem.

Thanks! Sincerely, Tony
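The collision reported above can be reproduced in isolation. The following is a minimal sketch, not mara's actual schema or code: it assumes a simplified table whose only key column is the timestamp, and it uses an in-memory SQLite database as a stand-in for the real database. The table name `system_statistics`, the `cpu_usage` column, and the `record_sample` helper are hypothetical.

```python
import sqlite3
from datetime import datetime

# Hypothetical stand-in for the statistics table: the timestamp is the only key column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE system_statistics (timestamp TEXT PRIMARY KEY, cpu_usage REAL)"
)


def record_sample(timestamp: datetime, cpu_usage: float) -> bool:
    """Insert one sample; return False if the primary key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO system_statistics VALUES (?, ?)",
                (timestamp.isoformat(), cpu_usage),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Two pipeline runs that happen to sample at the same instant.
same_instant = datetime(2020, 1, 1, 12, 0, 0)
first_run_ok = record_sample(same_instant, 12.5)
second_run_ok = record_sample(same_instant, 40.0)

print(first_run_ok, second_run_ok)
```

In this sketch the second insert is rejected because both runs produce the same key value, which matches the duplicate key error described in the report.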

jankatins commented 4 years ago

Oops, I'm not sure whether mara as a whole is designed to run in parallel.

As a first workaround, you can set the period for gathering statistics to a very high value to basically disable it: https://github.com/mara/data-integration/blob/master/data_integration/config.py#L46

ierosodin commented 4 years ago

I see.

Because data-integration is a lightweight ETL framework, I think it is well suited to running quick, short-lived tasks. I'm now using mara as the framework for a modular data processing tool. (Maybe there is another framework that is more suitable, but I haven't found it.)

Thanks!

martin-loetzsch commented 4 years ago

So far I haven't come across a reason not to run pipelines in parallel. In fact, we do that in quite a number of projects.

I haven't seen the duplicate timestamp issue yet, but it should be easy to fix. I'd suggest adding the run_id to the primary key. The run_id is known in the code that logs things to the db: https://github.com/mara/data-integration/blob/master/data_integration/logging/run_log.py#L69 (it is set a few lines below)
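To illustrate the suggestion, here is a minimal sketch of a composite primary key over run_id and timestamp. It is not the actual mara table definition: the table layout, the `cpu_usage` column, and the `record_sample` helper are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
# Hypothetical simplified table: the key is the pair (run_id, timestamp).
conn.execute(
    """
    CREATE TABLE system_statistics (
        run_id INTEGER NOT NULL,
        timestamp TEXT NOT NULL,
        cpu_usage REAL,
        PRIMARY KEY (run_id, timestamp)
    )
    """
)


def record_sample(run_id: int, timestamp: datetime, cpu_usage: float) -> None:
    """Insert one sample for the given run; raises IntegrityError on a duplicate key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO system_statistics VALUES (?, ?, ?)",
            (run_id, timestamp.isoformat(), cpu_usage),
        )


# Two different runs sampling at the same instant no longer collide.
same_instant = datetime(2020, 1, 1, 12, 0, 0)
record_sample(1, same_instant, 12.5)
record_sample(2, same_instant, 40.0)

# A repeated timestamp within the same run is still rejected.
try:
    record_sample(1, same_instant, 99.0)
    same_run_duplicate_rejected = False
except sqlite3.IntegrityError:
    same_run_duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM system_statistics").fetchone()[0]
print(row_count, same_run_duplicate_rejected)
```

With the run_id as part of the key, uniqueness only has to hold within a single run, so parallel runs can log at the same instant without any locking.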

ierosodin commented 4 years ago

I see. I'll add run_id as the primary key. Thanks!

martin-loetzsch commented 4 years ago

Hi @ierosodin, did you get any further with this?

ierosodin commented 4 years ago

Hi @martin-loetzsch, I recently used run_id as the primary key to avoid the duplicate timestamps.

martin-loetzsch commented 4 years ago

Do you want to make a PR for this?