Closed ierosodin closed 4 years ago
Oops, not sure if mara as a whole is designed to run in parallel.
As a first workaround, you can set the frequency of gathering statistics to a very high value to basically disable it: https://github.com/mara/data-integration/blob/master/data_integration/config.py#L46
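A minimal sketch of that workaround. This uses a stand-in namespace rather than the real `data_integration.config` module, and the config name `system_statistics_collection_period` is an assumption taken from the linked line; in a real project you would patch the attribute on the imported module instead:

```python
import types

# Stand-in for data_integration.config (name and default are assumptions);
# in a real project: `import data_integration.config as config`
config = types.SimpleNamespace(system_statistics_collection_period=lambda: 1)

# Effectively disable statistics gathering by making the period huge
# (here: once per year).
config.system_statistics_collection_period = lambda: 365 * 24 * 3600

print(config.system_statistics_collection_period())
```

Since mara config values here are plain module-level callables, reassigning the attribute before the pipeline starts is enough; no restart-safe persistence is involved.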
I see.
Because data-integration is a lightweight ETL framework, I think it is suitable for running rapid and short lifecycle's task. I'm now using mara as the framework to work with a modulable data process tool. (maybe there is another framework more suitable, but I haven't found it)
Thanks!
So far I did not think of a reason not to run pipelines in parallel. In fact, we do that in quite a number of projects.
I haven't seen the duplicate timestamp issue yet, but it should be easy to fix. I'd suggest adding the run_id to the primary key. The run_id is known in the thing that logs to the db: https://github.com/mara/data-integration/blob/master/data_integration/logging/run_log.py#L69 (it is set a few lines below)
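The suggestion above can be sketched with an in-memory SQLite table. The table and column names are illustrative, not mara's actual log schema; the point is that a composite primary key of (run_id, timestamp) lets two parallel runs log the same timestamp without colliding:

```python
import sqlite3

# Illustrative schema (not mara's real one): run_id is part of the
# primary key, so identical timestamps from different runs are fine.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node_output (
        run_id    INTEGER NOT NULL,
        timestamp TEXT    NOT NULL,
        message   TEXT,
        PRIMARY KEY (run_id, timestamp)
    )""")

ts = "2020-01-01 12:00:00.000"

# Two different runs, same timestamp: no key collision.
conn.execute("INSERT INTO node_output VALUES (1, ?, 'run 1')", (ts,))
conn.execute("INSERT INTO node_output VALUES (2, ?, 'run 2')", (ts,))

# Same run and same timestamp would still violate the key.
try:
    conn.execute("INSERT INTO node_output VALUES (1, ?, 'dup')", (ts,))
    collided = False
except sqlite3.IntegrityError:
    collided = True

row_count = conn.execute("SELECT COUNT(*) FROM node_output").fetchone()[0]
print(row_count, collided)  # 2 True
```

Within a single run, timestamps are produced sequentially by one process, so collisions inside the same run_id are much less likely than across parallel runs.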
I see. I'll add run_id as the primary key. Thanks!
Hi @ierosodin, did you get further with this?
Hi @martin-loetzsch, I recently used run_id in the primary key to avoid running into duplicate timestamps.
Do you want to make a PR for this?
Hi, I'm now using mara to create a pipeline, and I use multiprocessing to run the workflow in parallel. In my situation, several pipelines may be running at the same time; however, this causes duplicate timestamp keys to be created. Would it be possible to use another column (like node_output_id or something else) as the primary key? Or maybe use a global lock to avoid inserting rows at the same time?
EDIT: For now, I added another column named 'index' as the primary key (an indexed, auto-incrementing column) to fix this problem.
Thanks! Sincerely, Tony
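The workaround described in the EDIT above can be sketched like this (the schema is illustrative, not mara's actual log table): an auto-incrementing surrogate key makes every row unique even when two parallel pipelines write identical timestamps.

```python
import sqlite3

# Illustrative schema: "index" is a surrogate primary key, and the
# timestamp column is merely indexed, so duplicates are allowed.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node_output (
        "index"   INTEGER PRIMARY KEY AUTOINCREMENT,
        timestamp TEXT NOT NULL,
        message   TEXT
    )""")
conn.execute("CREATE INDEX ix_node_output_ts ON node_output (timestamp)")

ts = "2020-01-01 12:00:00.000"

# Two parallel pipelines logging the exact same timestamp: both rows
# are stored, each with its own auto-assigned "index" value.
conn.execute("INSERT INTO node_output (timestamp, message) VALUES (?, 'pipeline A')", (ts,))
conn.execute("INSERT INTO node_output (timestamp, message) VALUES (?, 'pipeline B')", (ts,))

row_count = conn.execute("SELECT COUNT(*) FROM node_output").fetchone()[0]
print(row_count)  # 2
```

Compared to adding run_id to the primary key, this drops the uniqueness guarantee on (run, timestamp) entirely, trading strictness for simplicity.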