mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License

Improve parallel read #74

Open leo-schick opened 2 years ago

leo-schick commented 2 years ago

See #75

leo-schick commented 2 years ago

I now have this running in production without any issues.

jankatins commented 9 months ago

What's the actual problem here? That the reads run as Python code in threads and therefore run into the GIL? I always thought that, because we "run everything as a subprocess", we never hit that problem.

This feels like a lot of complexity and I don't really see the gain here. Any chance to make that gain clearer to me?

leo-schick commented 9 months ago

@jankatins the problem I was trying to solve is that, when running a parallel task, the commands for the internal sub-pipelines have to be evaluated before the pipeline starts working. I had a file bucket with millions of files to process. In my case, the pipeline became so big that it was unable to start, probably because of memory consumption, or because the job was still reading the complete file list of the bucket after more than an hour.

This PR changes the parallel task behavior by moving the sub-pipeline generation into a separate feed worker task. The PR is complex and I am not 100% sure it should be part of mara. It is a first attempt at implementing file-based micro-batch streaming via mara. I have since realized that it might not have been the best idea 💡 I have had in recent years 😉
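
To make the feed worker idea above more concrete, here is a minimal sketch in plain Python. It is not the code from the PR and does not use any mara API: `list_files`, `feed`, `work` and `run_streaming` are hypothetical names, and it uses threads with a bounded `queue.Queue` purely for brevity, whereas mara runs its tasks as processes. The point it illustrates is that commands are generated lazily and handed to workers as they appear, so the full file list never has to be built before work starts.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

_DONE = object()  # sentinel that tells a worker to stop


def list_files(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a lazy bucket listing (yields one name at a time)."""
    for i in range(n):
        yield f"bucket/file_{i:07d}.csv"


def feed(commands: Iterable[str], q: queue.Queue, n_workers: int) -> None:
    """Feed worker: push commands into the bounded queue as they are generated."""
    for command in commands:
        q.put(command)  # blocks while the queue is full, which caps memory use
    for _ in range(n_workers):
        q.put(_DONE)  # one stop signal per worker


def work(q: queue.Queue, handle: Callable[[str], str], results: List[str]) -> None:
    """Worker: process commands until the stop signal arrives."""
    while True:
        command = q.get()
        if command is _DONE:
            break
        results.append(handle(command))


def run_streaming(commands: Iterable[str], handle: Callable[[str], str],
                  n_workers: int = 4, max_pending: int = 100) -> List[str]:
    """Start the workers first, then feed them; at most max_pending commands are queued."""
    q: queue.Queue = queue.Queue(maxsize=max_pending)
    results: List[str] = []
    workers = [threading.Thread(target=work, args=(q, handle, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    feeder = threading.Thread(target=feed, args=(commands, q, n_workers))
    feeder.start()
    feeder.join()
    for w in workers:
        w.join()
    return results


if __name__ == "__main__":
    processed = run_streaming(list_files(1000), str.upper, n_workers=3, max_pending=10)
    print(len(processed))
```

In the eager variant that caused the problem, the equivalent of `list(list_files(...))` would have to finish before any worker could start. With the bounded queue, listing and processing overlap, and the number of pending commands never exceeds `max_pending`.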