tdt / input

A package which allows you to set up your own EML (Extract, Map and Load) tool.
http://thedatatank.com

Allow for scalable E(T)ML processes #55

Closed: coreation closed 10 years ago

coreation commented 11 years ago

Problem

Currently E(T)ML processes are run in one go as part of the job call. Great, but not scalable: a cronjob calls the job URI from time to time to run its EML and waits for it to finish, which means the Apache idle timeout has to be raised.

Solution

Threading (yes, threading, not asynchronous execs). Async execs used to be the practical route because threading in PHP was quite a low-level DIY thing. However, https://github.com/krakjoe/pthreads apparently has the solution. So my proposal is to make the Input class extend the Thread class and run it as a thread.

In a next stage we should have some sort of "status" page, or just add a status to the job that holds its current state (e.g. running, sleeping, ...). Last but not least, this links perfectly with https://github.com/tdt/input/issues/53: the logging can now be written to a file (for every chunk, for example), so that it not only exists permanently (or not, still open for debate) but is also user friendly (see issue 53 for more explanation).

pietercolpaert commented 11 years ago

I'm in favor of changing {joburi}/run into a command-line command.

coreation commented 11 years ago

So using an exec to jumpstart the EML? Basically the same result as with threading, only now you'll have to pass parameters with it and write a new script file that interprets these parameters and starts up the Input:

e.g. with joburi = win-events:

exec(php jumpstart.php win-events);
return ???

( => This return code should probably be 200, but it perhaps needs more documentation, as we don't know whether the job will finish correctly.)
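The calling side of the exec approach could be sketched as follows. This is only an illustration under assumptions: a POSIX shell, a `php` binary on the PATH, and the `jumpstart.php` script named in this comment. `buildJumpstartCommand` and `launchJob` are hypothetical names, not part of tdt/input.

```php
<?php

/**
 * Build a shell command that starts jumpstart.php for one job in the
 * background. Hypothetical helper; assumes a POSIX shell.
 */
function buildJumpstartCommand(string $jobName, string $script = 'jumpstart.php'): string
{
    // escapeshellarg() keeps the job name from being interpreted by the shell.
    // Redirecting the output and appending "&" lets exec() return right away
    // instead of waiting for the job to finish.
    return sprintf(
        'php %s %s > /dev/null 2>&1 &',
        escapeshellarg($script),
        escapeshellarg($jobName)
    );
}

/**
 * Start the job and return without waiting for it (hypothetical).
 */
function launchJob(string $jobName): void
{
    exec(buildJumpstartCommand($jobName));
}
```

Because the caller no longer waits, a response can only report that the job was started, not that it finished, which is the open question about the return code raised above.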

In jumpstart.php there should be something like (pseudocode).

<?php

// Pseudocode: fetchJob() and Input stand in for the real lookup and class.
$job = fetchJob($jobname);
$input = new Input($job);
$input->execute();

I'm in favor of both (threading or exec). I just got excited when I saw a fresh project blooming that provides PHP with threads; last year these libraries weren't there.

pietercolpaert commented 11 years ago

I was thinking about just using the job-uri as identification but not as a trigger anymore. In TDTInput I'd delete the Controllers and I would just allow for a command line script to be able to launch it.

In the end, we have always launched it from the command line before. Running this job in the browser didn't make a lot of sense in the past, I believe.

coreation commented 11 years ago

Agreeing on the CLI-only part, just saying you have to pass the job identifier: pass its name to the script so that the script knows which job's ETML it has to start. I disagree on removing the controllers, because how are you going to see which jobs are configured, and how? This also breaks with the notion of "you GET what you PUT", does it not?
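Reading the job identifier inside the CLI script might look like the sketch below. `parseJobIdentifier` is a hypothetical helper, and the set of characters it accepts is an assumption made for illustration, not a rule taken from tdt/input.

```php
<?php

/**
 * Extract the job identifier from the CLI arguments.
 * Returns null when no usable identifier was passed.
 * Hypothetical helper; the allowed character set is an assumption.
 */
function parseJobIdentifier(array $argv): ?string
{
    // $argv[0] is the script name, $argv[1] the first real argument.
    $name = $argv[1] ?? '';

    // Accept only letters, digits, underscores, dashes, and slashes.
    if (!preg_match('#^[A-Za-z0-9_\-/]+\z#', $name)) {
        return null;
    }

    return $name;
}
```

A script would call `parseJobIdentifier($argv)` and exit with a non-zero status when it gets `null`, so that a cronjob can tell a bad invocation from a started job.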

pietercolpaert commented 11 years ago

I only meant the controller that executes the job.

coreation commented 11 years ago

That's the same controller... ;) We just delete the part that says "when you pass /run or /test we run the job." Thoughts on the job now writing its logs to a file after every chunk, using the logging directory configured in general.json? I wouldn't pass the entire log to the CLI anymore: it makes you wait until everything is finished, which makes early error detection impossible and development cycles hell sometimes.
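Per-chunk logging to a file could be sketched like this. How the logging directory is read from general.json is not shown in this thread, so the sketch takes the directory as a parameter. `logChunk`, the one-file-per-job naming, and the line format are all assumptions for illustration.

```php
<?php

/**
 * Append one log line for a processed chunk to the job's log file.
 * Hypothetical helper; file naming and line format are assumptions.
 */
function logChunk(string $logDir, string $jobName, int $chunk, string $message): string
{
    // One file per job; slashes in the job name are replaced so the
    // name cannot point into another directory.
    $file = rtrim($logDir, '/') . '/' . str_replace('/', '_', $jobName) . '.log';

    $line = sprintf("[%s] chunk %d: %s\n", date('c'), $chunk, $message);

    // FILE_APPEND adds to the end of the file, LOCK_EX guards against
    // two writers interleaving their lines.
    if (file_put_contents($file, $line, FILE_APPEND | LOCK_EX) === false) {
        throw new RuntimeException("Could not write to $file");
    }

    return $file;
}
```

Since each chunk is written as soon as it is processed, the file can be followed while the job is still running, which is what makes early error detection possible.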

pietercolpaert commented 11 years ago

Yes ;)

coreation commented 11 years ago

Ok, this closes the discussion for now, I'll start implementing it on a different branch.

coreation commented 10 years ago

Present in the Blackwell branch, soon to be pushed to master.