Only regenerate documentation if something changed in the code

GoogleCodeExporter commented 9 years ago

Proposed additional logic

1) During documentation generation, save 'documentation creation date' [A]
   in $OUTPUT_DIR
2) During documentation generation, scan $INPUT_DIR for
   max(date/time of file modification) [B]
3) If B >= A, documentation generation is required;
   otherwise it can be skipped
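
The three steps above can be sketched roughly as follows. This is only an illustration in Python, not the cookbook's actual Kettle jobs; the stamp file name `.doc_generated` and all function names are made up for the example.

```python
import os

STAMP_NAME = ".doc_generated"  # hypothetical marker file kept in $OUTPUT_DIR


def newest_mtime(input_dir):
    """Return the latest modification time of any file under input_dir (B)."""
    newest = 0.0
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            newest = max(newest, os.path.getmtime(os.path.join(root, name)))
    return newest


def generation_required(input_dir, output_dir):
    """True when no stamp exists yet, or when B >= A."""
    stamp = os.path.join(output_dir, STAMP_NAME)
    if not os.path.exists(stamp):
        return True
    return newest_mtime(input_dir) >= os.path.getmtime(stamp)


def record_generation(output_dir):
    """Save the documentation creation date (A) by touching the stamp file."""
    os.makedirs(output_dir, exist_ok=True)
    stamp = os.path.join(output_dir, STAMP_NAME)
    with open(stamp, "a"):
        pass
    os.utime(stamp, None)  # set the stamp's mtime to "now"
```

A run would call `generation_required(...)` first and `record_generation(...)` after a successful generation. Note that a pure timestamp check like this does not notice deleted input files.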

I believe this is useful logic for people who want to include documentation 
generation as an automatic part of their data integration runs, especially if 
these runs happen at short intervals (e.g. every 15 min).

Original issue reported on code.google.com by jan.aertsen on 19 Oct 2010 at 10:20

GoogleCodeExporter commented 9 years ago

Hi Jan!

Well, in fact it already generates documentation only for 
changed / added files (and it will delete docs for deleted files). It's 
not yet set up for the target directories though - the 
create-output-subdirs job is always executed for all subdirs, even if they 
exist already. 

From what I can see, in an unmodified run, attempting to create dirs that 
already exist seems to take most of the time.

So, my proposal would be to fix that first - that way, the job will run 
considerably faster when there is nothing to do.

Original comment by roland.bouman on 20 Oct 2010 at 12:05

GoogleCodeExporter commented 9 years ago

Practically all my code is organised in sub-directories. Guess that's why I 
didn't notice the process is incremental already. Anyhow, I need to implement a 
piece of code as described above for KFF. I need to make sure that back-up only 
happens when code has changed  :-)  I guess we could use the same logic 
(backward compatible to 3.2.x)

Original comment by jan.aertsen on 20 Oct 2010 at 6:28

GoogleCodeExporter commented 9 years ago

Jan, yes, absolutely :)
I fear the current transformation may not be entirely clean enough for a 
generic reusable transformation, but you can certainly get a headstart by 
copying process-files.ktr and throwing out what you don't need.

The logic is: 
1) use "Get subfolder names" step to fetch directories. this is available in 
kettle 3.2, and unlike the "Get filenames" step does have an option to recurse 
subdirs

2) have the subfolder outputstream kick off a get filenames step. 

3) Do step 1 and 2 both for the source dir and for the target dir, and use a 
"Merge diff" step to compare relative path and (short) filename. This will 
identify deleted and new files for free. Updated and unmodified files show up 
as identical and need further processing to discern between updates and 
unmodified files.

4) In the stream that shows up as "identical" in the diff output, use "stream 
lookup" steps to fetch data for source file and target file. 

5) Use a filter to compare last modified time of source and target files to see 
if a file is updated or unmodified.

That's it :)
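
Outside of Kettle, the same five steps could be sketched like this. It is a rough Python illustration rather than what process-files.ktr does internally; the function names are invented for the example, and it assumes that a target file has the same relative path and file name as its source file.

```python
import os


def list_files(base_dir):
    """Steps 1 and 2: map relative path -> last-modified time, recursing subdirs."""
    found = {}
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            full = os.path.join(root, name)
            found[os.path.relpath(full, base_dir)] = os.path.getmtime(full)
    return found


def classify(source_dir, target_dir):
    """Steps 3 to 5: diff both listings, then compare timestamps of matches."""
    source = list_files(source_dir)
    target = list_files(target_dir)
    result = {"new": [], "deleted": [], "updated": [], "unmodified": []}
    for path in sorted(source.keys() | target.keys()):
        if path not in target:
            result["new"].append(path)        # only in source
        elif path not in source:
            result["deleted"].append(path)    # only in target
        elif source[path] > target[path]:
            result["updated"].append(path)    # source is newer than target
        else:
            result["unmodified"].append(path)
    return result
```

Only the "new" and "updated" entries would need regenerating, and the "deleted" entries tell you which target files to remove.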

Original comment by roland.bouman on 20 Oct 2010 at 6:45

Added labels: Type-Enhancement
Removed labels: Type-FeatureRequest

GoogleCodeExporter commented 9 years ago

Ok - I modified process-files to only output directories that do not yet exist 
in the output dir. This should shave off a number of seconds from the execution 
time in case you have a large number of directories.
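
As a rough illustration of the idea (a Python sketch with invented function names, not the actual Kettle transformation), only the subdirectories missing from the output dir are passed on for creation:

```python
import os


def missing_subdirs(source_dir, target_dir):
    """Return relative paths of source subdirs that do not exist in target_dir."""
    missing = []
    # os.walk is top-down by default, so parents are listed before children.
    for root, dirs, _files in os.walk(source_dir):
        for name in dirs:
            rel = os.path.relpath(os.path.join(root, name), source_dir)
            if not os.path.isdir(os.path.join(target_dir, rel)):
                missing.append(rel)
    return missing


def create_missing(source_dir, target_dir):
    """Create only the missing directories and return what was created."""
    created = missing_subdirs(source_dir, target_dir)
    for rel in created:
        os.makedirs(os.path.join(target_dir, rel), exist_ok=True)
    return created
```

On an unmodified run `missing_subdirs` returns an empty list, so no directory creation is attempted at all.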

Right now I don't have time to make the solution really clean and reusable, but 
at some point in the future I should probably cut up process-files into a number 
of jobs. When I do that I should probably also do an incremental update of the 
template directory, just because we can :)

For now I'm moving on to implementing new features.

Original comment by roland.bouman on 20 Oct 2010 at 7:39

Changed state: Fixed
