Add caching hashes to new log command

mes5k commented 8 years ago

It would be great if we could add a 'verbose' mode to the new log command that will print the hash values and data that was hashed for each task cache to help debug which changes are triggering a task to either use the cache or re-run. One idea is print things in a such a way that we can simply diff the logs of different runs to see what's changed.

This feature request comes to mind because I'm currently struggling with a pipeline where a very long running process keeps re-running, even though as far as I can tell nothing before it changes. I've stared at -trace nextflow.processor.TaskProcessor logs quite a bit, but with only limited success.

pditommaso commented 8 years ago

Yes, this could be useful. If you want to give it a try the relevant code is this method.

mes5k commented 8 years ago

Sounds good. I'll take a crack at this when I've got some spare cycles.

pditommaso commented 8 years ago

Nice. If you do, please commit on a branch different from master. Tx!

pditommaso commented 8 years ago

I've added a command line option -dump-hashes to print task hash keys to the stdout. It will be included in the next release.

mes5k commented 8 years ago

Finally got a chance to look at this and I think the output looks fine. However, I'd advocate that instead of writing to STDOUT write to a file in the same way that a trace file gets written (hashes.out, hashes.out.1, ...). I can imagine using -dump-hashes by default because when a long-running process re-runs having the history will be important for debugging.

pditommaso commented 8 years ago

I see your point but I would avoid to handle a separate mechanism to write this to a file. It should be managed by relying on the logging library plus the marker interface.

The idea is to have a configuration file that would allow to define separate logging files depending specify classes or markers.

mes5k commented 8 years ago

Ok - wasn't clear that you still wanted to go this direction. Looks easy enough to setup markers, but it'll take a bit to figure out how to enhance LoggerHelper to write marked messages to a separate file.

Do you want the default behavior of -dump-hashes to always be to stdout? If not, them maybe change the log level before releasing? If the ultimate goal is to log to a separate file, then perhaps better not to introduce behavior than to potentially change it in the future.

mes5k commented 8 years ago

Hi Paolo, here's a preliminary patch that logs the hashes to a file using SLF4J/Logback markers: https://github.com/mes5k/nextflow/tree/log_hashes

The main problem is that it currently always creates a "hashes" file, even if -dump-hashes isn't set. I don't see an immediately obvious way of getting the -dump-hashes flag from CmdRun into LoggerHelper apart from adding it to CliOptions, which doesn't seem right. If you can provide me some pointers I'll see what I can do.

Otherwise, I'd say that if you want to add additional markers then some additional refactoring will probably also be useful.

And tests. :)

pditommaso commented 8 years ago

Hi Mike, thanks for looking into this and sorry for the late reply. I'm a bit packed these days. Regarding this feature, instead of hard-coding the logic for the hashes dumping in the LoggerHelper, I was thinking to a more generic mechanism based on a configuration file that would extend the current one.

The idea is to have logger configuration file somehow similar to the following:

logger {
    classOrMarkerName {
       level = INFO|DEBUG|etc
       fileName = 'log-file-name'
       appender = <Rolling or any other appender supported by Logback> 
       otherAttributesDependingTheAppender = etc ..  
    }
}

One problem is that LoggerHelper is invoked before the nextflow config file is parsed. Thus it may required a separate configuration file.

Regarding the -dump-hashes it is important to have an explicit flag to avoid to waste cpus cycles when the hashes are not printed (default). It's true that the runOptions are not accessible from the LoggerHelper class, but the other way around could be possible. I mean when in the logger file the hashes marker is enabled this condition can be used to use the -dump-hashes flag or an equivalent one in the session object.

mes5k commented 8 years ago

Getting complicated! :)

It seems there are two main problems. First is getting -dump-hashes working with the logger (and I agree that the -dump-hashes flag is important). Second is providing a generic mechanism for adding marker-based logging.

To address the problem of only logging when -dump-hashes is set, I think it would be possible to add the hashMarker stuff AFTER the default logger has been created, which is to say add a public method to LoggerHelper that will create a marker and add it to the logger. Haven't tried this yet, but I'm guessing it'll work.

For the second problem of allowing custom marker based logging - I think that could be done using a normal logback configuration file. When I was googling around yesterday most of what I found involved modifications to logback config files. I suspect that a fair amount of the LoggerHelper code could go away in this case. However, a config file based setup means that this would all be compiled into the jar, so it wouldn't really be a runtime decision.

Maybe it helps to think about things as dynamically configured logging (-dump-hashes) and statically configured logging (default and other pre-defined markers)? Maybe have a config file for the static stuff and a few functions to dynamically enable markers?

Let me know what you think!

pditommaso commented 8 years ago

I think it would be possible to add the hashMarker stuff AFTER the default logger has been created

I don't think. I didn't try either but the marker can be created at any time, actually I think it should be created as a static field in the context of TaskProcessor. The filter doesn't need to access to the marker reference. It can work just on the marker name i.e. the string declared in the logging config file name.

Also have a look to the ch.qos.logback.classic.turbo.TurboFilter abstract class. It's a better alternative to standard filters.

For the second problem of allowing custom marker based logging - I think that could be done using a normal logback configuration file.

I was thinking at that, but I've realised that there too many details to be managed. I don't like it would be useful for a final user that just want to switch the logging for some specific component.

mes5k commented 8 years ago

I think we're on the same page. I'll see if I can get the logging for hashes configured in TaskProcessor.

I looked at TurboFilters (there's even a MarkerFilter provided), but it wasn't immediately obvious how to use them with a FileAppender.

mes5k commented 7 years ago

Hi Paolo, please take a look at this branch when you have a minute: https://github.com/mes5k/nextflow/tree/log_hashes

I think this separates the creation of marker based logs from normal logging in a reasonable way, will allow other markers to be added by developers, and achieves the behavior I'm looking for: hashes are logged to a separate file, but only when the -dump-hashes argument is set. Let me know what you think!

pditommaso commented 7 years ago

Hi Mike, apologies for the late reply. My notes:

Maker(s) should not defined in the Session object. I suggest to have a LoggerMarker class where they will declared as static public fields.
After the logger is configured this will eventually set a flag dumpHashes on the Launcher object which will propagate it to the Session object.
What about the logger configuration file? Do you agree on that approach or do you have a better alternative?

mes5k commented 7 years ago

Thanks Paolo.

Creating a LoggerMarker class is no problem.
I'm not sure I follow the comment about Launcher setting a dumpHashes flag, though. As I understand things dumpHashes is only an option for CmdRun, whose arguments aren't (I think) parsed by Launcher. Maybe I should be be calling LoggerHelper.addMarkerLogger(LoggerMarker.hashMarker) in CmdRun instead of Session?
I think a logger configuration file would probably be more in line with how logback is normally configured, but that basically means converting all of the LoggerHelper code into an XML config file. This would require a better understanding of both nextflow and logback than I have. I'm also not sure how easy it would be to dynamically create log files like what we're trying to do with dumpHashes. Given that, I'd probably stick with LoggerHelper as is.

pditommaso commented 7 years ago

I'm not sure I follow the comment about Launcher setting a dumpHashes flag, though. As I understand things dumpHashes is only an option for CmdRun, whose arguments aren't (I think) parsed by Launcher.

Nope, dumpHashes is also a Session attribute, see here, which enables the logging of the hashes. See here.

The rationale is to keep the current behaviour (by using the cmd line option) or redirect to a log file when specified by using the logger config file.

Thus the configLogger get the Launcher instance as parameter. See here, thus when parse the logger config it can set a (new) dumpHashes flag on the Launcher which the can used used to set the equivalent flag on the CmdRun object.

I think a logger configuration file would probably be more in line with how logback is normally configured, but that basically means converting all of the LoggerHelper code into an XML config file.

I don't think it's possible to implement to current logging behaviour with a XML or groovy logback configuration file. Actually this is the reason why I needed to wrote the LoggerHelper class. Moreover the XML/Groovy configuration would be too complicated for a final user. If one wants to enable the logging for a specific NF component, he/she doesn't want to spent time to learn how Logback config works.

mes5k commented 7 years ago

It took me a while to understand your message and it's because I've been working on a slightly different feature than what I think you're expecting! My branch above changes current behavior by never writing hashes to STDOUT and always writing them to a log file. It will only write the hashes to a file if the -dump-hashes flag is set. My branch also makes it easy for you to add additional markers and from whichever point in the code is convenient, enable them for separate logging.

What my branch does not do is allow users to modify a config file that determines whether the hashes should be written to a log file or not. Sorry if this wasn't clear!

Anyway, I know you don't like changing existing behavior, but the only time I can imagine writing hashes to STDOUT would be if they weren't being logged to a file and I was redirecting them to a file. The whole point (for me) of separating the hashes is to be able more easily identify the hashes and diff the outputs of runs to help diagnose problems. If the hashes are mixed in with the rest of STDOUT that becomes much harder.

If you really don't want to change current behavior then another option would be to add a -dump-hashes-file option and change this branch to use that instead of -dump-hashes.

pditommaso commented 7 years ago

I see. At some point I will need to such configurable logging mechanism to enable the tracing of different subcomponents of NF. For this reason I want to avoid to have a special file handling for the process hashes logging.

What about instead of using a marker and an associated filter to print the hashes to the stderr instead of stdout. Then it would be easy for you to redirect it to a file. At the same time it should be relatively easy to implement and generic enough for future extensions.

mes5k commented 7 years ago

What I've got should allow you to create new markers fairly easily. Here's what I think you'd do to add a new marker:

Add newMarker to LoggerMarker with the name "newmark".
Call LoggerHelper.addMarkerLogger(LoggerMarker.newMarker) somewhere convenient. This can be controlled with an argument (e.g. -log-newmark) or some other mechanism.
Modify log statements to use newMarker.
Now you should see any log message identified by newMarker written to a file called "newmark".

My branch doesn't necessarily unify all log generation logic in one place, but it is generic (not specific to hashes) and should allow you to fairly easily add new markers for different logs. Maybe that's enough until a grand unified logging solution can be developed?

Printing to stderr instead of stdout would meet my needs, but wouldn't that also require some custom logic in LoggerHelper if we're going to distinguish based on markers?

pditommaso commented 7 years ago

Printing to stderr instead of stdout would meet my needs, but wouldn't that also require some custom logic in LoggerHelper if we're going to distinguish based on markers?

Yes. I didn't check yet but I think it would just require a marker filter that print the output over stderr instead of the stdout. It should be straightforward.

mes5k commented 7 years ago

Ok, I implemented writing all log messages that have a marker to stderr. It took more than just a filter. I needed to add a whole separate ConsoleAppender. Anyway, the branch is here: https://github.com/mes5k/nextflow/tree/marker_stderr. Overall, I don't think this a great strategy as it adds a bunch of logic just for redirecting to stderr, but it can't be generalized any further. This is works for -dump-hashes, but won't work for much else.

For only about 40 more lines of code the log_hashes is a much more generalized solution. For demonstration purposes I made this commit (and reverted it) to show how additional markers that get written to files can be setup.

So, let me know which branch you prefer and I'll open a pull request. Or neither and I'll drop it. In any case, I really can't spend any more time on this particular issue.

pditommaso commented 7 years ago

Hi Mike, thanks a lot for your effort and suggestions. I will review carefully the two implementations and decide how to move on. However this will take some time because I will be in break for the next 4 weeks.

mes5k commented 7 years ago

Sounds good, enjoy your break!

nextflow-io / nextflow

Add caching hashes to new log command #211