Closed taketwo closed 6 years ago
Yes, there should definitely be a way to control this.
What's your opinion on adding a constructor argument to Checkpoint to control its verbosity?
We've had our discussions about how to control verbosity in the past. I don't like this suggestion because now we suddenly have verbosity parameters at several places. Who is really in control?
One possibility (which would be consistent with sklearn) is to have several verbosity levels and change `Checkpoint` so that it only prints if `verbosity >= 2`. This solution is, however, very opaque. What if I only want the `Checkpoint` message but not `PrintLog`?
Another solution is to have `Checkpoint` take a `sink` argument that controls where the print message goes. By default, we would have `sink=print`, but the user may change it to `sink=logger.info` or `sink=noop` (which would suppress the messages).
This would be an indirect way to address the original concern, but it has some advantages. Most notably, it gives the user control over what should happen to the message, but it also makes testing easier (no more `capsys` necessary).
Of course, if we made that change, it should affect all instances where we currently use print.
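To make the idea concrete, here is a minimal sketch of what such a `sink` parameter could look like. The `Checkpoint` class below is a toy stand-in for illustration only, not skorch's actual callback:

```python
import logging

def noop(*args, **kwargs):
    """Sink that swallows all messages."""

class Checkpoint:
    """Toy stand-in for the real callback, only to illustrate the sink idea."""
    def __init__(self, sink=print):
        self.sink = sink  # any callable that accepts a message string

    def on_epoch_end(self, path):
        self.sink("Checkpoint! Saving model to {}.".format(path))

messages = []
cp = Checkpoint(sink=messages.append)   # divert output, e.g. for testing
cp.on_epoch_end("model.pt")

silent = Checkpoint(sink=noop)          # suppress messages entirely
silent.on_epoch_end("model.pt")

logged = Checkpoint(sink=logging.getLogger("train").info)  # route to logging
```

Because the sink is just a callable, tests can pass `messages.append` and assert on the collected strings instead of capturing stdout.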
Edit: while we are at this, what about adding an option to save the history? Useful for resuming training later.
See #230
Your second solution with `sink` sounds reasonable. Is this something you want/have time to implement yourself, or should I give it a try?
Regarding history saving, I am aware of that new feature. I was proposing to make use of it (by adding an option to `Checkpoint` to control whether it should save history as well).
> Your second solution with sink sounds reasonable. Is this something you want/have time to implement yourself, or should I give it a try?
We are always happy to get some help. But first I would wait for @ottonemo's input regarding this idea.
> Regarding history saving, I am aware of that new feature. I was proposing to make use of it (by adding an option to `Checkpoint` to control whether it should save history as well).
I see, this is probably a sensible addition.
I kinda like the idea of storing a checkpoint flag in the history and letting `PrintLog` handle the printing because that would be a reasonable solution for all callbacks that need to indicate whenever something extraordinary was done. The obvious drawback is that it is limited to flags if one wants to keep the solution simple, and that it can get confusing quite fast when using learn rate schedulers and checkpointing at the same time (which symbol / markup represents what).
I'm fine with a `sink` parameter. Maybe we can provide a sink that takes the `PrintLog` instance from the net and calls `record_epoch_suffix(msg)` with the message on it. This function would then append `msg` to the usual print log output for that epoch (similar to how extra information in TQDM is handled).
Something like this:
```python
class PrintLog(Callback):
    # ...
    def record_epoch_suffix(self, msg):
        self.suffix_data.append(msg)

    # ...
    def on_epoch_end(...):
        # ...
        self._sink(tabulated.rsplit('\n', 1)[-1] + ' ' + ' '.join(self.suffix_data), verbose)
        self.suffix_data = []


def print_log_suffix(net, msg):
    dict(net.callbacks_)['PrintLog'].record_epoch_suffix(msg)


Net(callbacks=[Checkpointing(sink=print_log_suffix)])
```
> I kinda like the idea of storing a checkpoint flag in the history and letting PrintLog handle the printing because that would be a reasonable solution for all callbacks that need to indicate whenever something extraordinary was done. The obvious drawback is that it is limited to flags if one wants to keep the solution simple and that it can get confusing quite fast when using learn rate schedulers and checkpointing at the same time (which symbol / markup represents what).
Oh, I did not think that far! Sounds really useful. But I don't see why this should be limited to flags only. We may add an "events" list to the epoch dict in the history. Then there will be `History.record_event(desc)`. `Checkpoint` would call it with a simple "checkpoint", and a learn rate scheduler may record something like "LR 0.01 → 0.001". And in the terminal we will see the following:
```
epoch    test_loss    train_loss      dur  events
-------  -----------  ------------  ------  ---------------
      1       0.1177        0.9514  4.0282
      2       0.1077        0.8544  4.0082  checkpoint
      3       0.1107        0.7514  4.1202
      4       0.1177        0.9514  4.0282  LR 0.01 → 0.001
```
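A minimal sketch of how such a `record_event` could work. The `History` layout and the `new_epoch` helper below are simplified assumptions for illustration, not skorch's actual implementation:

```python
class History(list):
    """Toy history: a list of per-epoch dicts, each with an 'events' list."""
    def new_epoch(self):
        self.append({'events': []})

    def record_event(self, desc):
        """Attach an event description to the current (last) epoch."""
        self[-1]['events'].append(desc)

history = History()
history.new_epoch()                         # epoch 1: nothing special
history.new_epoch()                         # epoch 2
history.record_event('checkpoint')
history.new_epoch()                         # epoch 3
history.new_epoch()                         # epoch 4
history.record_event('LR 0.01 -> 0.001')

# Rendering the "events" column: join all events recorded for an epoch.
column = [' '.join(epoch['events']) for epoch in history]
```

All events of an epoch end up in one cell, which is exactly the property criticized below when several events coincide.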
> `Net(callbacks=[Checkpointing(sink=print_log_suffix)])`
We could even think about a syntax like `Net(callbacks=[Checkpointing(sink='foo')])`, where a string as sink would write to the corresponding column in the `PrintLog` (which assumes there is one).
> We may add "events" list to the epoch dict in the history
I could see this getting confusing when there are many events during an epoch, and would prefer one column per item. This also makes the history easier to parse.
> We could even think about a syntax like `Net(callbacks=[Checkpointing(sink='foo')])` where a string as sink would write to the corresponding column in the PrintLog (which assumes there is one)
Do I understand right that you propose to bypass writing history?
> I could see this getting confusing when there are many events during an epoch, and would prefer one column per item
You mean something like this?
```
epoch    test_loss    train_loss      dur  checkpoint    learning_rate
-------  -----------  ------------  ------  ------------  -------------
      1       0.1177        0.9514  4.0282
      2       0.1077        0.8544  4.0082  x
      3       0.1107        0.7514  4.1202
      4       0.1177        0.9514  4.0282                0.01 → 0.001
```
Looks ok, though if there are more scores, the table may get too wide.
> Do I understand right that you propose to bypass writing history?
No, I meant via the history: if sink is a string, write to the history column with that name.
> You mean something like this?
Yes, like this. Of course, the table could become wider (but maybe not: when all the events happen at the same time, the width would be similar).
I like it better because I can do things like `[do_a() if cp else do_b() for cp in net.history[:, 'checkpoint']]`, which would be much more difficult when everything is in one column.
Regarding the column width, one has to pay attention to have short column names and values. If space becomes an issue, we have the `tablefmt` and `floatfmt` arguments on `PrintLog` to help with that.
> Regarding the column width, one has to pay attention to have short column names and values.
I see a small problem here then. Your proposed "history sink" has different expectations compared to other sinks, which don't care much about the message length. For example, what should `Checkpoint` print? With the "history sink" perhaps a single `True` bool is preferable. However, with the logger sink this should rather be a complete sentence, "Checkpoint! Saving model to xyz." (like it is now).
While I like the ideas that are discussed I think we should re-structure the problem a bit. Currently I see two separate ideas that both solve a problem but are more or less independent.
First, there is the problem that certain callbacks break the `PrintLog` table. The solution for this would be to introduce a `sink` parameter that enables the user to divert the output to somewhere else, e.g.

```python
net.set_params(callbacks__Checkpoint__sink=skorch.utils.noop)
```
With this in place we can do lots of things (e.g. modifying `PrintLog` to accept aux. information that is printed alongside the table).
Second, there is the problem of having access to raw data, e.g. when checkpointing happened or which learning rate was set by the LR scheduler. One idea to address this is to write the data to the history when a certain event happens (e.g. writing `True` to the history when a checkpoint is written). This is complementary to the message shown to the user. Since `PrintLog` prints most epoch columns by default, we should also think about prefixing those columns, e.g. with `_`, and ignoring them by default. In that case `Checkpoint` would call `net.history.record('_checkpoint', True)` if a checkpoint is written.
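For illustration, the underscore convention boils down to a one-line filter on the side of the printer. This is a sketch of the proposed convention, not skorch's actual filtering logic:

```python
def printable_keys(epoch):
    """Keys a print log would show: skip underscore-prefixed 'private' columns.
    (Sketch of the proposed convention, not skorch's actual implementation.)"""
    return [k for k in epoch if not k.startswith('_')]

epoch = {'train_loss': 0.95, 'valid_loss': 0.12, '_checkpoint': True}
shown = printable_keys(epoch)  # '_checkpoint' stays accessible in the history
```

The raw value is still there for anyone reading the history directly; only the default printing skips it.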
OK, let's tackle this step-by-step. I'll make a PR for the sink part first. A few questions:

1. Do you want me to implement it in the same way as in `PrintLog`, with support for `net.verbose`?
2. `print` as a default? I suppose `print` is okay now, but in the future using a "PrintLog auxiliary column sink" would be a better default.
As for the second part,

> we should also think to prefix those columns, e.g. with _, and ignore them by default
There is a mechanism already in place to ignore columns in `PrintLog`, why add a second one? Besides, whether a column is to be ignored or not is only a concern of `PrintLog`; other users of the history should not care about this detail. By introducing the `_` prefix we force them to care: "Ok, I need to access the `checkpoint` column in history. Hm... key not found? Ah, of course, this column is ignored by `PrintLog`, so the name is actually `_checkpoint`!" I don't think this is how it should be.
> do you want me to implement it in the same way as in PrintLog, with support for net.verbose?
I think consistency should be desired here. If there is duplicate code, it can be moved to `utils.py`.
> There is a mechanism already in place to ignore columns in PrintLog, why add a second one?
You are probably alluding to the `keys_ignored` parameter. The advantage of @ottonemo's proposal is that you can control whether the output should be printed directly when you add it to the history. So say you have a callback that writes something to history that you generally don't want to print. If another user uses it, they need to adjust their `PrintLog` each time. Also, prefixing with `_` comes natural to a Python user and immediately makes it clear that it is "private".
> Ok, I need to access the checkpoint column in history. Hm... key not found?
If a user wants to access a value from history that is not printed, they probably need to look it up anyway. As long as it's well documented that a key "_checkpoint" is added, there should be no problem.
I see the issue that we are currently designing around `PrintLog` even though it is just another callback, and I agree with @taketwo that this is probably a bad idea. If we introduce a pattern, such as a naming scheme for history values, then it should have a reason beyond 'but it helps us work around this issue in `PrintLog`'.
As far as I can see, we are talking about logging sparse events. This is different to values such as `train_loss` that are continuously logged. Maybe this is a feature that can be reflected in the name and, conveniently, filtered by `PrintLog` as well.
What about `event_new_lr`, `event_checkpoint`, etc.?
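For illustration, such a naming scheme would make it trivial for a print log to detect the sparse event columns and group them, e.g. to push them past the regular score columns. This is a sketch of the idea, not actual skorch code:

```python
def order_columns(keys):
    """Regular columns first, event_ columns last (sketch of the convention)."""
    regular = [k for k in keys if not k.startswith('event_')]
    events = [k for k in keys if k.startswith('event_')]
    return regular + events

keys = ['epoch', 'event_checkpoint', 'train_loss', 'event_new_lr', 'dur']
ordered = order_columns(keys)
```

The same `startswith('event_')` test could drive any other special treatment (symbols, coloring, filtering).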
> Also, prefixing with _ comes natural to a Python user and immediately makes it clear that it is "private".
I agree, most Python developers should be able to guess the intent of `_`-prefixed columns even without consulting documentation. But if we adopt this convention, shouldn't `batches` be renamed to `_batches`?
> As far as I can see, we are talking about logging sparse events. Maybe this is a feature that can be reflected in the name... What about event_new_lr, event_checkpoint, etc.?
This looks reasonable. I think it's similar in spirit to what I proposed before (sorry for the self-citation):
> We may add "events" list to the epoch dict in the history. Then there will be `history.record_event(desc)`
Now that the configurable sink PR is out of the way, let me summarize the remaining ideas we had:

1. `print_log_suffix` sink: messages go directly to the log table
2. a single "events" history column
3. `event_` history columns

I personally like numbers 1 and 3. Number 1 is useful for printing arbitrary user-friendly messages without screwing up the log table layout. This does not touch history. Number 3 is a structured approach to dealing with sparse events. It allows to record (and later access) information about such events in the history. Further, it allows printing separate columns for each event type in the output log. Note that here I mean very succinct information, like a True/False flag for checkpoint or a single number for a new learning rate. These are not messages.

Both 1 and 3 can have overlapping use-cases, but ultimately they solve different problems, so both can be added to skorch.
I've implemented 1 and 3 and have been using them for the last week. Here is an example of a learning session:
First, the events. I have one regular event "canon" (happening every second epoch in this example) and occasional checkpointing "cp". The callbacks are writing to `event_canon` and `event_cp` respectively, and the logger does two things: a) it outputs "+" for `True` and nothing for `False`; b) it puts the `event_` columns to the right.

Now stepping a little back, I'm not sure how important b) is. Do the events need to go to the right? Why not let them mix with the other columns? If so, then the whole convention of having "event_" prefixes is not needed. The only needed modification for the logger then is a).

Second, messages. Looks reasonable to me. In the shown example, in principle, this message could instead be an event. However, it actually happens only a single time during a learning session, so a one-time message feels more appropriate.
What are your thoughts on this?
Could you show what a user would need to do to get this output (i.e. the user code, not the actual implementation).
Regarding the order of columns, it certainly gets more complicated to implement if we want it. On the other hand, what I don't like is if my scores are interspersed with other columns. E.g. I would strongly dislike this:

```
epoch    train_loss    valid_loss    auc_score    event_x    f1_score
```
Regarding messages, if they get too long, there would be a line break, disturbing the "neat" look and thus defeating the initial purpose, so I would probably try to avoid them. But overall, I don't have a strong opinion on this -- if it doesn't make things too complicated for the user, I would be fine.
The message was produced with the following code inside one of the callbacks:

```python
dict(net.callbacks_)["print_log"].record_epoch_message("Enabled hinge loss")
```
The events in the `Checkpoint` callback are created like this:

```python
net.history[-1]['event_cp'] = do_checkpoint
```
> what I don't like is if my scores are interspersed with other columns
OK, then we should go with the "event_" prefix way.
> Regarding the order of columns, it certainly gets more complicated to implement if we want it.
It's more complicated, but still it's less than 10 lines of diff in my implementation.
> Regarding messages, if they get too long, there would be a line break, disturbing the "neat" look and thus defeating the initial purpose, so I would probably try to avoid them.
Sure, the user (and us, if we choose to send something to the message sink by default) should be reasonable about the length of the message.
Thanks, this seems to be reasonable. I would propose you put your code into a PR. It could just take some time to be merged because I would wait for @ottonemo's opinion, who currently might not have time.
OK, I will start with the events. Let's confirm a few details before I send a PR.

1. `event_` columns are placed to the right, after `dur`?
2. the `event_` prefix is stripped from the column header to make it more compact
3. `Checkpoint` is changed to:
   a. write events to `event_cp`
   b. send messages to the `noop` sink by default
4. `True` events are printed as "+", `False` as nothing

Regarding number 4, we can also think about cycling through different symbols ("+", "*", "x") when there are multiple event columns, the same way different colors are used for metric columns. But this can be a future PR.
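For illustration, the header stripping and the "+" rendering could look like the following sketch (simplified helpers under these assumptions, not the actual PR code):

```python
def format_event_cell(value):
    """Render an event value: '+' for True, '' for False/absent, else str()."""
    if value is True:
        return '+'
    if value in (False, None):
        return ''
    return str(value)

def header(key):
    """Strip the event_ prefix from column headers to keep the table compact."""
    return key[len('event_'):] if key.startswith('event_') else key

keys = ['epoch', 'dur', 'event_cp', 'event_new_lr']
row = {'epoch': 2, 'dur': 4.01, 'event_cp': True, 'event_new_lr': 0.001}
headers = [header(k) for k in keys]
cells = [format_event_cell(row[k]) if k.startswith('event_') else str(row[k])
         for k in keys]
```

Non-boolean event values (like a new learning rate) pass through unchanged, so flag events and value events can share one code path.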
That sounds reasonable for a starting point.
One other addition mentioned early in this thread is checkpointing history alongside model parameters. I suppose this should be configurable, both in whether the history should be saved or not, and to which file. This means two more arguments for the `Checkpoint` constructor. What's your opinion?
Could this not be done with one parameter? It could be something like `history_target` that is `None` by default and, if not, behaves like `target`, just for `save_history`.
That would mean there is no default filename for history. Which is totally okay I guess.
Do we want to rename the `target` parameter to `params_target` for consistency? Such that `params_target` controls where `net.save_params` is writing, and `history_target` controls where `net.save_history` is writing.
I'm not too happy about the choice of `target` as a parameter name; maybe we should find a better name and use it consistently across skorch.
Taking a step back, I wonder whether we should add the possibility for `Checkpoint` to pickle the whole model instead of using `save_params`. That way, the history would be saved too. Of course, we could have both, pickling the whole model and saving the history alongside the parameters; I just wonder, from the point of view of a user, whether both are needed.
There are tradeoffs between pickling and the `save_` functions. Personally, I use `save_` because this allows me to tweak the code before resuming training. Thus I'd certainly be unhappy if `Checkpoint` only supported pickling.
> Thus I'd certainly be unhappy if Checkpoint only supported pickling.
No, I meant whether it makes sense to add an option to pickle the whole model instead of only the parameters.
@ottonemo can we have your opinion here regarding the last 6 messages?
Let's continue the discussion regarding history saving in #289.
This issue is now resolved since #274 is merged.
For the record, we had one more idea in this thread, "epoch messages". I have it implemented in my fork, however so far I did not find much use for it. I'll open a new issue/PR if my experience changes.
Currently, the `Checkpoint` callback prints a message to the terminal every time the model is saved. These messages get mixed with the `PrintLog` output, breaking the neat progress log table. One can set `net.verbosity=False`, but this disables output from both callbacks.

What's your opinion on adding a constructor argument to `Checkpoint` to control its verbosity?

One other idea I had is to let `Checkpoint` write whether the model was saved or not to the network history (say, under a 'checkpoint' key). `PrintLog` will then have a separate column indicating that. Even fancier, it can analyze these keys (the same way it automatically analyzes '_best' keys) and highlight the epoch number (with bold or some color) if a checkpoint occurred.

Edit: while we are at this, what about adding an option to save the history? Useful for resuming training later.