skorch-dev / skorch

A scikit-learn compatible neural network library that wraps PyTorch
BSD 3-Clause "New" or "Revised" License
5.86k stars 390 forks source link

Checkpoint callback verbosity #246

Closed taketwo closed 6 years ago

taketwo commented 6 years ago

Currently, the Checkpoint callback prints a message to terminal every time the model is saved. These messages get mixed with the PrintLog output, breaking the neat progress log table. One can set net.verbosity=False, but this disables output from both callbacks.

What's your opinion on adding a constructor argument to Checkpoint to control its verbosity?

One other idea I had is to let Checkpoint write whether the model was saved or not to the network history (say, under 'checkpoint' key). PrintLog will then have a separate column indicating that. Even fancier, it can analyze these keys (the same way it automatically analyzes '_best' keys) and highlight epoch number (with bold or some color) if checkpoint occurred.

Edit: while we are at this, what about adding an option to save the history? Useful for resuming training later.

benjamin-work commented 6 years ago

Yes, there should definitely be a way to control this.

What's your opinion on adding a constructor argument to Checkpoint to control its verbosity?

We've had our discussions about how to control verbosity in the past. I don't like this suggestion because now we suddenly have verbosity parameters at several places. Who is really in control?

One possibility (which would be consistent with sklearn) is to have several verbosity levels and changing Checkpoint so that it only prints if verbosity >= 2. This solution is, however, very opaque. What if only want the Checkpoint message but not PrintLog?

Another solution is to have Checkpoint have a sink argument that controls where the print message goes to. By default, we would have sink=print, but the user may change it to sink=logger.info or sink=noop (which would suppress the messages).

This would be an indirect way to address the original concern, but it has some advantages. Most notably, it gives the user control about what should happen to the message, but it also makes testing easier (no more capsys necessary).

Of course, if we made that change, it should affect all instances where we currently use print.

Edit:

Edit: while we are at this, what about adding an option to save the history? Useful for resuming training later.

See #230

taketwo commented 6 years ago

You second solution with sink sounds reasonable. Is this something you want/have time to implement yourself, or should I give it a try?

Regarding history saving, I am aware of that new feature. I was proposing to make use of it (by adding an option to Checkpoint to control whether it should save history as well).

benjamin-work commented 6 years ago

You second solution with sink sounds reasonable. Is this something you want/have time to implement yourself, or should I give it a try?

We are always happy to get some help. But first I would would for @ottonemo's input regarding this idea.

Regarding history saving, I am aware of that new feature. I was proposing to make use of it (by adding an option to Checkpoint to control whether it should save history as well).

I see, this is probably a sensible addition.

ottonemo commented 6 years ago

I kinda like the idea of storing a checkpoint flag in the history and letting PrintLog handle the printing because that would be a reasonable solution for all callbacks that need to indicate whenever something extraordinary was done. The obvious drawback is that it is limited to flags if one wants to keep the solution simple and that it can get confusing quite fast when using learn rate schedulers and checkpointing at the same time (which symbol / markup represents what).

I'm fine with a sink parameter. Maybe we can provide a sink that takes the PringLog instance from the net and calls record_epoch_suffix(msg) with the message on it. This function would then append msg to the usual print log output for that epoch (similar to how extra information in TQDM is handled).

Something like this:


class PrintLog(Callback):
    # ...
    def record_epoch_suffix(self, msg):
        self.suffix_data.append(msg)
   # ...
   def on_epoch_end(...):
       # ....
       self._sink(tabulated.rsplit('\n', 1)[-1] + ' '.join(self.suffix_data), verbose)
       self.suffix_data = []

def print_log_suffix(net, msg):
    dict(net.callbacks_)['PrintLog'].record_epoch_suffix(msg)

Net(callbacks=[Checkpointing(sink=print_log_suffix)])
taketwo commented 6 years ago

I kinda like the idea of storing a checkpoint flag in the history and letting PrintLog handle the printing because that would be a reasonable solution for all callbacks that need to indicate whenever something extraordinary was done. The obvious drawback is that it is limited to flags if one wants to keep the solution simple and that it can get confusing quite fast when using learn rate schedulers and checkpointing at the same time (which symbol / markup represents what).

Oh, I did not think that far! Sounds really useful. But I don't see why this should be limited to flags only. We may add "events" list to the epoch dict in the history. Then there will be History.record_event(desc). Checkpoint would call it with a simple "checkpoint", and a learn rate scheduler may record something like "LR 0.01 → 0.001". And in the terminal we will see the following:

  epoch    test_loss    train_loss     dur    events
-------  -----------  ------------  ------  ---------
      1       0.1177        0.9514  4.0282
      2       0.1077        0.8544  4.0082   checkpoint
      3       0.1107        0.7514  4.1202 
      4       0.1177        0.9514  4.0282   LR 0.01 → 0.001
BenjaminBossan commented 6 years ago

Net(callbacks=[Checkpointing(sink=print_log_suffix)])

We could even think about a syntax like

Net(callbacks=[Checkpointing(sink='foo')])

where a string as sink would write to the corresponding column in the PrintLog (which assumes there is a one).

We may add "events" list to the epoch dict in the history

I could see this getting confusing when there are many events during an epoch, and would prefer one column per item. This also makes the history easier to parse.

taketwo commented 6 years ago

We could even think about a syntax like Net(callbacks=[Checkpointing(sink='foo')]) where a string as sink would write to the corresponding column in the PrintLog (which assumes there is a one)

Do I understand right that you propose to bypass writing history?

I could see this getting confusing when there are many events during an epoch, and would prefer one column per item

You mean something like this?

  epoch    test_loss    train_loss     dur    checkpoint   learning_rate
-------  -----------  ------------  ------  -------------  -------------
      1       0.1177        0.9514  4.0282
      2       0.1077        0.8544  4.0082             x
      3       0.1107        0.7514  4.1202 
      4       0.1177        0.9514  4.0282                  0.01 → 0.001

Looks ok, though if there are more scores, the table may get too wide.

benjamin-work commented 6 years ago

Do I understand right that you propose to bypass writing history?

No, I meant via the history: if sink is a string, write to the history column with that name.

You mean something like this?

Yes, like this. Of course, the table could become wider (but maybe not, when all the events happen at the same time, width would be similar).

I like it better because I can do things like

[do_a() if cp else do_b() for cp in net.history[:, 'checkpoint']

which would be much more difficult when everything is in one column.

Regarding the column width, one has to pay attention to have short column names and values. If space becomes an issue, we have the tablefmt and floatfmt arguments on PrintLog to help with that.

taketwo commented 6 years ago

Regarding the column width, one has to pay attention to have short column names and values.

I see a small problem here then. Your proposed "history sink" has different expectations compared to other sinks who don't care much about the message length. For example, what should Checkpoint print? With "history sink" perhaps a single True bool is preferable. However, with the logger sink this should rather be a complete sentence "Checkpoint! Saving model to xyz." (like it is now).

ottonemo commented 6 years ago

While I like the ideas that are discussed I think we should re-structure the problem a bit. Currently I see two separate ideas that both solve a problem but are more or less independent.

First, there is the problem that certain callbacks break the PrintLog table. The solution for this would be to introduce a sink parameter that enables the user to divert the output to somewhere else, e.g.

net.set_params(callbacks__Checkpoint__sink=skorch.utils.noop)

With this in place we can do lots of things (e.g. modifying PrintLog to accept aux. information that is printed alongside the table).

Second, there is the problem of having access to raw data, e.g. when checkpointing happened or which learning rate was set by the LR scheduler. One idea to address this is to write the data to the history when a certain event happens (e.g. writing True to the history when a checkpoint is written). This is complementary to the message shown to the user. Since PrintLog prints most epoch columns by default we should also think to prefix those columns, e.g. with _, and ignore them by default. In that case Checkpoint would call net.history.record('_checkpoint', True) if a checkpoint is written.

taketwo commented 6 years ago

OK, let's tackle this step-by-step. I'll make a PR for the sink part first. Few questions:

I suppose print is okay now, but in future using "PrintLog auxiliary column sink" would be a better default.

As for the second part,

we should also think to prefix those columns, e.g. with _, and ignore them by default

There is a mechanism already in place to ignore columns in PrintLog, why adding a second one? Besides, whether a column is to be ignored or not is only of a concern for PrintLog, other users of the history should not care about this detail. By introducing _ prefix we force them to care:

"Ok, I need to access the checkpoint column in history. Hm... key not found? Ah, of course, this column is ignored by PrintLog, so the name is actually _checkpoint"!

I don't think this is how it should be.

benjamin-work commented 6 years ago

do you want me to implement it in the same way as in PrintLog, with support for net.verbose?

I think consistency should be desired here. If there is duplicate code, it can be moved to utils.py.

There is a mechanism already in place to ignore columns in PrintLog, why adding a second one?

You are probably alluding to the keys_ignored parameter. The advantage of @ottonemo's proposal is that you can control whether the output should be printed directly when you add it to the history. So say you have a callback that writes something to history that you generally don't want to print. If another user uses it, they need to adjust their PrintLog each time. Also, prefixing with _ comes natural to a Python user and immediately makes it clear that it is "private".

Ok, I need to access the checkpoint column in history. Hm... key not found?

If a user wants to access a value from history that is not printed, they probably need to look it up anyway. As long as it's well documented that a key "_checkpoint" is added, there should be no problem.

ottonemo commented 6 years ago

I see the issue that we are currently designing around PrintLog even though it is just another callback and agree with @taketwo that this is probably a bad idea. If we introduce a pattern, such as a naming scheme for history values, then it should have a reason beyond 'but it helps us work around this issue in PrintLog'.

As far as I can see, we are talking about logging sparse events. This is different to values such as train_loss that are continuously logged. Maybe this is a feature that can be reflected in the name and, conveniently, filtered by PrintLog as well.

What about event_new_lr, event_checkpoint, etc.?

taketwo commented 6 years ago

Also, prefixing with _ comes natural to a Python user and immediately makes it clear that it is "private".

I agree, most Python developers should be able to guess the intent of _ prefixed columns even without consulting documentation. But if you adopt this convention, then batches should be renamed to _batches?

As far as I can see, we are talking about logging sparse events. Maybe this is a feature that can be reflected in the name... What about event_new_lr, event_checkpoint, etc.?

This look reasonable. I think it's similar in spirit to what I proposed before (sorry for self-citation):

We may add "events" list to the epoch dict in the history. Then there will be history.record_event(desc)

taketwo commented 6 years ago

Now that the configurable sink PR is out of the way, let me summarize the remaining ideas we had:

  1. print_log_suffix sink, messages go directly to the log table
  2. history sink, messages go to the history column with a given name
  3. writing sparse events to event_ history columns

I personally like number 1 and 3. Number 1 is useful for printing arbitrary user-friendly messages without screwing up the log table layout. This does not touch history. Number 3 is a structured approach to dealing with sparse events. It allows to record (and later access) information about such events in the history. Further, it allows to print separate columns for each event type in the output log. Note that here I mean very succinct information, like True/False flag for checkpoint or single number for new learning rate. These are not messages.

Both 1 and 3 can have overlapping use-cases, but ultimately they are solving different problems, so can be added to skorch.

taketwo commented 6 years ago

I've implemented 1 and 3 and have been using them for the last week. Here is an example of a learning session:

image

First, the events. I have one regular event "canon" (happening every second epoch in this example) and occasional checkpointing "cp". The callbacks are writing to the event_canon and event_cp respectively, and the logger does two things: a) outputs "+" for True and nothing for False; b) puts event_ columns to the right.

Now stepping a little back, I'm not sure how important b) is. Do the events need to go the the right? Why not letting them mix with the other columns? If so, then the whole convention of having "event_" prefixes is not needed. The only needed modification for the logger then is a).

Second, messages. Looks reasonable for me. In the shown example, in principle, this message can instead be an event. However, it actually happens only a single time during a learning session, so a single time message feels more appropriate.

What are your thoughts on this?

benjamin-work commented 6 years ago

Could you show what a user would need to do to get this output (i.e. the user code, not the actual implementation).

Regarding the order of columns, it certainly gets more complicated to implement if we want it. On the other hand, what I don't like is if my scores are interspersed with other columns. E.g. I would strong dislike this:

    epoch   train_loss   valid_loss   auc_score   event_x    f1_score

Regarding messages, if they get too long, there would be a line break, disturbing the "neat" look and thus defeating the initial purpose, so I would probably try to avoid them. But overall, I don't have a strong opinion on this -- if it doesn't make things too complicated for the user, I would be fine.

taketwo commented 6 years ago

The message was produced with the following code inside one of the callbacks:

dict(net.callbacks_)["print_log"].record_epoch_message("Enabled hinge loss")

The events in the Checkpoint callback are created like this:

net.history[-1]['event_cp'] = do_checkpoint

what I don't like is if my scores are interspersed with other columns

OK, then we should go with the "event_" prefix way.

Regarding the order of columns, it certainly gets more complicated to implement if we want it.

It's more complicated, but still it's less than 10 lines of diff in my implementation.

Regarding messages, if they get too long, there would be a line break, disturbing the "neat" look and thus defeating the initial purpose, so I would probably try to avoid them.

Sure, the user (and us, if we choose to send something to the message sink by default) should be reasonable about the length of the message.

benjamin-work commented 6 years ago

Thanks, this seems to be reasonable. I would propose you put your code into a PR. It could just take some time to be merged because I would wait for @ottonemo's opinion, who currently might not have time.

taketwo commented 6 years ago

OK, I will start with the events. Let's confirm a few details before I send a PR.

  1. We adopt the convention that event entries in history are prefixed with event_
  2. The event columns are grouped together and put to the right. Should they go before or after dur?
  3. The event_ prefix is stripped from the column header to make it more compact
  4. Events are represented by "+" symbol
  5. We update Checkpoint to a. write events to event_cp b. send messages to noop sink by default

Regarding number 4, we can also think about cycling through different symbols ("+", "*", "x") when there are multiple event columns, the same way different colors are used for metric columns. But this can be a future PR.

benjamin-work commented 6 years ago

That sounds reasonable for a starting point.

taketwo commented 6 years ago

One other addition mentioned early in this thread is checkpointing history alongside model parameters. I suppose this should be configurable, both in whether the history should be saved or not, and to which file. This means two more arguments for the Checkpoint constructor. What's you opinion?

benjamin-work commented 6 years ago

Could this not be done with one parameter? It could be something like history_target that is None by default and if not behaves like target, just for save_history.

taketwo commented 6 years ago

That would mean there is no default filename for history. Which is totally okay I guess.

Do we want to rename target parameter to params_target for consistency? Such that params_target controls where net.save_params is writing, and params_history controls where net.save_history is writing.

BenjaminBossan commented 6 years ago

I'm not too happy about the choice of target as parameter name, maybe we should find a better name and use it consistently across skorch.

Taking a step back, I wonder whether we should add the possibility for Checkpoint to pickle the whole model instead of using save_params. That way, the history would be saved too. Of course, we could have both, pickling the whole and saving history alongside the parameters, I just wonder from the point of view of a user whether both are needed.

taketwo commented 6 years ago

There are tradeoffs between pickling and save_ functions. Personally, I use save_ because this allows me to tweak the code before resuming training. Thus I'd certainly be unhappy if Checkpoint only supported pickling.

BenjaminBossan commented 6 years ago

Thus I'd certainly be unhappy if Checkpoint only supported pickling.

No, I meant whether it makes sense to add an option to pickle the whole model instead of only the parameters.

taketwo commented 6 years ago

@ottonemo can we have your opinion here regarding the last 6 messages?

ottonemo commented 6 years ago

Let's continue the discussion regarding history saving in #289.

ottonemo commented 6 years ago

This issue is now resolved since #274 is merged.

taketwo commented 6 years ago

For the record, we had one more idea in this thread, "epoch messages". I have it implemented in my fork, however so far I did not find much use for it. I'll open a new issue/PR if my experience changes.