egpbos opened this issue 3 years ago
The label encoder can easily be generated from a dataset (e.g. dataset.Flickr8KData) using the method init_vocabulary(). It will then be available from the global variable tokenizer. ⚠️ The vocabulary should be initialized on the training set.
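For reference, a minimal usage sketch; the class, method, and global names come from the description above, but the module path, constructor arguments, and the exact signature of init_vocabulary() are assumptions, not the real API:

    # Hypothetical sketch -- constructor arguments and signatures are guesses.
    from platalea import dataset

    # The vocabulary must be built from the training split only.
    train_set = dataset.Flickr8KData(split='train')
    train_set.init_vocabulary()

    # The label encoder is then available as a module-level global.
    print(dataset.tokenizer)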
Originally, I was creating the LabelEncoder every time a model was trained. The catch is that if you use a pipeline composed of two pretrained models, you have no guarantee (as far as I know) that the LabelEncoders will use the same index for the same label. The simplest solution seemed to be to generate the LabelEncoder once and for all.
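To illustrate the mismatch: two LabelEncoders fitted on different vocabularies can assign different indices to the same label, because scikit-learn sorts whatever classes it sees:

    from sklearn.preprocessing import LabelEncoder

    # Two encoders fitted independently, e.g. by two separately pretrained models.
    le_a = LabelEncoder().fit(["cat", "dog", "horse"])
    le_b = LabelEncoder().fit(["bird", "cat", "dog"])

    print(le_a.transform(["dog"]))  # [1] -- classes_ is ['cat', 'dog', 'horse']
    print(le_b.transform(["dog"]))  # [2] -- classes_ is ['bird', 'cat', 'dog']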
It would be great if the LabelEncoder could be generated automatically, but now that I think about it, there are some issues we need to consider:
1. the pickled encoder depends on the scikit-learn version that created it, so an update of scikit-learn can break it (or at least trigger a warning); should it then be regenerated?
2. where should the generated file(s) be stored, now that we support several datasets and languages?
Regarding 1., I think the best option is to generate it once and for all. If scikit-learn is updated, I would leave it to the user to decide what to do (keep the old file or regenerate it, with the risk that regenerating breaks previously trained models).
For 2., it might be better to store the files in a language- and dataset-specific way (e.g. under data/datasets/<dataset_name>/label_encoder_<language>.pkl).
We had a discussion on this during our last meeting. I believe the nicest solution is to leave the warning in place, as there isn't an obvious way to solve this. It is best to let the user decide what to do based on their specific circumstances.
To make updating the file more straightforward, though, I suggest modifying the code to generate the file automatically when it doesn't exist (see the sketch below). If someone wants to update the file, all they need to do is delete the old one, and a new one will be re-created the first time a model is trained. As we now support more than one dataset, I also suggest we make the file(s) dataset-dependent (stored under <root>/data/datasets/<dataset_name>). This way, we can also handle dataset specificities (e.g. flickr8k having different languages or spokencoco having different split schemes).
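A minimal sketch of that load-or-create logic, assuming the path layout proposed above; build_label_encoder is a hypothetical stand-in for whatever code currently fits the encoder on the training set:

    import pickle
    from pathlib import Path

    def get_label_encoder(root, dataset_name, build_label_encoder):
        """Load the pickled label encoder for a dataset, creating it on first use."""
        path = Path(root) / 'data' / 'datasets' / dataset_name / 'label_encoders.pkl'
        if path.exists():
            with open(path, 'rb') as f:
                return pickle.load(f)
        # First use: fit a fresh encoder and cache it for later runs.
        encoder = build_label_encoder(dataset_name)
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, 'wb') as f:
            pickle.dump(encoder, f)
        return encoder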
Something that would also be nice to have, but would require more effort, is to make loading textual captions optional. For most experiments, we actually need neither the captions nor the label encoder. I am not sure how simple that would be to implement, though.
@egpbos any thoughts on this?
> To make updating the file more straightforward, though, I suggest modifying the code to generate the file automatically when it doesn't exist. If someone wants to update the file, all they need to do is delete the old one, and a new one will be re-created the first time a model is trained. As we now support more than one dataset, I also suggest we make the file(s) dataset-dependent (stored under <root>/data/datasets/<dataset_name>). This way, we can also handle dataset specificities (e.g. flickr8k having different languages or spokencoco having different split schemes).
Sounds good to me!
> Something that would also be nice to have, but would require more effort, is to make loading textual captions optional. For most experiments, we actually need neither the captions nor the label encoder. I am not sure how simple that would be to implement, though.
The first thing I would think of is to lazily initialize these things, i.e. initialize them on first use. Maybe we can do this most elegantly by turning them into properties. For the label encoder, that would look something like:
    class DataThing:
        ...

        @property
        def le(self):  # is it still called `le`?
            try:
                return self._le
            except AttributeError:
                # _le was not yet initialized, let's do that first
                self._le = create_label_encoder_somehow()
                return self._le

        ...
Then, when the class calls for self.le, it will just get the internal _le once it has been initialized the first time. The advantage of this is that when the label encoder is not used in a class, it is simply never initialized, and when it is used, it is initialized automatically. To the user it doesn't matter, because they just call data_thing.le or they don't. So the other advantage is that you don't have to change the calling code.
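As an aside: if we can rely on Python ≥ 3.8, functools.cached_property gives the same lazy behaviour with less boilerplate (create_label_encoder_somehow is still the placeholder from the sketch above):

    from functools import cached_property

    class DataThing:
        @cached_property
        def le(self):
            # Computed on first access, then stored on the instance.
            return create_label_encoder_somehow()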
I think the code for reproducing the label encoders may be missing from the repository. Perhaps we discussed this before, but I've forgotten why that is.
I ran into this pickled-LabelEncoder version-mismatch warning again, because the pinned scikit-learn version (0.22.1) wouldn't install on a newer Python version in a new environment I had created. It's also a bit annoying, maintenance-wise, to have to keep the scikit-learn dependency pinned; just using the newest available version is typically more convenient. However, if we don't pin, we keep getting that annoying warning.
How much time does it cost to generate a label_encoders.pkl file? And what data is necessary for it?
If it is not too costly, then an alternative to shipping it with the package would be to just generate it on first usage. That would also take care of the warning automatically, except when people update scikit-learn.
To make things totally airtight, we could have the encoders regenerated in that case as well: set the warning to be raised as an exception, catch it, and then regenerate.
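A sketch of how that could look; in the scikit-learn versions we have used, the version-mismatch warning is a plain UserWarning (recent releases use InconsistentVersionWarning, which subclasses UserWarning, so catching UserWarning should cover both). build_label_encoder is again a hypothetical stand-in:

    import pickle
    import warnings

    def load_or_regenerate(path, build_label_encoder):
        with warnings.catch_warnings():
            # Escalate the version-mismatch warning to an exception.
            warnings.simplefilter('error', UserWarning)
            try:
                with open(path, 'rb') as f:
                    return pickle.load(f)
            except UserWarning:
                pass  # mismatch detected: fall through and regenerate
        encoder = build_label_encoder()
        with open(path, 'wb') as f:
            pickle.dump(encoder, f)
        return encoder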
Overengineering? Probably :D Not too much effort, though. At least, if the label_encoders generation process is not too complicated.