keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.
Other
1.02k stars 444 forks source link

Tokenizer constructor: Move document_count to **kwargs #313

Closed soroushj closed 3 years ago

soroushj commented 3 years ago

Summary

This PR moves the document_count constructor argument of the Tokenizer class to **kwargs.

Quoting @fchollet:

I don't think the document_count argument serves a purpose as a user-settable constructor argument. There is no use case for changing the value by hand.

However, because we use it in TF-IDF computation (and only there) it is part of the state of the Tokenizer and it should be saved/loaded during serialization. Removing the argument entirely isn't correct for use cases when ones tries to use TF-IDF after loading a saved Tokenizer.

So I think we should move this arg to **kwargs like we do for some private layer arguments. That way, it isn't exposed to users, it isn't documented, but it's still saved as part of a serialized Tokenizer.

Background: In #106, I removed the document_count argument (if supplied, will be ignored with a warning). Also, I modified the implementation of the tokenizer_from_json method to set the document_count attribute directly on the tokenizer object, so that the state of the tokenizer would be preserved after saving and loading. But as mentioned above, @fchollet preferred moving the document_count argument to **kwargs instead of removing it. So I created this PR. Now we have two PRs that address the same issue: #106 and this one. We can merge one of them and close the other one.

Related Issues

Fixes #105.

PR Overview