I don't think the document_count argument serves a purpose as a user-settable constructor argument. There is no use case for changing the value by hand.
However, because we use it in TF-IDF computation (and only there) it is part of the state of the Tokenizer and it should be saved/loaded during serialization. Removing the argument entirely isn't correct for use cases when ones tries to use TF-IDF after loading a saved Tokenizer.
So I think we should move this arg to **kwargs like we do for some private layer arguments. That way, it isn't exposed to users, it isn't documented, but it's still saved as part of a serialized Tokenizer.
Background: In #106, I removed the document_count argument (if supplied, will be ignored with a warning). Also, I modified the implementation of the tokenizer_from_json method to set the document_count attribute directly on the tokenizer object, so that the state of the tokenizer would be preserved after saving and loading. But as mentioned above, @fchollet preferred moving the document_count argument to **kwargs instead of removing it. So I created this PR. Now we have two PRs that address the same issue: #106 and this one. We can merge one of them and close the other one.
Related Issues
Fixes #105.
PR Overview
[n] This PR requires new unit tests [y/n] (make sure tests are included)
[n] This PR requires to update the documentation [y/n] (make sure the docs are up-to-date)
[y] This PR is backwards compatible [y/n]
[n] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)
Summary
This PR moves the
document_count
constructor argument of theTokenizer
class to**kwargs
.Quoting @fchollet:
Background: In #106, I removed the
document_count
argument (if supplied, will be ignored with a warning). Also, I modified the implementation of thetokenizer_from_json
method to set thedocument_count
attribute directly on thetokenizer
object, so that the state of the tokenizer would be preserved after saving and loading. But as mentioned above, @fchollet preferred moving thedocument_count
argument to**kwargs
instead of removing it. So I created this PR. Now we have two PRs that address the same issue: #106 and this one. We can merge one of them and close the other one.Related Issues
Fixes #105.
PR Overview