Open leeprevost opened 1 month ago
Thank you for the kind words!
I'm open to this. It sounds like not much change.
While I like dict
s, I'm always wary of subclassing, especially such a common Python type. (I guess another option would be to avoid classes altogether, and pass dict
s between standalone trie functions?)
/cc @elliotwutingfeng who wrote the trie
(I guess another option would be to avoid classes altogether, and pass dicts between standalone trie functions?)
Would be fairly straightforward. Something like this:
def make_tld_trie(matches, is_private=False):
root = dict()
for match in matches:
node = root
for label in match.split("."):
node = node.setdefault(label, {})
node['is_private'] = is_private
return root
And:
matches = ['com.github', 'com.kurkowski', 'net.prevost.lee']
make_tld_trie(matches)
{'com': {'github': {'is_private': False}, 'kurkowski': {'is_private': False}},
'net': {'prevost': {'lee': {'is_private': False}}}}
While I like dicts, I'm always wary of subclassing, especially such a common Python type.
The case for subclassing would be if you want to have built in methods for search, adding matches, etc. Those methods would only be available at the root of the Trie. And they could help you simplify subsequent use of the Trie (for example in your calling class that uses the trie during the extract process.
Also, this does seem to be pythonic. Python gives an example of this in their UserDict class.
The case against it is probably backward compatibility as subclassing dict is a newer Python thing.
But, could go either way on this.
I guess my subclassing builtins worry came from posts like this SO and this blog. There are certainly nuances to subclassing. However, in this library's trie's case, it looks like we won't be tripping over the majority of gotchas, which concern juggling overridden methods, because we won't be overriding any methods.
If you tackle this, I have a couple requests please! 🙏
First of all, let me say I'm a huge fan of @john-kurkowski 's tldextract. I am find it to be critical in doing work with the common crawl dataset and other projects.
I have found, quite by accident, that the package is not serializable but I believe could be modified quite easily to do so. and by doing so, I think it could speed the lookup function by ~20% or so. Serializability could be important for big data projects using spark broadcast or other distributed processing beyond a single core.
here is what I'm seeing:
also:
This seems to be because the underlying Trie is a custom class.
This could be resolved in several ways:
An untested way to do this that would likely require no additional changes to the private calling class.
If this is of sufficient interest, I'd be glad to provide a PR.