john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License

Making tldextract serializable and faster with Trie dictionary #339

Open leeprevost opened 1 month ago

leeprevost commented 1 month ago

First of all, let me say I'm a huge fan of @john-kurkowski's tldextract. I find it critical for work with the Common Crawl dataset and other projects.

I have found, quite by accident, that the package is not serializable, but I believe it could be modified quite easily to be so. Doing so might also speed up the lookup function by ~20% or so. Serializability could be important for big data projects that use Spark broadcast or other distributed processing beyond a single core.

Here is what I'm seeing:

import json
import tldextract
ext = tldextract.TLDExtract()
ext._extractor.tlds_incl_private_trie
Out[14]: <tldextract.tldextract.Trie at 0x2305deb1840>
json.dumps(ext._extractor.tlds_incl_private_trie)
Traceback (most recent call last):
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-d4e5d6e8c9ec>", line 1, in <module>
    json.dumps(ext._extractor.tlds_incl_private_trie)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Trie is not JSON serializable

Also:

import pickle
pickle.dumps(ext)
Traceback (most recent call last):
  File "C:\Users\lee\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-17-188183203f90>", line 1, in <module>
    pickle.dumps(extract)
_pickle.PicklingError: Can't pickle <function ext at 0x000002305F541480>: attribute lookup extract on __main__ failed

This seems to be because the underlying Trie is a custom class.

This could be resolved in several ways:

  1. Add a method to the Trie class that tells it how to serialize/deserialize (a bit hacky, in my opinion).
  2. Tell json or pickle how to serialize/deserialize it (again, a band-aid).
  3. Rewrite the Trie class as a standard dict (I think this is the best way, as the dict would likely be faster, by ~20%). ref(
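
For comparison, options 1 and 2 could look roughly like the sketch below. `NodeTrie`, `to_dict`, `from_dict`, and `encode_trie` are hypothetical names for illustration; this is a stand-in class, not tldextract's actual Trie.

```python
import json


class NodeTrie:
    """Hypothetical stand-in for a class-based trie node (not tldextract's class)."""

    def __init__(self):
        self.children = {}
        self.is_private = False

    def to_dict(self):
        """Option 1: the class knows how to turn itself into plain dicts."""
        return {
            "is_private": self.is_private,
            "children": {k: v.to_dict() for k, v in self.children.items()},
        }

    @classmethod
    def from_dict(cls, data):
        """Rebuild a node (and its children) from the output of to_dict()."""
        node = cls()
        node.is_private = data["is_private"]
        node.children = {k: cls.from_dict(v) for k, v in data["children"].items()}
        return node


def encode_trie(obj):
    """Option 2: a `default` hook that tells json how to handle NodeTrie."""
    if isinstance(obj, NodeTrie):
        return obj.to_dict()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


root = NodeTrie()
root.children["com"] = NodeTrie()
root.children["com"].is_private = True

text = json.dumps(root, default=encode_trie)
restored = NodeTrie.from_dict(json.loads(text))
```

Both approaches work, but every caller has to remember the hook or the conversion step, which is the "band-aid" feeling described above.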

Below is an untested way to do this that would likely require no additional changes to the private calling class.

If this is of sufficient interest, I'd be glad to provide a PR.

Updated 10/11/24

from typing import Collection


class Trie(dict):
    """Alternative trie for tldextract, built on Python's dict class."""

    # Lets calling code use dot-attribute access on the root node
    # (e.g. trie.is_private). Missing keys return None instead of raising.
    __getattr__ = dict.get

    @staticmethod
    def create(
        match_list: Collection[str],
        reverse: bool = False,
        is_private: bool = False,
    ) -> "Trie":
        """Create a Trie from a list of matches and return its root node."""
        root_node = Trie()

        for m in match_list:
            # Pass by keyword so is_private is not mistaken for reverse.
            root_node._add_match(m, reverse=reverse, is_private=is_private)

        return root_node

    def _add_match(self, match: str, reverse: bool = False, is_private: bool = False) -> None:
        """Append a suffix's labels to this Trie node."""
        labels = match.split(".")
        if reverse:
            labels.reverse()

        node = self
        for label in labels:
            # Child nodes are plain dicts, not Trie instances.
            node = node.setdefault(label, {})
        node["is_private"] = is_private

john-kurkowski commented 3 days ago

Thank you for the kind words!

I'm open to this. It sounds like not much change.

While I like dicts, I'm always wary of subclassing, especially such a common Python type. (I guess another option would be to avoid classes altogether, and pass dicts between standalone trie functions?)

/cc @elliotwutingfeng who wrote the trie

leeprevost commented 2 days ago

(I guess another option would be to avoid classes altogether, and pass dicts between standalone trie functions?)

Would be fairly straightforward. Something like this:

def make_tld_trie(matches, is_private=False):
    root = dict()
    for match in matches:
        node = root
        for label in match.split("."):
            node = node.setdefault(label, {})
        node['is_private'] = is_private
    return root

And:

matches = ['com.github', 'com.kurkowski', 'net.prevost.lee']
make_tld_trie(matches)

{'com': {'github': {'is_private': False}, 'kurkowski': {'is_private': False}},
 'net': {'prevost': {'lee': {'is_private': False}}}}
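
Because the result is built only from dicts, strings, and booleans, it should round-trip through both json and pickle with no custom hooks. A minimal check, repeating the function above so the snippet is self-contained:

```python
import json
import pickle


def make_tld_trie(matches, is_private=False):
    """Build a nested-dict trie; same logic as the function above."""
    root = {}
    for match in matches:
        node = root
        for label in match.split("."):
            node = node.setdefault(label, {})
        node["is_private"] = is_private
    return root


trie = make_tld_trie(["com.github", "com.kurkowski", "net.prevost.lee"])

# Plain dicts need no custom encoder and no special pickle support.
via_json = json.loads(json.dumps(trie))
via_pickle = pickle.loads(pickle.dumps(trie))

print(via_json == trie, via_pickle == trie)  # True True
```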

leeprevost commented 2 days ago

While I like dicts, I'm always wary of subclassing, especially such a common Python type.

The case for subclassing would be if you want built-in methods for search, adding matches, etc. Those methods would only be available at the root of the Trie, and they could help simplify subsequent use of the Trie (for example, in your calling class that uses the trie during the extract process).

Also, this does seem to be Pythonic. Python gives an example of this in its UserDict class.

The case against it is probably backward compatibility as subclassing dict is a newer Python thing.

But I could go either way on this.
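
To make the "built-in methods at the root" idea concrete, here is a rough sketch. `SuffixTrie`, `add`, and `longest_suffix` are hypothetical names, and the lookup ignores the wildcard and exception rules that the real Public Suffix List contains.

```python
class SuffixTrie(dict):
    """Hypothetical dict subclass; the name and methods are illustrative."""

    def add(self, suffix, is_private=False):
        """Insert a suffix such as 'co.uk', keyed from the rightmost label."""
        node = self
        for label in reversed(suffix.split(".")):
            node = node.setdefault(label, {})  # children are plain dicts
        node["is_private"] = is_private

    def longest_suffix(self, hostname):
        """Return the longest stored suffix that ends `hostname`, or None."""
        labels = hostname.split(".")
        node = self
        best = None
        for depth, label in enumerate(reversed(labels), start=1):
            node = node.get(label)
            if not isinstance(node, dict):
                break
            if "is_private" in node:
                best = ".".join(labels[-depth:])
        return best


trie = SuffixTrie()
trie.add("uk")
trie.add("co.uk")

print(trie.longest_suffix("www.example.co.uk"))  # co.uk
print(type(trie["uk"]).__name__)                 # dict
```

The second print shows the point about the root: only the top-level object carries the methods, while every nested node stays an ordinary dict.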

john-kurkowski commented 2 days ago

I guess my subclassing builtins worry came from posts like this SO and this blog. There are certainly nuances to subclassing. However, in this library's trie's case, it looks like we won't be tripping over the majority of gotchas, which concern juggling overridden methods, because we won't be overriding any methods.
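
For readers who have not run into those gotchas, a small illustration unrelated to the trie itself: on a dict subclass, the constructor and update() do not call an overridden `__setitem__`, whereas collections.UserDict routes them through it. The class names here are made up for the example.

```python
from collections import UserDict


class UpperKeys(dict):
    """dict subclass that tries to upper-case every key on assignment."""

    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)


class UpperKeysUD(UserDict):
    """Same idea on UserDict, whose other methods route through __setitem__."""

    def __setitem__(self, key, value):
        super().__setitem__(key.upper(), value)


d = UpperKeys(a=1)      # constructor bypasses the override
d.update(b=2)           # so does update()
d["c"] = 3              # only direct assignment uses it
print(d)                # {'a': 1, 'b': 2, 'C': 3}

u = UpperKeysUD(a=1)
u.update(b=2)
u["c"] = 3
print(dict(u))          # {'A': 1, 'B': 2, 'C': 3}
```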

john-kurkowski commented 2 days ago

If you tackle this, I have a couple requests please! 🙏

  1. Add test coverage that whatever you want to serialize is serializable.
  2. And deserializable, if that's your use case?
  3. Share benchmarks that the speed of the library is same or better.
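
A rough shape for requests 1 to 3, assuming the plain-dict trie from earlier in the thread. `lookup` is a hypothetical helper, and the timing covers only a toy trie; a meaningful benchmark would compare the current Trie and the replacement on the full suffix list through the library's normal extraction path.

```python
import json
import pickle
import timeit


def make_tld_trie(matches, is_private=False):
    """Build a nested-dict trie (same logic as earlier in the thread)."""
    root = {}
    for match in matches:
        node = root
        for label in match.split("."):
            node = node.setdefault(label, {})
        node["is_private"] = is_private
    return root


def test_trie_round_trips():
    """Requests 1 and 2: the trie serializes and deserializes unchanged."""
    trie = make_tld_trie(["com", "uk.co", "io.github"])
    assert json.loads(json.dumps(trie)) == trie
    assert pickle.loads(pickle.dumps(trie)) == trie


def lookup(trie, labels):
    """Walk the trie label by label; return the number of labels matched."""
    node, depth = trie, 0
    for label in labels:
        child = node.get(label)
        if not isinstance(child, dict):
            break
        node, depth = child, depth + 1
    return depth


test_trie_round_trips()

# Request 3: time repeated lookups (toy data, illustrative only).
trie = make_tld_trie(["com", "uk.co", "io.github"])
labels = ["uk", "co", "example", "www"]
seconds = timeit.timeit(lambda: lookup(trie, labels), number=100_000)
print(f"100,000 lookups: {seconds:.3f}s")
```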