python-validators / validators

Python Data Validation for Humans™.
MIT License
958 stars 152 forks source link

feat: cache IANA TLDs for faster lookups #390

Closed salty-horse closed 1 month ago

salty-horse commented 2 months ago

Follow-up to the discussion in #362.

I wasn't sure what was meant by using dataclass, as this isn't a data-first class. I used a regular class, instead.

One thing that's obviously missing is tests. I'm not familiar with pytest, and don't know how to re-run the existing domain tests after running the new "load" function.

Here are some basic timing results:

$ python -m timeit -s 'import validators' 'validators.domain("example.com", consider_tld=True)'
5000 loops, best of 5: 63.5 usec per loop
$ python -m timeit -s 'import validators; validators.load_iana_tlds_to_memory()' 'validators.domain("example.com", consider_tld=True)'
50000 loops, best of 5: 4.63 usec per loop

$ python -m timeit -s 'import validators' 'validators.domain("example.blabla", consider_tld=True)'
1000 loops, best of 5: 200 usec per loop
$ python -m timeit -s 'import validators; validators.load_iana_tlds_to_memory()' 'validators.domain("example.blabla", consider_tld=True)'
20000 loops, best of 5: 16.5 usec per loop
yozachar commented 2 months ago

Hi, validators already uses an environment variable to decide whether or not to throw errors, so I thought (again that) we could do something like this instead:

class _TLDList:
    """Cache IANA TLDs."""
    cache = set[str]()

def _load_tld_to_memory(tld_file_path: Path):
    """Load IANA TLDs to memory."""
    if not _TLDList.cache:
        with tld_file_path.open() as tld_f:
            _ = next(tld_f)  # ignore the first line
            _TLDList.cache = set(line.strip() for line in tld_f)
    return _TLDList.cache

def _iana_tld():
    """Provide IANA TLDs."""
    # # source: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
    tld_file_path = Path(__file__).parent.joinpath("_tld.txt")

    if environ.get("LOAD_TLD_TO_MEMORY", "False") == "True":
        return _load_tld_to_memory(tld_file_path)

    with tld_file_path.open() as tld_f:
        _ = next(tld_f)  # ignore the first line
        for line in tld_f:
            yield line.strip()

Environment LOAD_TLD_TO_MEMORY is the pivot here.

salty-horse commented 2 months ago

Some questions:

  1. Where should the env var be documented? In the API reference for the domain validator, or in "Install and Use"?
  2. The names of both RAISE_VALIDATION_ERROR and LOAD_TLD_TO_MEMORY are awfully vague, and a bad practice since they are used in a global setting with a lot of other unrelated environment variables. Shouldn't they include a prefix such as PYTHON_VALIDATORS_? How do you expect them to be used in a way that makes it obvious what they affect? Setting os.environ['LOAD_TLD_TO_MEMORY'] = 'True' early in the project that uses it?
  3. Is the code above how you want to structure it? With private global functions instead of methods inside the private class. I think it's cleaner to open the file in one place, but I can just copy your code if that's what you want.
  4. Also about your code, I noticed that the lowest supported Python version is 3.8, which doesn't support type hinting like set[str].
  5. Still unclear how to test this with pytest by reusing the existing tests.
yozachar commented 2 months ago
  1. I was planning to put it on the Install and Use page, but I think, another page can be added say Environment Variables. It is probably better to update the documentation in a followup PR.
  2. ... are awfully vague, and a bad practice since they are used in a global setting with a lot of other unrelated environment variables.

    Very true! You can for now set the environment variable as PYVLD_LOAD_TLD_TO_MEMORY. (Is PYVLD_ a good prefix? Short ones are preferable) Again, another PR can bring more consistency.

    How do you expect them to be used in a way that makes it obvious what they affect? Setting os.environ['LOAD_TLD_TO_MEMORY'] = 'True' early in the project that uses it?

    As the documentation shows, it must be set via the shell, before running your program.

  3. You have the creative freedom, @classmethods are totally fine. You no longer need to expose any functions to load TLDs.
  4. Ah, good catch! My bad.
  5. Currently, all the tests pass 'cause TLDs aren't being considered while testing. I'll take care of writing tests for the other environment variable too.
salty-horse commented 1 month ago

BTW, I wonder if it's worth it to hardcode 'com', 'org', 'net' outside of the file, to catch the most common TLD's quickly.

salty-horse commented 1 month ago

Pushed a commit with changes, including my hard-coded list to cover the common TLDs before trying the file.

For the documentation, I think it should be covered/linked the domain, hostname, and URL pages, regardless of any other place you think is important. Someone who's just reading about consider_tlds needs to understand what it's doing with/without the env file and decide whether they need it.

salty-horse commented 1 month ago

Should I squash all the commits, or do you want to merge it yourself?

yozachar commented 1 month ago

I'll squash and merge.

yozachar commented 1 month ago

Thanks for the PR!

salty-horse commented 1 month ago

Thank you for the help and accepting the feature!