Closed salty-horse closed 1 month ago
Hi, validators already uses an environment variable to decide whether or not to throw errors, so I thought (again) that we could do something like this instead:
from os import environ
from pathlib import Path


class _TLDList:
    """Cache IANA TLDs."""

    cache = set[str]()


def _load_tld_to_memory(tld_file_path: Path):
    """Load IANA TLDs to memory."""
    if not _TLDList.cache:
        with tld_file_path.open() as tld_f:
            _ = next(tld_f)  # ignore the first line
            _TLDList.cache = set(line.strip() for line in tld_f)
    return _TLDList.cache


def _iana_tld():
    """Provide IANA TLDs."""
    # source: https://data.iana.org/TLD/tlds-alpha-by-domain.txt
    tld_file_path = Path(__file__).parent.joinpath("_tld.txt")
    if environ.get("LOAD_TLD_TO_MEMORY", "False") == "True":
        # yield from, not return: this function is a generator, so a plain
        # `return <set>` would hand the caller no items at all
        yield from _load_tld_to_memory(tld_file_path)
        return
    with tld_file_path.open() as tld_f:
        _ = next(tld_f)  # ignore the first line
        for line in tld_f:
            yield line.strip()
The environment variable LOAD_TLD_TO_MEMORY is the pivot here.
Some questions:
RAISE_VALIDATION_ERROR and LOAD_TLD_TO_MEMORY are awfully vague, and a bad practice since they are used in a global setting with a lot of other unrelated environment variables. Shouldn't they include a prefix such as PYTHON_VALIDATORS_? How do you expect them to be used in a way that makes it obvious what they affect? Setting os.environ['LOAD_TLD_TO_MEMORY'] = 'True' early in the project that uses it?
... set[str].
... Install and Use page, but I think another page can be added, say Environment Variables. It is probably better to update the documentation in a follow-up PR.
... are awfully vague, and a bad practice since they are used in a global setting with a lot of other unrelated environment variables.
Very true! You can for now set the environment variable as PYVLD_LOAD_TLD_TO_MEMORY. (Is PYVLD_ a good prefix? Short ones are preferable.) Again, another PR can bring more consistency.
How do you expect them to be used in a way that makes it obvious what they affect? Setting os.environ['LOAD_TLD_TO_MEMORY'] = 'True' early in the project that uses it?
As the documentation shows, it must be set via the shell, before running your program.
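For illustration, here is a minimal sketch of how a prefixed flag like this could be read. The helper name `tld_cache_enabled` is hypothetical and not part of validators; the only thing taken from the thread is that the value is compared against the literal string "True", as in the snippet above.

```python
import os

# Hypothetical helper, not part of validators: reads the prefixed flag
# suggested in this thread.
ENV_NAME = "PYVLD_LOAD_TLD_TO_MEMORY"


def tld_cache_enabled(env=os.environ):
    """Return True only when the flag is exactly the string "True"."""
    return env.get(ENV_NAME, "False") == "True"


# Explicit mappings are passed in so the example does not depend on
# whatever happens to be set in the real environment.
print(tld_cache_enabled({}))                   # flag unset
print(tld_cache_enabled({ENV_NAME: "True"}))   # exact match
print(tld_cache_enabled({ENV_NAME: "true"}))   # different case
```

Because the comparison is an exact string match, values such as "true" or "1" would leave the cache disabled under this scheme.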
@classmethods are totally fine. You no longer need to expose any functions to load TLDs.
BTW, I wonder if it's worth hard-coding 'com', 'org', 'net' outside of the file, to catch the most common TLDs quickly.
Pushed a commit with changes, including my hard-coded list to cover the common TLDs before trying the file.
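The lookup order described here (hard-coded common TLDs first, then the cached set or the file) could be sketched roughly as follows. This is not the code from the pushed commit: the class name `TldLookup`, its attributes, the `use_cache` parameter, and the temporary stand-in for `_tld.txt` are all assumptions made for illustration. It also assumes upper-case entries, as in the IANA list.

```python
import tempfile
from pathlib import Path


class TldLookup:
    """Hypothetical sketch: check common TLDs before touching the file."""

    # Hard-coded fast path; the thread suggests 'com', 'org', 'net'.
    _common = {"COM", "ORG", "NET"}
    _cache = None  # filled lazily, only when caching is requested

    @classmethod
    def _read(cls, path):
        """Yield TLDs from the file, skipping the header line."""
        with Path(path).open() as tld_f:
            next(tld_f, None)  # ignore the first line
            for line in tld_f:
                yield line.strip()

    @classmethod
    def check(cls, tld, path, use_cache=False):
        """Return True if tld is a known TLD (compared in upper case)."""
        tld = tld.upper()
        if tld in cls._common:
            return True  # no file access needed
        if not use_cache:
            return tld in cls._read(path)  # scan the file on every call
        if cls._cache is None:
            cls._cache = set(cls._read(path))  # read the file once
        return tld in cls._cache


# Stand-in for _tld.txt: a header line followed by one TLD per line.
with tempfile.TemporaryDirectory() as tmp:
    tld_file = Path(tmp) / "tlds.txt"
    tld_file.write_text("# header line\nCOM\nDEV\nIO\n")
    print(TldLookup.check("com", tld_file))                  # fast path
    print(TldLookup.check("dev", tld_file))                  # file scan
    print(TldLookup.check("io", tld_file, use_cache=True))   # cached set
```

With classmethods, callers only see `check`; loading the file stays an internal detail.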
For the documentation, I think it should be covered in or linked from the domain, hostname, and URL pages, regardless of any other place you think is important. Someone who's just reading about consider_tlds needs to understand what it's doing with/without the env file and decide whether they need it.
Should I squash all the commits, or do you want to merge it yourself?
I'll squash and merge.
Thanks for the PR!
Thank you for the help and accepting the feature!
Follow-up to the discussion in #362.
I wasn't sure what was meant by using dataclass, as this isn't a data-first class. I used a regular class instead.
One thing that's obviously missing is tests. I'm not familiar with pytest, and don't know how to re-run the existing domain tests after running the new "load" function.
Here are some basic timing results: