ko-zu / psl

publicsuffixlist for python
Mozilla Public License 2.0
61 stars 10 forks source link

incorrect handling of weird domain names #29

Open pspacek opened 3 months ago

pspacek commented 3 months ago

Version affected

0.10.0.20240525

Summary

DNS encoding for "weird" names is not handled.

Reproducer

from publicsuffixlist import PublicSuffixList
psl = PublicSuffixList()
print(psl.privatesuffix("www.exa\\.mple.com"))
mple.com

Result "mple.com" is wrong because \. character is not a label separator. It's an escaped dot which is part of the exa\.mple label. The correct return value thus should be exa\.mple.com.

An underlying problem

Currently this library handles DNS names as strings. This does not match DNS definition of names: DNS names are defined as sequence of labels and individual labels can contain arbitrary binary data on the wire. "Unusual" bytes are then encoded with \ escape sequences when presented in text-format.

Use-case

Processing real traffic from traffic captures. It has lots of weird names which require escaping and the current string-based processing leads to incorrect results for these weird names.

Proposed solution

Extend the current API to accept tuple of labels instead of string. In that case it's responsibility of the caller to do the right thing, and if a software is reading stuff from PCAP files it's actually easier to pass the labels instead of constructing escaped string out of it, and then having it decoded once again in publicsuffic library again.

Alternative would be to implement full decoding of DNS names, but I think it's more work and slower performance for my use-case.

References

\DDD where each D is a digit is the octet corresponding to the decimal number described by DDD. The resulting octet is assumed to be text and is not checked for special meaning.



This problem was encountered by other people and there was a proposal to integrate PSL matching into a DNS-aware library dnspython:
https://github.com/rthalley/dnspython/pull/1082

I think it would be better if we can get this improved in publicsufficlist itself. What do you think?
ko-zu commented 3 months ago

Thank you for the detailed report.

From my understanding, the main blocking issue is that there is no way to pass arbitrary bytes as labels, correct? I will implement your proposed solution of using a tuple of bytes as input. This should cover all possible DNS packets without interfering with existing string-to-string conversion.

Should we also implement a keep_case flag? Since PSL should return all labels in lowercase, tuple-to-tuple comparisons will likely fail without this flag. Please let me know your thoughts.

Regarding the weird results, I believe this is just following the PSL's definition. There is no special treatment regarding backslashes.

pspacek commented 3 months ago

Apologies for late response.

My proposal is:

But maybe I'm completely wrong and @rthalley has a better idea?