Open philbudne opened 11 months ago
Thinking about the application domain, I think ports should be ignored for normalization. The IP address being trimmed seems like a bug, though perhaps not one that has negative impacts. I'm not sure what to do about the escaping.
A few things I found while looking for pathological cases:
- `http://do.ma.in:80/what/ever` is normalized to `http://do.ma.in/what/ever`, and so is `https://do.ma.in/what/ever`, but `https://do.ma.in:442/what/ever` comes out as `http://do.ma.in:443/what/ever`
- `http://10.2.3.4/hello/world.html` comes out as `http://2.3.4/hello/world.html`
- Spaces and `%20` in query strings are normalized to `+`, but `%20` and `+` in path are left as is (space is changed to `%20`)
- UTF-8 in path is %-quoted, but `%27` is turned into `'` (BUT `'` is left alone, so the result is uniform, but `'` is officially a delimiter in https://datatracker.ietf.org/doc/html/rfc3986#section-2.2)

The above two were seen in the wild in: http://www.seychellesnewsagency.com/articles/19841/Over++Seychelles%27+households+received+financial+assistance+following+Dec.++disasters
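
For comparison, here is a rough sketch of how Python's standard `urllib.parse` treats the same inputs. It is not the normalizer discussed in this issue and makes no claim about what that code does internally; the helper names `uses_default_port` and `requote_query` are made up for illustration, and the `DEFAULT_PORTS` table only covers the two schemes mentioned above.

```python
from urllib.parse import parse_qsl, quote, unquote, urlencode, urlsplit

# Default ports for the two schemes that appear in the report (assumption:
# no other schemes matter here).
DEFAULT_PORTS = {"http": 80, "https": 443}


def uses_default_port(url: str) -> bool:
    """True when the URL has no explicit port or names its scheme's default."""
    parts = urlsplit(url)
    return parts.port is None or parts.port == DEFAULT_PORTS.get(parts.scheme)


def requote_query(query: str) -> str:
    """Round-trip a query string through the stdlib form encoding."""
    return urlencode(parse_qsl(query, keep_blank_values=True))


# Port handling: :80 is the default for http; :442 is not a default for https.
print(uses_default_port("http://do.ma.in:80/what/ever"))
print(uses_default_port("https://do.ma.in:442/what/ever"))

# A dotted-quad host is kept whole by the stdlib parser.
print(urlsplit("http://10.2.3.4/hello/world.html").hostname)

# Query strings: space, %20 and + all decode to a space and re-encode as +.
print(requote_query("q=a b%20c+d"))

# Paths: quote() escapes a space as %20 and ' as %27 unless ' is marked safe.
print(quote("/Over Seychelles' households"))
print(quote("/Seychelles'", safe="/'"))
print(unquote("/Seychelles%27"))
```

A baseline like this could serve as a set of regression cases: any normalizer that changes the scheme, the port number, or the host octets for these inputs is doing more than dropping a default port.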