mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

possible url normalization issues #72

Open philbudne opened 11 months ago

philbudne commented 11 months ago

A few things I found while looking for pathological cases:

http://do.ma.in:80/what/ever is normalized to http://do.ma.in/what/ever, and so is https://do.ma.in/what/ever, but https://do.ma.in:442/what/ever comes out as http://do.ma.in:443/what/ever (the scheme is rewritten but the port is kept).

http://10.2.3.4/hello/world.html comes out as http://2.3.4/hello/world.html
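
A possible guard for the IP case, as a minimal sketch. It assumes the trimming comes from a step that strips a leading `www`-style or numeric host label; `LEADING_LABEL` and both function names are hypothetical stand-ins, not the library's actual pattern or API.

```python
import ipaddress
import re

# Hypothetical stand-in for a subdomain-stripping rule (NOT the library's pattern).
# It removes a leading "www.", "m.", or all-digit label when two more labels follow.
LEADING_LABEL = re.compile(r"^(www|m|\d+)\.(?=[^.]+\.[^.]+)")


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4 or IPv6 address rather than a DNS name."""
    try:
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def strip_leading_label(host: str) -> str:
    """Strip a leading label from DNS names, but leave IP literals untouched."""
    if is_ip_literal(host):
        return host
    return LEADING_LABEL.sub("", host, count=1)
```

Without the `is_ip_literal` check, the stand-in pattern reproduces the symptom above: it treats the first octet `10` as a strippable label.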

Spaces and %20 in query strings are normalized to +, but %20 and + in the path are left as is, while a literal space in the path is changed to %20.

UTF-8 in the path is %-quoted, but %27 is turned into ' (a literal ' is left alone, so the result is uniform, but ' is officially a delimiter per https://datatracker.ietf.org/doc/html/rfc3986#section-2.2).

The above two were seen in the wild in: http://www.seychellesnewsagency.com/articles/19841/Over++Seychelles%27+households+received+financial+assistance+following+Dec.++disasters
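
For comparison, here is a rough sketch of what uniform escaping could look like using only Python's standard `urllib.parse`. The function names and the choice of safe characters (`/`, `+`, `'`) are assumptions made for illustration, chosen to match the behaviour described above; this is not what the library currently does.

```python
from urllib.parse import parse_qsl, quote, unquote, urlencode


def canonical_path(path: str) -> str:
    """Decode every %-escape, then re-encode with one fixed safe set.

    Caveat: an encoded slash (%2F) is decoded to "/" and stays that way,
    which changes the path structure; a real implementation would need
    to special-case it.
    """
    return quote(unquote(path), safe="/+'")


def canonical_query(query: str) -> str:
    """Parse the query and re-serialize it so space, %20 and + all become +."""
    return urlencode(parse_qsl(query, keep_blank_values=True))


path = ("/articles/19841/Over++Seychelles%27+households+received+financial"
        "+assistance+following+Dec.++disasters")
print(canonical_path(path))
print(canonical_query("q=a%20b+c"))
```

With this safe set, `%27` and a literal `'` end up in the same form, and a space and `%20` in the path both come out as `%20`. Whether `+` in a path should stay literal or become `%2B` is a separate decision this sketch does not settle.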

rahulbot commented 10 months ago

Thinking about the application domain, I think ports should be ignored for normalization. The IP address being trimmed seems like a bug, though perhaps not one that has negative impacts. I'm not sure what to do about the escaping.
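
A minimal sketch of the "ignore ports" idea, using only the standard library. `ignore_port` is a hypothetical helper, and collapsing the scheme to `http` is an assumption taken from the behaviour reported above, not a confirmed design choice.

```python
from urllib.parse import urlsplit, urlunsplit


def ignore_port(url: str) -> str:
    """Drop any port (and userinfo) and collapse the scheme to http."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # lowercased, without port or userinfo
    if ":" in host:
        # IPv6 literal: urlsplit strips the brackets, so restore them.
        host = f"[{host}]"
    return urlunsplit(("http", host, parts.path, parts.query, parts.fragment))


for url in (
    "http://do.ma.in:80/what/ever",
    "https://do.ma.in/what/ever",
    "https://do.ma.in:443/what/ever",
):
    print(ignore_port(url))
```

Under these assumptions, URLs that differ only in scheme or port converge on the same normalized form, so the https-with-explicit-port case no longer produces a distinct result.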