hillarysanders / deep_learning

Learnin' about Deep Learnin!

URL model comparisons #30

Open hillarysanders opened 7 years ago

hillarysanders commented 7 years ago

It looks like, for the VT data, about 2.4% of all URLs contain characters outside the URL-valid set (ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-~!*'();:@&=+$,/?#[]%). Within those URLs, about 2.5% of the characters are ‘special’, i.e. not in that set.

Here’s the distribution of special characters over a sample of one million URLs:

Counter({u'\x01': 90,
         u'\x02': 72,
         u'\x03': 121,
         u'\x04': 294,
         u'\x05': 16,
         u'\x06': 10,
         u'\x07': 93,
         u'\x08': 4,
         u'\x0b': 96,
         u'\x0c': 92,
         u'\x0e': 42,
         u'\x11': 258,
         u'\x12': 117,
         u'\x13': 100,
         u'\x14': 119,
         u'\x15': 164,
         u'\x16': 118,
         u'\x17': 49,
         u'\x18': 17,
         u'\x19': 109,
         u'\x1b': 1,
         u'\x1c': 84,
         u'\x1d': 48,
         u'\x1e': 5,
         u'\x1f': 23,
         u' ': 33760,
         u'"': 70356,
         u'<': 971,
         u'>': 4111,
         u'\\': 14896,
         u'^': 1229,
         u'`': 138,
         u'{': 31548,
         u'|': 75711,
         u'}': 31605,
         u'\x7f': 1})
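
For reference, here is a minimal sketch of how numbers like these could be computed. It is not the code that produced the counts above: the function name `special_char_stats` and the three sample URLs are made up for illustration, and it is written for Python 3 (the output above is from Python 2). Only the URL-valid character set is taken from the comment.

```python
from collections import Counter

# The URL-valid character set quoted in the comment above.
URL_VALID = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "_.-~!*'();:@&=+$,/?#[]%"
)


def special_char_stats(urls):
    """Return (share of URLs containing special chars,
    share of special chars within those URLs,
    Counter of the special chars seen)."""
    counts = Counter()
    n_urls = 0
    n_flagged = 0           # URLs with at least one special char
    n_chars_in_flagged = 0  # total characters in those URLs
    for url in urls:
        n_urls += 1
        special = [ch for ch in url if ch not in URL_VALID]
        if special:
            n_flagged += 1
            n_chars_in_flagged += len(url)
            counts.update(special)
    url_share = n_flagged / n_urls if n_urls else 0.0
    char_share = (
        sum(counts.values()) / n_chars_in_flagged if n_chars_in_flagged else 0.0
    )
    return url_share, char_share, counts


# Hypothetical sample, not drawn from the VT data.
sample = [
    "http://example.com/a?b=c",
    'http://example.com/x?q="y"',
    "http://example.com/{id}|z",
]
url_share, char_share, counts = special_char_stats(sample)
```

On the real data, `url_share` would correspond to the 2.4% figure and `char_share` to the 2.5% figure, assuming both were measured this way.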

Rich: "for looking at urls presented to endpoints that's probably not a huge deal, but Stupid Encoding Tricks[tm] are a mainstay of injection attacks against web applications, so it might be worth trying to find where in the collection pipeline the decoding is happening and trying to remove it"
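
A small illustration of Rich's point, as a sketch only: characters such as `"`, `{`, `|` and `}` would normally be percent-encoded in a URL, so seeing them raw suggests something upstream has already decoded them. The example URL is invented, and the choice of `safe` characters is an assumption made to keep the URL structure readable. `quote` and `unquote` are from Python 3's standard `urllib.parse`.

```python
from urllib.parse import quote, unquote

# Hypothetical URL containing several of the 'special' characters counted above.
raw = 'http://example.com/search?q={"a"|"b"}'

# Percent-encode everything except the structural characters listed in `safe`.
encoded = quote(raw, safe=":/?=&")

# Decoding restores the raw special characters, which is the step Rich
# suggests locating in the collection pipeline.
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

If the collected URLs had been stored in their encoded form, the `{`, `"` and `|` entries would instead show up as `%7B`, `%22` and `%7C`, all of which fall inside the URL-valid set.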