It looks like for VT data, about 2.4% of all URLs contain non-URL-valid characters, i.e. characters outside of
`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-~!*'();:@&=+$,/?#[]%`.
For those URLs, about 2.5% of the characters are 'special'. Here's the distribution of special chars over a million URL samples:
Counter({u'\x01': 90,
u'\x02': 72,
u'\x03': 121,
u'\x04': 294,
u'\x05': 16,
u'\x06': 10,
u'\x07': 93,
u'\x08': 4,
u'\x0b': 96,
u'\x0c': 92,
u'\x0e': 42,
u'\x11': 258,
u'\x12': 117,
u'\x13': 100,
u'\x14': 119,
u'\x15': 164,
u'\x16': 118,
u'\x17': 49,
u'\x18': 17,
u'\x19': 109,
u'\x1b': 1,
u'\x1c': 84,
u'\x1d': 48,
u'\x1e': 5,
u'\x1f': 23,
u' ': 33760,
u'"': 70356,
u'<': 971,
u'>': 4111,
u'\\': 14896,
u'^': 1229,
u'`': 138,
u'{': 31548,
u'|': 75711,
u'}': 31605,
u'\x7f': 1})
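For reference, here's a rough sketch of how the numbers above (and the non-ASCII check in the task list below) could be computed; `urls` is just an assumed in-memory list of URL strings, not anything from the actual pipeline:

```python
from collections import Counter

# Characters treated as URL-valid for this analysis (same set as above).
VALID = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "_.-~!*'();:@&=+$,/?#[]%"
)

def special_char_stats(urls):
    """% of URLs with special chars, % of special chars within those URLs, and their distribution."""
    dist = Counter()
    special_urls = special_chars = chars_in_special_urls = 0
    for url in urls:
        bad = [c for c in url if c not in VALID]
        if bad:
            special_urls += 1
            special_chars += len(bad)
            chars_in_special_urls += len(url)
            dist.update(bad)
    pct_urls = 100.0 * special_urls / len(urls)
    pct_chars = 100.0 * special_chars / max(chars_in_special_urls, 1)
    return pct_urls, pct_chars, dist

def pct_non_ascii_urls(urls):
    """% of URLs containing at least one non-ASCII character (for the checklist item below)."""
    return 100.0 * sum(any(ord(c) > 127 for c in u) for u in urls) / len(urls)
```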
Rich: "for looking at urls presented to endpoints that's probably not a huge deal, but Stupid Encoding Tricks[tm] are a mainstay of injection attacks against web applications, so it might be worth trying to find where in the collection pipeline the decoding is happening and trying to remove it"
[x] check out what % of characters / URLs in the new dataset contain Unicode (or rather, non-ASCII) characters (covered by the sketch above)
[x] clean up sqlite db (remove copied columns, switch train and test values, etc)
[x] modify sequence code to not use whitespace
[x] modify convnet model to project to 87 instead of 101 dimensions? (rough embedding sketch below the list)
[ ] add metrics to TensorBoard: detection (true-positive) rates at fixed FP rates of 10^-2, 10^-3, 10^-4, 10^-5, and maybe AUC under FPR 10^-2? (sketch below the list)
[x] check out Rich's C ngram code, incorporate into behavioral_model
[ ] test out:
[ ] how does punycode relate to all of this? (small example below the list)
[ ] db stuff:
[ ] uh oh! Training is getting super slow for the later epochs! https://github.com/fchollet/keras/issues/3576. Probably need to adjust the Adam optimizer settings (sketch below)...
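Re: projecting to 87 instead of 101 dimensions: the URL-valid set above has 85 characters, so 87 presumably leaves room for a padding and an unknown token (that's a guess, not something verified). A throwaway Keras sketch of a char-level convnet with that vocabulary size, with placeholder layer sizes rather than the real model:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

VOCAB_SIZE = 87   # assumption: 85 URL-valid chars + padding + unknown tokens
MAX_LEN = 200     # placeholder max URL length

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=32, input_length=MAX_LEN),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),  # e.g. benign/malicious score
])
```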
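Re: the fixed-FP-rate metrics item, a sketch using sklearn's ROC utilities (the function name and the partial-AUC cutoff are mine; this would get wired into whatever callback logs to TensorBoard):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def tpr_at_fixed_fprs(y_true, y_score, fprs=(1e-2, 1e-3, 1e-4, 1e-5)):
    """True-positive rate at fixed false-positive rates, plus AUC restricted to FPR <= 1e-2."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    rates = {target: float(np.interp(target, fpr, tpr)) for target in fprs}
    low = fpr <= 1e-2
    partial_auc = auc(fpr[low], tpr[low]) if low.sum() > 1 else float('nan')
    return rates, partial_auc
```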
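Re: punycode: IDNA/punycode hostnames are pure ASCII on the wire, so they'd sail through the URL-valid character check above while still encoding non-ASCII labels. Python's built-in idna codec round-trips them:

```python
# An ASCII-only punycode hostname still decodes to non-ASCII characters.
print(b'xn--bcher-kva.example'.decode('idna'))    # -> bücher.example
print('bücher.example'.encode('idna'))            # -> b'xn--bcher-kva.example'
```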
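Re: the training slowdown, one thing to try (assuming the optimizer really is the culprit, which that Keras issue doesn't guarantee) is pinning the Adam hyperparameters explicitly instead of relying on defaults; values below are illustrative, not tuned:

```python
from keras.optimizers import Adam

# Illustrative (untuned) Adam settings; `model` is whatever Keras model is being trained.
opt = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.0, clipnorm=1.0)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
```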