It looks like for VT data, about 2.4% of all URLs contain non-URL-valid characters, i.e. characters outside of
`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-~!*'();:@&=+$,/?#[]%`.
For those URLs, about 2.5% of the characters are 'special'. Here's the distribution of special chars over a million URL samples:
Counter({u'\x01': 90,
u'\x02': 72,
u'\x03': 121,
u'\x04': 294,
u'\x05': 16,
u'\x06': 10,
u'\x07': 93,
u'\x08': 4,
u'\x0b': 96,
u'\x0c': 92,
u'\x0e': 42,
u'\x11': 258,
u'\x12': 117,
u'\x13': 100,
u'\x14': 119,
u'\x15': 164,
u'\x16': 118,
u'\x17': 49,
u'\x18': 17,
u'\x19': 109,
u'\x1b': 1,
u'\x1c': 84,
u'\x1d': 48,
u'\x1e': 5,
u'\x1f': 23,
u' ': 33760,
u'"': 70356,
u'<': 971,
u'>': 4111,
u'\\': 14896,
u'^': 1229,
u'`': 138,
u'{': 31548,
u'|': 75711,
u'}': 31605,
u'\x7f': 1})
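For reference, here's a rough sketch of how the numbers above (and the non-ASCII check in the task list below) could be computed; `urls` is just an assumed in-memory list of URL strings, not anything from the actual pipeline:

```python
from collections import Counter

# Characters treated as URL-valid for this analysis (same set as above).
VALID = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "_.-~!*'();:@&=+$,/?#[]%"
)

def special_char_stats(urls):
    """% of URLs with special chars, % of special chars within those URLs, and their distribution."""
    dist = Counter()
    special_urls = special_chars = chars_in_special_urls = 0
    for url in urls:
        bad = [c for c in url if c not in VALID]
        if bad:
            special_urls += 1
            special_chars += len(bad)
            chars_in_special_urls += len(url)
            dist.update(bad)
    pct_urls = 100.0 * special_urls / len(urls)
    pct_chars = 100.0 * special_chars / max(chars_in_special_urls, 1)
    return pct_urls, pct_chars, dist

def pct_non_ascii_urls(urls):
    """% of URLs containing at least one non-ASCII character (for the checklist item below)."""
    return 100.0 * sum(any(ord(c) > 127 for c in u) for u in urls) / len(urls)
```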
Rich: "for looking at urls presented to endpoints that's probably not a huge deal, but Stupid Encoding Tricks[tm] are a mainstay of injection attacks against web applications, so it might be worth trying to find where in the collection pipeline the decoding is happening and trying to remove it"
[x] check out what % of characters / URLs in the new dataset contain Unicode (or rather, non-ASCII) characters (covered by the sketch above)
[x] clean up sqlite db (remove copied columns, switch train and test values, etc)
[x] modify sequence code to not use whitespace
[x] modify convnet model to project to 87 instead of 101 dimensions? (rough embedding sketch below the list)
[ ] add metrics to TensorBoard: detection (true-positive) rates at fixed FP rates of 10^-2, 10^-3, 10^-4, 10^-5, and maybe AUC under FPR 10^-2? (sketch below the list)
[x] check out Rich's C ngram code, incorporate into behavioral_model
[ ] test out:
[ ] how does punycode relate to all of this? (small example below the list)
[ ] db stuff:
[ ] uh oh! Training is getting super slow for the later epochs! https://github.com/fchollet/keras/issues/3576. Probably need to adjust the Adam optimizer settings (sketch below)...
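Re: projecting to 87 instead of 101 dimensions: the URL-valid set above has 85 characters, so 87 presumably leaves room for a padding and an unknown token (that's a guess, not something verified). A throwaway Keras sketch of a char-level convnet with that vocabulary size, with placeholder layer sizes rather than the real model:

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

VOCAB_SIZE = 87   # assumption: 85 URL-valid chars + padding + unknown tokens
MAX_LEN = 200     # placeholder max URL length

model = Sequential([
    Embedding(input_dim=VOCAB_SIZE, output_dim=32, input_length=MAX_LEN),
    Conv1D(filters=128, kernel_size=5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),  # e.g. benign/malicious score
])
```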
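Re: the fixed-FP-rate metrics item, a sketch using sklearn's ROC utilities (the function name and the partial-AUC cutoff are mine; this would get wired into whatever callback logs to TensorBoard):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def tpr_at_fixed_fprs(y_true, y_score, fprs=(1e-2, 1e-3, 1e-4, 1e-5)):
    """True-positive rate at fixed false-positive rates, plus AUC restricted to FPR <= 1e-2."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    rates = {target: float(np.interp(target, fpr, tpr)) for target in fprs}
    low = fpr <= 1e-2
    partial_auc = auc(fpr[low], tpr[low]) if low.sum() > 1 else float('nan')
    return rates, partial_auc
```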
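Re: punycode: IDNA/punycode hostnames are pure ASCII on the wire, so they'd sail through the URL-valid character check above while still encoding non-ASCII labels. Python's built-in idna codec round-trips them:

```python
# An ASCII-only punycode hostname still decodes to non-ASCII characters.
print(b'xn--bcher-kva.example'.decode('idna'))    # -> bücher.example
print('bücher.example'.encode('idna'))            # -> b'xn--bcher-kva.example'
```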
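Re: the training slowdown, one thing to try (assuming the optimizer really is the culprit, which that Keras issue doesn't guarantee) is pinning the Adam hyperparameters explicitly instead of relying on defaults; values below are illustrative, not tuned:

```python
from keras.optimizers import Adam

# Illustrative (untuned) Adam settings; `model` is whatever Keras model is being trained.
opt = Adam(lr=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.0, clipnorm=1.0)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
```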