rapid7 / godap

The Data Analysis Pipeline
MIT License
17 stars 10 forks source link

Add recog support #7

Closed dabdine-r7 closed 5 years ago

dabdine-r7 commented 5 years ago

What is this?

This patch adds recog support based on @hdm's port of recog to go: https://github.com/hdm/recog-go.

This enables users of godap to feed lines into recog for automated fingerprinting use cases.

Testing done

Automated testing using goconvey. See recog_test.go

hdm commented 5 years ago

Regarding the matches vs filename, happy to change this in recog-go if it helps. I thought the plan for recog was to remove the matches key from the XML, which is why I went with the filename. Easy to implement both for the lookup as well.

dabdine-r7 commented 5 years ago

Regarding the matches vs filename, happy to change this in recog-go if it helps. I thought the plan for recog was to remove the matches key from the XML, which is why I went with the filename. Easy to implement both for the lookup as well.

@hdm Given recog-java already uses the filename, I think keeping the filename and moving away from the matches field is a fine suggestion. Perhaps support for either using configuration somehow? That way the API wouldn't necessarily have to change...

For example, setting an envar like RECOG_DB_NAMEFROM could either be unset, set to filename or set to attribute, with filename being the default...

@jhart-r7 may have thoughts as well.

jhart-r7 commented 5 years ago

It would be great to see tests added for this. Arguably required. The tests you've added for the arg parsing are a good start. You can (currently) pull tests version of maxmind from https://github.com/maxmind/geoip-api-php/raw/master/tests/data/ that should be compatible. The new bats work that I put in recently should make much of this easier and save us time because we could reuse the tests in dap (yes, we still need to add the corresponding bats support to dap)

Regarding matches vs the file name, I recall at one point the preference was to use matches so that files could be arbitrarily named (particularly in the even of supporting custom files not necessarily distributed by recog as in Nexpose).

dabdine-r7 commented 5 years ago

@jhart-r7 there are tests for parsing args and the recog filter -- are there other test cases you want to add? I'm unsure how the maxmind test databases fit in with this filter, but perhaps you meant to use those for the GeoIP filter (we can do that in a separate PR)?

I think having the attribute separate from the file name makes sense. Why don't we do both? My proposal in that case would be to make the current API call use the matches attr (as core ruby recog does), and use a ""FromFile" pattern (or something similar) to do the same thing, using the file name instead. Thoughts?

Tagging @gschneider-r7 who maintains the java recog client too.

dabdine-r7 commented 5 years ago

cherry picked the args code and tests to master, then rebased this branch in my repo on top of the new master, so the PR is cleaner now (no more discussion around args.go/args_test.go)

dabdine-r7 commented 5 years ago

@hdm After talking with @jhart-r7, looks like the proposal is to add an argument to configure recog to allow supplying the database name OR file name through the Match* methods database name parameter, as opposed to creating separate functions for each.

I can push a PR up in a bit. Also, I think I may throw in a method to load databases from disk as opposed to using the built in databases.

hdm commented 5 years ago

I pushed a change which allows it to work both ways without changes (same DB is now keyed off the matches attribute and the filename).

hdm commented 5 years ago

For loading new DBs, I think the design already supports that, but will double check and add an example/test case if needed.

hdm commented 5 years ago

v0.0.16 adds nition.LoadFingerprintsDir(path) as an alternative to the built-in fingerprints

hdm commented 5 years ago

it might make more sense to make these functions on the FingerprintSet struct in the future (add one-off files and a mix of built-in, disk, and byte blobs to the same fingerprint set).