Add Elasticsearch output plugin (#374)

micrictor commented 3 years ago

Creates an Elasticsearch output plugin using the _bulk API.

In the config file, the GRR Admin sets:

url - Required; the Elasticsearch endpoint to index the documents at
token - Optional; The authentication token to include with the indexing requests
verify_https - Optional; Whether to validate the server certificate. Includes a disclaimer on why this is a bad idea
index - Optional; The elasticsearch index to place the flow results into; Default "grr-flows"

When executing the flow, the operator can specify:

tags - Arbitrary tags placed on the flow document(s)
index - Destination index for the documents, if desired to override the configured index

Note that, as a collateral change, I had to add an optional indent parameter to the json.Dump method to avoid messing up the line-delimited nature of the Elasticsearch API.

This resolves #374

micrictor commented 3 years ago

Ack on all accounts - debug print is an obvious oopsie. I’ll try to figure out a... less bad way to decompose the query data. I contemplated making a custom object just to represent this relationship, but that felt like it would be taking away from the readability of the test.

I see various python2 interop stuff throughout the code base, so I’ll use format in case there’s someone still relying on that.

On Fri, Apr 9, 2021 at 4:47 AM Max Vogler @.***> wrote:

@.**** approved this pull request.

Thank you for this amazing pull request! I'm pre-approving, because there are only minor code health changes to be made and I am out of office next week. One of my teammates can merge next week or I can merge the week after when I'm back.

In grr/server/grr_response_server/output_plugins/elasticsearch_plugin_test.py https://github.com/google/grr/pull/923#discussion_r610555143:

patcher = mock.patch.object(requests, 'post')

with patcher as patched:

plugin.ProcessResponses(plugin_state, messages)

plugin.Flush(plugin_state)

plugin.UpdateState(plugin_state)

return patched

def _ParseEvents(self, patched):

request = patched.call_args[KWARGS]['data']

Elasticsearch bulk requests are line-deliminated pairs, where the first

line is the index command and the second is the actual document to index

split_request = request.split('\n')

print(split_request)

Please remove the print() debugging statement.

In grr/server/grr_response_server/output_plugins/elasticsearch_plugin_test.py https://github.com/google/grr/pull/923#discussion_r610555874:

plugin.ProcessResponses(plugin_state, messages)

plugin.Flush(plugin_state)

plugin.UpdateState(plugin_state)

return patched

def _ParseEvents(self, patched):

request = patched.call_args[KWARGS]['data']

Elasticsearch bulk requests are line-deliminated pairs, where the first

line is the index command and the second is the actual document to index

split_request = request.split('\n')

print(split_request)

update_pairs = [(split_request[i], split_request[i + 1])

for i in range(0, len(split_request), 2)]

parsed_pairs = [(json.Parse(i[0]), json.Parse(i[1])) for i in update_pairs]

Instead of deconstructing the tuple, parsing the elements, and reconstructing a tuple, could you call json.Parse on the whole list in split_requests? This might simplify the code a little, e.g.:

split_request = [json.Parse(line) for line in request.split("\n")]

In grr/server/grr_response_server/output_plugins/elasticsearch_plugin.py https://github.com/google/grr/pull/923#discussion_r610559497:

def _SendEvents(self, events: List[JsonDict]) -> None:

"""Uses the Elasticsearch bulk API to index all the events in a single

request.

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docs-bulk.html

"""

if self._token:

headers = {"Authorization": "Basic {}".format(self._token)}

else:

headers = {}

index_command = json.Dump({"index": {"_index": self._index}}, indent=None)

Each index operation is two lines, the first defining the index settings,

the second is the actual document to be indexed

data = "\n".join(

index_command + "\n" + json.Dump(event, indent=None)

Please use string formatting (f-strings or format()) instead of manual string concatenation, e.g.:
data = "\n".join(
  f"{index_command}\n{json.Dump(event, indent=None)}"
  for event in events)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/grr/pull/923#pullrequestreview-632315856, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTCAGLTDMXI32KB3P2S36DTH3SM5ANCNFSM42UGAN2Q .

micrictor commented 3 years ago

Pushed in the updates discussed above - mostly removing python2 compatibility stuff, adding docs for the new indent parameter, and other minor formatting fixes.

As noted above, I didn't port the tests to use responses, and don't really intend on doing so. If that's blocking, this PR will likely stay open for quite some time before I circle back to that (or someone else picks it up)

max-vogler commented 3 years ago

I suggest merging without the responses test refactoring in order to merge this functionality. The test is very close to splunk_plugin_test.py, which uses the same mocking. In the future, one change can refactor both tests. @panhania WDYT?

panhania commented 3 years ago

Sounds reasonable to me.

max-vogler commented 3 years ago

Thanks a lot for your contribution @micrictor, well done! If you like, send me a PR that adds your name and email to ACKNOWLEDGEMENTS.

google / grr

Add Elasticsearch output plugin (#374) #923

Thank you for this amazing pull request! I'm pre-approving, because there are only minor code health changes to be made and I am out of office next week. One of my teammates can merge next week or I can merge the week after when I'm back.

Elasticsearch bulk requests are line-deliminated pairs, where the first

line is the index command and the second is the actual document to index

Please remove the print() debugging statement.

Elasticsearch bulk requests are line-deliminated pairs, where the first

line is the index command and the second is the actual document to index

Each index operation is two lines, the first defining the index settings,

the second is the actual document to be indexed