idaholab / Malcolm

Malcolm is a powerful, easily deployable network traffic analysis tool suite for full packet capture artifacts (PCAP files), Zeek logs and Suricata alerts.
https://idaholab.github.io/Malcolm/

support /attributes and /events endpoints from MISP feed for Zeek intel generation #336

Closed mmguero closed 4 months ago

mmguero commented 5 months ago

We've got some MISP capabilities. The code that handles grabbing MISP indicators is here and says in its comments:

    # download the URL and parse as JSON to figure out what it is. it could be:
    # - a manifest JSON (https://www.circl.lu/doc/misp/feed-osint/manifest.json)
    # - a directory listing containing a manifest.json (https://www.circl.lu/doc/misp/feed-osint/)
    # - a directory listing of misc. JSON files without a manifest.json

Some colleagues at USAF have been poking at it and discussing with us how to expand the range of feed formats it can handle. They've suggested we look at running MISP with docker compose and pulling from it directly.

I'm going to quote some of that discussion here:


Hey @mmguero I figured out what you should be querying from MISP to integrate with Malcolm. They're called "attributes".

    echo "requesting $RESOURCE"
    curl \
        --header "Authorization:$MISP_API_KEY" \
        --header "Accept: application/json" \
        --header "Content-Type: application/json" \
        $MISP_URL$RESOURCE

- The result will be JSON: a *massive* list of dictionaries (see the `id` field below, at 626143!), each of which resembles the following:
{
    "id": "626143",
    "event_id": "166",
    "object_id": "0",
    "object_relation": null,
    "category": "Network activity",
    "type": "ip-dst",
    "value1": "80.51.7.66",
    "value2": "",
    "to_ids": true,
    "uuid": "37bb0fca-9043-4e09-a758-0efb3eae9937",
    "timestamp": "1704950232",
    "distribution": "5",
    "sharing_group_id": "0",
    "comment": "",
    "deleted": false,
    "disable_correlation": false,
    "first_seen": null,
    "last_seen": null,
    "value": "80.51.7.66"
},

{ ...

As you might guess, the meat of this is the `"value"` member of the JSON, as that IP address is the dirty IP address that MISP is trying to say is malicious. I assume the `value1` and `value2` fields should be parsed as well for similar reasons.
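As a minimal sketch of that parsing step, the following pulls the indicator strings out of one attribute dictionary shaped like the sample above. The function name `indicator_values` is hypothetical, and the preference for `value1`/`value2` over `value` is an assumption based on MISP splitting composite values (for example `ip-dst|port`) across those two fields.

```python
from typing import Any


def indicator_values(attribute: dict[str, Any]) -> list[str]:
    """Return the non-empty indicator strings of one MISP attribute.

    Assumption: value1/value2 hold the parts of the attribute's value,
    so they are preferred; "value" is only used as a fallback.
    """
    parts = [attribute.get(key) or "" for key in ("value1", "value2")]
    parts = [part for part in parts if part]
    if not parts and attribute.get("value"):
        parts = [attribute["value"]]
    return parts


# Trimmed copy of the sample attribute from the discussion above.
sample = {
    "id": "626143",
    "type": "ip-dst",
    "value1": "80.51.7.66",
    "value2": "",
    "value": "80.51.7.66",
}
print(indicator_values(sample))
```

Which of the returned parts is actually useful depends on the attribute `type`, so a real implementation would likely branch on that field as well.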

What I'd recommend is standing up a MISP instance, loading the full list of default feeds, enabling them, and fetching from them (a guide is [here](https://socfortress.medium.com/part-10-misp-threat-intel-68131b18f719)). 

And then waiting a good few hours for it to fetch down this massive list of attributes.

After you feel you have a good number of attributes (use the bash script at the beginning of this post and note the `id`, as that is the current size of the list), iterate over each of them and enumerate what members are possible. Then decide which members are relevant to Malcolm and how they should be appropriately parsed. Perhaps that is just the IP addresses ("value" for relevant attributes), but you would know better than me in this regard!

--------------

I also found a bug, though it is in MISP: when you pull the attributes list, it doesn't give you the complete list. It would only give me 60 attributes, despite there actually being about 4,000,000. Thus, simply GET-ing /attributes as I described will not give you the full list.

The solution is to find out how long the list is by pulling the attribute list and checking each for the highest id number (it appears that the first element is the highest ID, but I do not conclusively know this. Hence, check each ID from the list given), and then individually pulling all IDs from 1 up to that number. 

Code snippet in the comment below to avoid blowing up the channel. Also, note that when we access attributes individually, we now have an outer single-element dictionary of just "Attribute."

--------------

    #!/usr/bin/env python3

    import requests

    MISP_URL = "your misp url"
    MISP_API_KEY = "your misp api key"

    headers = {
        "Authorization": MISP_API_KEY,
        "Accept": "application/json",
        "Content-Type": "application/json",
    }

    # show the length of the JSON retrieved, which was 60
    r = requests.get(f"{MISP_URL}/attributes", headers=headers)
    attributes = r.json()
    print("length of GET '/attributes' json (which shows a small number): " + str(len(attributes)))

    print("===================================")

    # get the largest ID number
    largest_id = 0
    for attribute in attributes:
        if "id" in attribute and int(attribute["id"]) > largest_id:
            print(f"new largest id {attribute['id']}!")
            largest_id = int(attribute["id"])
        else:
            print(f"not a new largest ID: {attribute.get('id')}")
    print("largest id (size of actual list of attributes): " + str(largest_id))

    print("===================================")

    # iterate over each id individually (including the largest one)
    # and request the JSON corresponding to it
    for attribute_id in range(largest_id + 1):
        r = requests.get(f"{MISP_URL}/attributes/view/{attribute_id}", headers=headers)
        item = r.json()
        if "Attribute" in item:
            print(f"id = {item['Attribute']['id']}, value = {item['Attribute']['value']}")
        else:
            # this happens on id=0, since attribute IDs start at 1
            print(f"NOT AN ATTRIBUTE: id = {attribute_id}, json={item}")


--------------

Sample output:

    length of GET '/attributes' json (which shows a small number): 60

    new largest id 4000346!
    not a new largest ID: 4000345
    not a new largest ID: 4000342
    not a new largest ID: 4000336
    not a new largest ID: 4000333
    not a new largest ID: 4000332
    not a new largest ID: 4000330
    not a new largest ID: 4000328
    not a new largest ID: 4000324
    not a new largest ID: 4000321
    not a new largest ID: 4000320
    not a new largest ID: 4000319
    not a new largest ID: 4000318
    not a new largest ID: 4000314
    not a new largest ID: 4000312
    not a new largest ID: 4000309
    not a new largest ID: 4000306
    not a new largest ID: 4000303
    not a new largest ID: 4000301
    not a new largest ID: 4000299
    not a new largest ID: 4000298
    not a new largest ID: 4000294
    not a new largest ID: 4000291
    not a new largest ID: 4000289
    not a new largest ID: 4000288
    not a new largest ID: 4000286
    not a new largest ID: 4000283
    not a new largest ID: 4000281
    not a new largest ID: 4000279
    not a new largest ID: 4000277
    not a new largest ID: 4000276
    not a new largest ID: 4000275
    not a new largest ID: 4000274
    not a new largest ID: 4000273
    not a new largest ID: 4000271
    not a new largest ID: 4000268
    not a new largest ID: 4000265
    not a new largest ID: 4000263
    not a new largest ID: 4000260
    not a new largest ID: 4000256
    not a new largest ID: 4000251
    not a new largest ID: 4000249
    not a new largest ID: 4000247
    not a new largest ID: 4000243
    not a new largest ID: 4000238
    not a new largest ID: 4000234
    not a new largest ID: 4000231
    not a new largest ID: 4000227
    not a new largest ID: 4000225
    not a new largest ID: 4000223
    not a new largest ID: 4000222
    not a new largest ID: 4000219
    not a new largest ID: 4000216
    not a new largest ID: 4000214
    not a new largest ID: 4000212
    not a new largest ID: 4000211
    not a new largest ID: 4000210
    not a new largest ID: 4000209
    not a new largest ID: 4000204
    not a new largest ID: 4000202
    largest id (size of actual list of attributes): 4000346

    NOT AN ATTRIBUTE: id = 0, json={'name': 'Invalid attribute', 'message': 'Invalid attribute', 'url': '/attributes/view/0'}
    id = 1, value = 101.32.254.178
    id = 2, value = 103.123.62.146
    id = 3, value = 103.151.125.131
    id = 4, value = 103.193.179.52
    id = 5, value = 104.131.72.118


and so on
mmguero commented 4 months ago
#!/usr/bin/env bash

RESOURCE="${1:-/attributes}"
MISP_URL=https://localhost:31443
MISP_API_KEY=xxxxxxxxxxx

echo "requesting $RESOURCE" >&2
curl -fsSLk \
        --header "Authorization:$MISP_API_KEY" \
        --header "Accept: application/json" \
        --header "Content-Type: application/json" \
        "${MISP_URL}${RESOURCE}"
mmguero commented 4 months ago

I believe we can use the `page`, `limit`, `type`, and `from`|`to`|`last` parameters to page over these attributes, rather than finding the highest ID and then pulling attributes one at a time from 1 up to that number.
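A rough sketch of that paging approach follows. The helper names (`iter_attributes`, `make_misp_fetcher`) are hypothetical. The use of a POST to `/attributes/restSearch` with a JSON body, and the `{"response": {"Attribute": [...]}}` response shape, are assumptions about the MISP automation API that should be checked against the instance in use; this is not necessarily how Malcolm does it.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

Page = list[dict]


def iter_attributes(fetch_page: Callable[[int, int], Page], limit: int = 1000) -> Iterator[dict]:
    """Yield attributes page by page until a short (or empty) page arrives."""
    page = 1
    while True:
        batch = fetch_page(page, limit)
        yield from batch
        if len(batch) < limit:
            break
        page += 1


def make_misp_fetcher(misp_url: str, api_key: str, attr_type: Optional[str] = None) -> Callable[[int, int], Page]:
    """Build a page fetcher for a MISP instance (endpoint and body are assumptions)."""
    headers = {
        "Authorization": api_key,
        "Accept": "application/json",
        "Content-Type": "application/json",
    }

    def fetch_page(page: int, limit: int) -> Page:
        body = {"returnFormat": "json", "page": page, "limit": limit}
        if attr_type:
            body["type"] = attr_type
        request = urllib.request.Request(
            f"{misp_url}/attributes/restSearch",
            data=json.dumps(body).encode("utf-8"),
            headers=headers,
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        return payload.get("response", {}).get("Attribute", [])

    return fetch_page


# Offline demonstration with an in-memory stand-in for the server.
fake_attributes = [{"id": str(i)} for i in range(1, 26)]


def fake_fetch(page: int, limit: int) -> Page:
    start = (page - 1) * limit
    return fake_attributes[start:start + limit]


print(sum(1 for _ in iter_attributes(fake_fetch, limit=10)))
```

Against a real instance the generator would be driven with something like `iter_attributes(make_misp_fetcher(MISP_URL, MISP_API_KEY, "ip-dst"))`. Note that this sketch does not disable certificate verification the way `curl -k` does, so a self-signed MISP certificate would need extra handling.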

mmguero commented 4 months ago

The updated documentation:


MISP

In addition to loading Zeek intelligence files on startup, Malcolm will automatically generate a Zeek intelligence file for all Malware Information Sharing Platform (MISP) JSON files found under ./zeek/intel/MISP.

Additionally, if a special text file named .misp_input.txt is found in ./zeek/intel/MISP, that file will be read and processed as a list of MISP feed URLs, one per line, according to the following format:

misp|misp_url|auth_key (optional)

For example:

misp|https://example.com/data/feed-osint/manifest.json|df97338db644c64fbfd90f3e03ba8870
misp|https://example.com/doc/misp/|
misp|https://example.com/attributes|a943f5ff506ee6198e996333e0b672b1
misp|https://example.com/events|a943f5ff506ee6198e996333e0b672b1
…
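To illustrate the line format, here is a small sketch of how one line of .misp_input.txt could be split into its parts. The `FeedEntry` type and `parse_feed_line` function are hypothetical names for illustration only, not Malcolm's actual parser.

```python
from typing import NamedTuple, Optional


class FeedEntry(NamedTuple):
    kind: str                # feed type, "misp" in the examples above
    url: str                 # feed or API URL
    auth_key: Optional[str]  # None when the optional key is omitted


def parse_feed_line(line: str) -> Optional[FeedEntry]:
    """Parse one 'misp|misp_url|auth_key (optional)' line; None if unusable."""
    parts = [part.strip() for part in line.strip().split("|")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    auth_key = parts[2] if len(parts) > 2 and parts[2] else None
    return FeedEntry(parts[0], parts[1], auth_key)


print(parse_feed_line("misp|https://example.com/doc/misp/|"))
print(parse_feed_line("misp|https://example.com/attributes|a943f5ff506ee6198e996333e0b672b1"))
```

A trailing `|` with nothing after it is treated the same as a missing key, which matches the second example line above.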

Malcolm will attempt to connect to the MISP feed(s) and retrieve Attribute objects of MISP events and convert them to the Zeek intelligence format as described above. There are publicly available MISP feeds and communities, or users may run their own MISP instance.

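When Malcolm connects to the URLs for the MISP feeds in .misp_input.txt, it will attempt to determine the format of the data served and process it accordingly. This could be presented as:

- a manifest JSON file
- a directory listing containing a manifest.json
- a directory listing of miscellaneous JSON files without a manifest.json
- a list of Events returned from a MISP instance's /events endpoint
- a list of Attributes returned from a MISP instance's /attributes endpoint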

Note that only a subset of MISP attribute types can be expressed with the Zeek intelligence indicator types. MISP attributes with other types will be silently ignored.
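The filtering described in that note can be sketched as follows. The type mapping here is illustrative only and is an assumption, not Malcolm's actual table; the names `MISP_TO_ZEEK` and `to_zeek_intel` are hypothetical. The output uses the tab-separated layout of a Zeek intelligence file with `indicator`, `indicator_type`, and `meta.source` columns.

```python
from typing import Iterable

ZEEK_HEADER = "#fields\tindicator\tindicator_type\tmeta.source"

# Illustrative subset only (an assumption, not Malcolm's real mapping).
MISP_TO_ZEEK = {
    "ip-src": "Intel::ADDR",
    "ip-dst": "Intel::ADDR",
    "domain": "Intel::DOMAIN",
    "hostname": "Intel::DOMAIN",
    "url": "Intel::URL",
    "md5": "Intel::FILE_HASH",
    "sha1": "Intel::FILE_HASH",
    "sha256": "Intel::FILE_HASH",
    "filename": "Intel::FILE_NAME",
    "email-src": "Intel::EMAIL",
}


def to_zeek_intel(attributes: Iterable[dict], source: str = "MISP") -> str:
    """Render attributes as Zeek intel lines, skipping unmapped types."""
    lines = [ZEEK_HEADER]
    for attribute in attributes:
        zeek_type = MISP_TO_ZEEK.get(attribute.get("type", ""))
        value = attribute.get("value")
        if zeek_type is None or not value:
            continue  # types with no Zeek equivalent are silently ignored
        lines.append(f"{value}\t{zeek_type}\t{source}")
    return "\n".join(lines) + "\n"


print(to_zeek_intel([
    {"type": "ip-dst", "value": "80.51.7.66"},
    {"type": "yara", "value": "rule example {}"},  # no mapping, so dropped
]), end="")
```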