nv-morpheus / Morpheus


[DOC]: Can I use Morpheus without BlueField, DPU, DOCA, etc? #1210

Closed: nyck33 closed this issue 1 year ago

nyck33 commented 1 year ago

How would you describe the priority of this documentation request

Medium

Describe the future/missing documentation

https://docs.nvidia.com/doca/sdk/app-shield-programming-guide/index.html#abstract

The whole time I believed this would be possible, but is it not? Or is it possible only if I figure out the telemetry part on my own? If I want to use Morpheus but DPUs aren't available on GCP, Azure, or AWS (or are only available in limited quantities), then I'm in trouble, right?

Can you clarify on this point?

Where have you looked?

https://github.com/nyck33/Morpheus_nkim_fork/blob/branch-23.11/docs/source/models_and_datasets.md

https://docs.nvidia.com/doca/sdk/app-shield-programming-guide/index.html#abstract

https://developer.nvidia.com/networking/doca

https://docs.nvidia.com/doca/sdk/developer-qsg/index.html


jarmak-nv commented 1 year ago

Hi @nyck33!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.

pdmack commented 1 year ago

Hi @nyck33! Morpheus integration with the BlueField DPU is one of several use cases for the framework. However, there are many others that do not depend on that integration at all.

Please take a look at some of the examples in the repo: https://github.com/nv-morpheus/Morpheus/tree/branch-23.11/examples https://docs.nvidia.com/morpheus/examples/index.html
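
Most of those examples follow a plain file-in/file-out pipeline with no DPU anywhere in the loop. As a rough illustration (not the canonical setup), here is a minimal sketch of that pattern assuming the Python API of the 23.x releases; stage import paths, signatures, and the input filename are assumptions that may differ between versions:

from morpheus.config import Config
from morpheus.pipeline import LinearPipeline
from morpheus.stages.input.file_source_stage import FileSourceStage
from morpheus.stages.preprocess.deserialize_stage import DeserializeStage
from morpheus.stages.general.monitor_stage import MonitorStage

config = Config()

pipeline = LinearPipeline(config)
# Read records from a local JSON-lines file (placeholder path) --
# no DPU, DOCA, or App Shield involved
pipeline.set_source(FileSourceStage(config, filename="input.jsonlines"))
# Convert raw rows into Morpheus messages for downstream stages
pipeline.add_stage(DeserializeStage(config))
# Report throughput so the pipeline's progress is visible
pipeline.add_stage(MonitorStage(config, description="Messages"))
pipeline.run()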

nyck33 commented 1 year ago

@pdmack thanks for pointing those out. I saw this: https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/examples/ransomware_detection/common/feature_extractor.py so now it's basically up to me to step through some of these examples to understand how the inputs are made.

But I have one question. Take, for example: https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/examples/ransomware_detection/models/ransomw-model-short-rf/config.pbtxt so the model expects a 297x1 input vector, which I'm guessing is produced by the create_features.py script after extract_features.py generates the JSON files found here: https://github.com/nv-morpheus/Morpheus/tree/branch-23.11/examples/data/appshield/Heur/snapshot-1 based on the telemetry it measures, right? But I still have no idea what this model is, whether there was any transfer learning, what dataset it was trained on, how it can be fine-tuned further, etc.
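
For reference, the shape contract implied by that config can be exercised directly against a Triton server. The sketch below is hypothetical: the tensor names input__0/output__0 follow the Triton FIL backend convention, and the (1, 297) batch shape is an assumption, so verify both against the actual config.pbtxt:

import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder for a real feature vector produced by create_features.py
# (assumption: 297 float32 features per snapshot, batched on the first axis)
features = np.zeros((1, 297), dtype=np.float32)

client = grpcclient.InferenceServerClient(url="localhost:8001")
infer_input = grpcclient.InferInput("input__0", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)

result = client.infer(model_name="ransomw-model-short-rf", inputs=[infer_input])
print(result.as_numpy("output__0"))  # per-snapshot model output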

nyck33 commented 1 year ago

Are some of these models proprietary? Am I wrong in assuming they would be on Hugging Face or some other well-known public model depot?

nyck33 commented 1 year ago

Also, regarding telemetry collection on Windows, I was asking ChatGPT about this: https://chat.openai.com/share/425a2879-ccfd-48f4-a359-8c39e073a256

But would doing something like below be sufficient?

ChatGPT: Based on the contents of data.zip, we can certainly refine the scripts to mimic the telemetry data structure you provided, taking into account the Python files you uploaded.

Since you’re requesting an Ubuntu version, the collection of some specific telemetry data like handles and ldrmodules will be different or might not be directly applicable as those are more Windows-specific. However, we can collect process, environment variables, network connections, and other similar telemetry data using Python libraries like psutil and system commands available in Ubuntu.

Let’s refine the scripts more closely to your needs:

Step 1: Telemetry Data Collection

Here’s a refined version of the Ubuntu Telemetry Data Collection script. This script will collect data similar to what was observed in data.zip:

import os
import json
import time
import psutil

class TelemetryCollector:
    def __init__(self, base_dir='raw_telemetry_data'):
        self.base_dir = base_dir
        os.makedirs(self.base_dir, exist_ok=True)

    def collect_data(self):
        data = {
            # process_iter(attrs=...) pre-fetches the requested fields into
            # p.info, avoiding a second (possibly AccessDenied) call per process
            'cmdline': [p.info['cmdline'] for p in psutil.process_iter(attrs=['cmdline'])],
            'envars': dict(os.environ),
            'pslist': [p.info for p in psutil.process_iter(attrs=['pid', 'name'])],
            # raddr is an empty tuple for unconnected sockets, so this still
            # yields a JSON-serializable list of address parts
            'netscan': [list(c.laddr) + list(c.raddr) for c in psutil.net_connections(kind='inet')],
            # Add other required telemetry data collection methods here
        }
        return data

    def run(self, interval=60):
        while True:
            data = self.collect_data()
            timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
            for data_type, content in data.items():
                file_path = os.path.join(self.base_dir, f'{data_type}_{timestamp}.json')
                with open(file_path, 'w') as f:
                    json.dump(content, f)
            time.sleep(interval)

if __name__ == "__main__":
    collector = TelemetryCollector()
    collector.run()

Step 2: Feature Extraction and Structuring

This step will involve reading the collected raw telemetry data and applying the feature_extractor.py and create_features.py scripts to structure the data into the snapshot-x directories and telemetry JSON files.

import os
import json  # needed for json.load in structure_data below
import time
import glob
from feature_extractor import FeatureExtractor  # Importing from the provided Python file

class FeatureExtractorDriver:
    def __init__(self, raw_data_dir='raw_telemetry_data', structured_data_dir='structured_data'):
        self.raw_data_dir = raw_data_dir
        self.structured_data_dir = structured_data_dir
        os.makedirs(self.structured_data_dir, exist_ok=True)
        self.feature_extractor = FeatureExtractor()  # Initialize FeatureExtractor

    def structure_data(self):
        # Example to process raw data files in raw_data_dir
        # Adjust the logic as per the actual methods of feature_extractor.py and create_features.py
        for file_path in glob.glob(os.path.join(self.raw_data_dir, '*.json')):
            with open(file_path, 'r') as f:
                raw_data = json.load(f)
            # Process raw_data with feature_extractor and create_features.py,
            # then save the results into snapshot-x directories under
            # structured_data_dir; implement the details based on the actual
            # methods in the provided Python files.

    def run(self, interval=60):
        while True:
            self.structure_data()
            time.sleep(interval)

if __name__ == "__main__":
    driver = FeatureExtractorDriver()
    driver.run()


Would you like any specific refinements or additions to these scripts based on the provided Python files and the specific feature extraction methods they contain?

nyck33 commented 1 year ago

or for Windows:

ChatGPT: Absolutely. For a Windows environment, especially Windows 11, we can utilize Python libraries like psutil for general system information, and we may need additional libraries or system commands to collect Windows-specific telemetry data akin to what's in the data.zip file.

Here’s a refined version tailored for Windows:

Step 1: Telemetry Data Collection

import os
import json
import time
import psutil
import subprocess  # To run Windows-specific commands if needed

class TelemetryCollector:
    def __init__(self, base_dir='raw_telemetry_data'):
        self.base_dir = base_dir
        os.makedirs(self.base_dir, exist_ok=True)

    def collect_cmdline(self):
        # p.info holds the fields pre-fetched by process_iter, avoiding a
        # second (possibly AccessDenied) call per process
        return {p.info['pid']: p.info['cmdline'] for p in psutil.process_iter(attrs=['pid', 'cmdline'])}

    def collect_envars(self):
        # Collecting environment variables. Specific method may vary.
        return dict(os.environ)

    def collect_pslist(self):
        return [p.info for p in psutil.process_iter(attrs=['pid', 'name'])]

    def collect_netscan(self):
        # raddr is an empty tuple for unconnected sockets; converting to lists
        # keeps the result JSON-serializable
        return [list(conn.laddr) + list(conn.raddr) for conn in psutil.net_connections(kind='inet')]

    # Collect additional Windows-specific data like handles, ldrmodules, etc.
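
    def collect_handles(self):
        # Hypothetical sketch, not part of the original suggestion: psutil's
        # Windows-only num_handles() gives a per-process handle count, a rough
        # stand-in for the much richer "handles" plugin output from Volatility.
        handles = {}
        for proc in psutil.process_iter(attrs=['pid']):
            try:
                handles[proc.info['pid']] = proc.num_handles()
            except (psutil.AccessDenied, psutil.NoSuchProcess):
                continue  # skip processes we are not allowed to inspect
        return handles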

    def collect_data(self):
        data = {
            'cmdline': self.collect_cmdline(),
            'envars': self.collect_envars(),
            'pslist': self.collect_pslist(),
            'netscan': self.collect_netscan(),
            # Add other required telemetry data collection methods here
        }
        return data

    def run(self, interval=60):
        while True:
            data = self.collect_data()
            timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
            for data_type, content in data.items():
                file_path = os.path.join(self.base_dir, f'{data_type}_{timestamp}.json')
                with open(file_path, 'w') as f:
                    json.dump(content, f)
            time.sleep(interval)

if __name__ == "__main__":
    collector = TelemetryCollector()
    collector.run()

Step 2: Feature Extraction and Structuring

This step remains the same as in the Ubuntu version: reading the raw telemetry data, applying the feature_extractor.py and create_features.py scripts, and structuring the data into the snapshot-x directories and telemetry JSON files. The FeatureExtractorDriver script from Step 2 above can be reused unchanged.


Note:

Please review and modify these scripts according to your exact needs, especially focusing on the collection of Windows-specific telemetry data and the details of feature extraction and structuring based on the provided Python files. If you need assistance with a specific part of these scripts or have additional requirements, feel free to ask!

nyck33 commented 1 year ago

https://github.com/aquasecurity/appshield AppShield is mentioned somewhere in the documentation, but I noticed that repo is archived, so it's not maintained, right? That makes NVIDIA Morpheus seem ever-so-slightly outdated or poorly maintained if it references an archived repo. Can @pdmack or someone else comment?

jarmak-nv commented 1 year ago

Hey @nyck33 - the models and examples here are just that: example use cases of the Morpheus SDK. They're not meant to be the final models or workflows that you implement; they're more of a starting point showing how you might make Morpheus fit within your system.

On the AppShield repo: that's not the AppShield referenced in our workflow; it just shares the same name. Per our notebook, we specifically note that App Shield means DOCA App Shield.

I think this issue will be better served as a Discussion thread, so I'm going to convert it. If you have a specific documentation change you'd like to see, please file an issue we can act on; otherwise, we can continue to discuss here.