Closed nyck33 closed 1 year ago
Hi @nyck33!
Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can! In the meantime, feel free to add any relevant information to this issue.
Hi @nyck33 Morpheus integration with the BlueField DPU is one of several use cases for the framework. However, there are many others that are not dependent at all on that integration.
Please take a look at some of the examples in the repo: https://github.com/nv-morpheus/Morpheus/tree/branch-23.11/examples https://docs.nvidia.com/morpheus/examples/index.html
@pdmack thanks for pointing those out and I saw this: https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/examples/ransomware_detection/common/feature_extractor.py so now it's basically up to me to step through some of these examples to understand how the inputs are made.
But I have one question. Say, for example: https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/examples/ransomware_detection/models/ransomw-model-short-rf/config.pbtxt So now I know it's expecting a 297×1 input vector, which I'm guessing can be made by the create_features.py script after extract_features.py makes the JSON files found here: https://github.com/nv-morpheus/Morpheus/tree/branch-23.11/examples/data/appshield/Heur/snapshot-1 based on the telemetry it measures, right? But I still have no idea what this model is, whether there was any transfer learning, what dataset it was trained on, how it can be fine-tuned further, etc.
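For context, a minimal sketch of what a conforming input tensor would look like; the `[297, 1]` dims are my reading of the linked config.pbtxt, and nothing here is the actual Morpheus/create_features.py preprocessing:

```python
import numpy as np

# The ransomw-model-short-rf config.pbtxt appears to declare an input of
# dims [297, 1] (assumption from the linked file), i.e. 297 features per
# snapshot. A placeholder feature vector matching that shape:
features = np.zeros((297, 1), dtype=np.float32)

print(features.shape)   # (297, 1)
print(features.size)    # 297
```

In practice those 297 values would come from the feature-extraction scripts, not zeros; this only checks the shape contract.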
Are some of these models proprietary? Am I wrong in assuming they would be on Hugging Face or some other well-known public model depot?
Also with telemetry collection on Windows, I was asking ChatGPT about this: https://chat.openai.com/share/425a2879-ccfd-48f4-a359-8c39e073a256
But would doing something like below be sufficient?
ChatGPT:
Based on the contents of `data.zip`, we can certainly refine the scripts to more precisely mimic the telemetry data structure you provided, taking into account the Python files you uploaded.

Since you're requesting an Ubuntu version, the collection of some specific telemetry data like `handles` and `ldrmodules` will be different or might not be directly applicable, as those are more Windows-specific. However, we can collect processes, environment variables, network connections, and other similar telemetry data using Python libraries like `psutil` and system commands available on Ubuntu.

Let's refine the scripts more closely to your needs. Here's a refined version of the Ubuntu telemetry data collection script. This script will collect data similar to what was observed in `data.zip`:
```python
import os
import json
import time

import psutil


class TelemetryCollector:
    def __init__(self, base_dir='raw_telemetry_data'):
        self.base_dir = base_dir
        os.makedirs(self.base_dir, exist_ok=True)

    def collect_data(self):
        data = {
            # Use p.info to avoid a second process lookup that could
            # raise NoSuchProcess/AccessDenied after the snapshot.
            'cmdline': [p.info['cmdline'] for p in psutil.process_iter(attrs=['cmdline'])],
            'envars': dict(os.environ),
            'pslist': [p.info for p in psutil.process_iter(attrs=['pid', 'name'])],
            # raddr may be an empty tuple for listening sockets;
            # convert to lists so json.dump serializes them cleanly.
            'netscan': [list(c.laddr) + list(c.raddr) for c in psutil.net_connections(kind='inet')],
            # Add other required telemetry data collection methods here
        }
        return data

    def run(self, interval=60):
        while True:
            data = self.collect_data()
            timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
            for data_type, content in data.items():
                file_path = os.path.join(self.base_dir, f'{data_type}_{timestamp}.json')
                with open(file_path, 'w') as f:
                    json.dump(content, f)
            time.sleep(interval)


if __name__ == "__main__":
    collector = TelemetryCollector()
    collector.run()
```
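To spot-check what the collector wrote, a small helper like this could load the newest file for a given data type; `latest_snapshot` is a hypothetical name of mine, and the file-naming pattern is assumed from the script above:

```python
import glob
import json
import os


def latest_snapshot(data_type, base_dir='raw_telemetry_data'):
    """Load the most recently written JSON file for one data type.

    Relies on the collector's '{data_type}_{timestamp}.json' naming;
    the "%Y-%m-%d_%H-%M-%S" timestamp sorts lexicographically in
    chronological order, so sorted()[-1] is the newest file.
    """
    paths = sorted(glob.glob(os.path.join(base_dir, f'{data_type}_*.json')))
    if not paths:
        return None
    with open(paths[-1]) as f:
        return json.load(f)
```

This is just a convenience for eyeballing output between collection intervals, not part of any Morpheus API.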
This step will involve reading the collected raw telemetry data and applying the `feature_extractor.py` and `create_features.py` scripts to structure the data into the `snapshot-x` directories and telemetry JSON files.
```python
import os
import time
import glob
import json

from feature_extractor import FeatureExtractor  # Importing from the provided Python file


class FeatureExtractorDriver:
    def __init__(self, raw_data_dir='raw_telemetry_data', structured_data_dir='structured_data'):
        self.raw_data_dir = raw_data_dir
        self.structured_data_dir = structured_data_dir
        os.makedirs(self.structured_data_dir, exist_ok=True)
        self.feature_extractor = FeatureExtractor()  # Initialize FeatureExtractor

    def structure_data(self):
        # Example to process raw data files in raw_data_dir.
        # Adjust the logic as per the actual methods of feature_extractor.py
        # and create_features.py.
        for file_path in glob.glob(os.path.join(self.raw_data_dir, '*.json')):
            with open(file_path, 'r') as f:
                raw_data = json.load(f)
            # Process raw_data using feature_extractor and create_features.py.
            # Save the processed data in structured_data_dir in snapshot-x directories.
            # Implement the details based on the actual methods in the provided Python files.
            pass

    def run(self, interval=60):
        while True:
            self.structure_data()
            time.sleep(interval)


if __name__ == "__main__":
    driver = FeatureExtractorDriver()
    driver.run()
```
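Since `structure_data` is only a placeholder, here is one hedged guess at how processed output could be laid out; the directory and file names are my assumptions modeled on the repo's example `snapshot-1` data, not the actual behavior of `create_features.py`:

```python
import json
import os


def write_snapshot(features_by_type, index, base='structured_data'):
    """Write one {data_type: features} dict into base/snapshot-<index>/.

    The layout loosely mirrors examples/data/appshield/Heur/snapshot-1;
    the real per-plugin file names there differ, so treat this as a sketch.
    """
    snap_dir = os.path.join(base, f'snapshot-{index}')
    os.makedirs(snap_dir, exist_ok=True)
    for data_type, features in features_by_type.items():
        with open(os.path.join(snap_dir, f'{data_type}.json'), 'w') as f:
            json.dump(features, f)
    return snap_dir
```

A `structure_data` implementation could call this once per collection interval with an incrementing snapshot index.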
The `structure_data` method in the `FeatureExtractorDriver` class needs to be implemented according to the actual details and methods available in the `feature_extractor.py` and `create_features.py` scripts you provided. The current placeholder is just to give you an outline. Would you like any specific refinements or additions to these scripts based on the provided Python files and the specific feature extraction methods they contain?
or for Windows:
ChatGPT:
Absolutely. For a Windows environment, especially Windows 11, we can utilize Python libraries like `psutil` for general system information, and we may need additional libraries or system commands to collect Windows-specific telemetry data akin to that in the `data.zip` file.

Here's a refined version tailored for Windows:
```python
import os
import json
import time
import subprocess  # To run Windows-specific commands if needed

import psutil


class TelemetryCollector:
    def __init__(self, base_dir='raw_telemetry_data'):
        self.base_dir = base_dir
        os.makedirs(self.base_dir, exist_ok=True)

    def collect_cmdline(self):
        # Use proc.info to avoid a second lookup that could raise
        # NoSuchProcess/AccessDenied after the snapshot.
        return {proc.info['pid']: proc.info['cmdline']
                for proc in psutil.process_iter(attrs=['pid', 'cmdline'])}

    def collect_envars(self):
        # Collecting environment variables. Specific method may vary.
        return dict(os.environ)

    def collect_pslist(self):
        return [p.info for p in psutil.process_iter(attrs=['pid', 'name'])]

    def collect_netscan(self):
        # raddr may be an empty tuple for listening sockets;
        # convert to lists so json.dump serializes them cleanly.
        return [list(conn.laddr) + list(conn.raddr)
                for conn in psutil.net_connections(kind='inet')]

    # Collect additional Windows-specific data like handles, ldrmodules, etc.

    def collect_data(self):
        data = {
            'cmdline': self.collect_cmdline(),
            'envars': self.collect_envars(),
            'pslist': self.collect_pslist(),
            'netscan': self.collect_netscan(),
            # Add other required telemetry data collection methods here
        }
        return data

    def run(self, interval=60):
        while True:
            data = self.collect_data()
            timestamp = time.strftime("%Y-%m-%d_%H-%M-%S")
            for data_type, content in data.items():
                file_path = os.path.join(self.base_dir, f'{data_type}_{timestamp}.json')
                with open(file_path, 'w') as f:
                    json.dump(content, f)
            time.sleep(interval)


if __name__ == "__main__":
    collector = TelemetryCollector()
    collector.run()
```
This step remains largely similar to the Ubuntu version, with the focus on reading the raw telemetry data, applying the `feature_extractor.py` and `create_features.py` scripts, and structuring the data into the `snapshot-x` directories and telemetry JSON files.
```python
import os
import time
import glob
import json

from feature_extractor import FeatureExtractor  # Importing from the provided Python file


class FeatureExtractorDriver:
    def __init__(self, raw_data_dir='raw_telemetry_data', structured_data_dir='structured_data'):
        self.raw_data_dir = raw_data_dir
        self.structured_data_dir = structured_data_dir
        os.makedirs(self.structured_data_dir, exist_ok=True)
        self.feature_extractor = FeatureExtractor()  # Initialize FeatureExtractor

    def structure_data(self):
        # Example to process raw data files in raw_data_dir.
        # Adjust the logic as per the actual methods of feature_extractor.py
        # and create_features.py.
        for file_path in glob.glob(os.path.join(self.raw_data_dir, '*.json')):
            with open(file_path, 'r') as f:
                raw_data = json.load(f)
            # Process raw_data using feature_extractor and create_features.py.
            # Save the processed data in structured_data_dir in snapshot-x directories.
            # Implement the details based on the actual methods in the provided Python files.
            pass

    def run(self, interval=60):
        while True:
            self.structure_data()
            time.sleep(interval)


if __name__ == "__main__":
    driver = FeatureExtractorDriver()
    driver.run()
```
The `structure_data` method in the `FeatureExtractorDriver` class needs to be implemented according to the actual details and methods available in the `feature_extractor.py` and `create_features.py` scripts you provided. `handles` and `ldrmodules` might require additional Windows-specific implementations or tools, which need to be incorporated based on your exact requirements. Please review and modify these scripts according to your exact needs, especially focusing on the collection of Windows-specific telemetry data and the details of feature extraction and structuring based on the provided Python files. If you need assistance with a specific part of these scripts or have additional requirements, feel free to ask!
https://github.com/aquasecurity/appshield AppShield is mentioned in the documentation somewhere, but I noticed that it's an archived repo, so it's not maintained, right? That makes NVIDIA Morpheus seem ever-so-slightly outdated or poorly maintained, since it references an archived repo. Can @pdmack or someone else comment?
Hey @nyck33 - the models and examples here are just that: example use-cases of the Morpheus SDK. They're not meant to be the final models or workflow that you implement, more of a starter for you to see how you may choose to make Morpheus fit within your system.
On the AppShield repo, that's not the AppShield referenced in our workflow, it just shares the same name. Per our notebook we specifically note that App Shield means DOCA App Shield.
I think this issue will be better served as a Discussion thread so I'm going to convert it - if you have a specific documentation change you'd like to see, please file an issue we can action on, otherwise we can continue to discuss here.
How would you describe the priority of this documentation request
Medium
Describe the future/missing documentation
https://docs.nvidia.com/doca/sdk/app-shield-programming-guide/index.html#abstract
The whole time I believed it would be possible, but it's not? Or it is, but I'd have to figure out the telemetry part on my own? I mean, if I want to use Morpheus but DPUs are not available on GCP, Azure, or AWS, or only available in limited quantities, then I'm in trouble, right?
Can you clarify on this point?
Where have you looked?
https://github.com/nyck33/Morpheus_nkim_fork/blob/branch-23.11/docs/source/models_and_datasets.md
https://docs.nvidia.com/doca/sdk/app-shield-programming-guide/index.html#abstract
https://developer.nvidia.com/networking/doca
https://docs.nvidia.com/doca/sdk/developer-qsg/index.html