google / turbinia

Automation and Scaling of Digital Forensics Tools
Apache License 2.0

New Turbinia LLM analyzer, LLM lib interface and LLM lib implementation for VertexAI #1441

Closed: sa3eed3ed closed this 6 months ago

sa3eed3ed commented 6 months ago

New Turbinia LLM analyzer, LLM lib interface and LLM lib implementation for VertexAI

please assign to @hacktobeer for review, he is aware of this work
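The PR's split into an analyzer task, a provider-agnostic LLM lib interface, and a VertexAI implementation can be sketched roughly as below. All names here are illustrative stand-ins, not the actual Turbinia API:

```python
from abc import ABC, abstractmethod


class LLMClient(ABC):
  """Hypothetical provider-agnostic LLM interface (illustrative only)."""

  @abstractmethod
  def prompt(self, text: str) -> str:
    """Sends a prompt to the backing model and returns its response."""


class EchoClient(LLMClient):
  """Stand-in implementation; a real VertexAI client would call the API here."""

  def prompt(self, text: str) -> str:
    return f'analysis of: {text}'


def analyze_artifact(client: LLMClient, artifact_text: str) -> str:
  """An analyzer task hands extracted artifact text to whichever client is
  configured, keeping the task code independent of the LLM provider."""
  return client.prompt(artifact_text)


print(analyze_artifact(EchoClient(), 'sshd_config'))  # → analysis of: sshd_config
```

Keeping the interface separate from the VertexAI implementation is what lets other backends be swapped in later without touching the analyzer task.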

hacktobeer commented 6 months ago

Excellent @sa3eed3ed - I have assigned myself and will review before EOW.

jleaniz commented 6 months ago

Drive-by comment: could we specify a minimum version of the new dependencies in pyproject.toml with "^x.y.z" instead of "*"? That way we are less likely to run into dependency breakages down the line. There's also an open PR that will remove most GCP library dependencies from the Turbinia code base. From what I can tell, the vertexAI package only depends on google-api-core which would be kept anyway so it's not a problem.
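For example (the package name and version below are illustrative, not the ones pinned in the PR), the caret constraint lets Poetry pick any compatible release instead of literally anything:

```toml
[tool.poetry.dependencies]
# "*" accepts any version at all, so a future breaking release can slip in:
# google-cloud-aiplatform = "*"
# "^1.38.0" means >=1.38.0,<2.0.0 -- new minor/patch releases allowed,
# breaking major versions excluded:
google-cloud-aiplatform = "^1.38.0"
```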

sa3eed3ed commented 6 months ago

> Drive-by comment: could we specify a minimum version of the new dependencies in pyproject.toml with "^x.y.z" instead of "*"? That way we are less likely to run into dependency breakages down the line. There's also an open PR that will remove most GCP library dependencies from the Turbinia code base. From what I can tell, the vertexAI package only depends on google-api-core which would be kept anyway so it's not a problem.

Done, added a version. I thought that even if google-api-core is removed from pyproject.toml, the poetry.lock file would still have all the deps needed by the vertexAI package.

jleaniz commented 6 months ago

> Drive-by comment: could we specify a minimum version of the new dependencies in pyproject.toml with "^x.y.z" instead of "*"? That way we are less likely to run into dependency breakages down the line. There's also an open PR that will remove most GCP library dependencies from the Turbinia code base. From what I can tell, the vertexAI package only depends on google-api-core which would be kept anyway so it's not a problem.
>
> Done, added a version. I thought that even if google-api-core is removed from pyproject.toml, the poetry.lock file would still have all the deps needed by the vertexAI package.

Yes, it will have the dependencies. My point was just to add a version; nothing else is needed. :) The core lib is also included in the libcloudforensics dependencies, which are already in the toml file.

hacktobeer commented 6 months ago

Thanks @sa3eed3ed. I have reviewed and tested the PR; it looks pretty cool, and I'm looking forward to getting more real-life results! I have no other review comments. Example output for others following along:

* LLMAnalyzerTask (/evidence/002ef2465f6b46c1a63d2ad93c783a02/1708894370-9c15b072c9bd49f8b5e13fd04b4fbcad-FileArtifactExtractionTask/export/etc/redis/redis.conf): **Summary:** Redis configuration file contains default bind address of "0.0.0.0", allowing remote clients to connect without authentication.

* LLMAnalyzerTask (/evidence/002ef2465f6b46c1a63d2ad93c783a02/1708894292-8b20bc2016f14f16b5d5bbd8ee39b278-FileArtifactExtractionTask/export/home/dummyuser/.jupyter/jupyter_notebook_config.py): **Summary:** Jupyter Notebook server is exposed to the internet with weak security settings, allowing unauthorized access, remote code execution, and potential compromise of sensitive data.

* LLMAnalyzerTask (/evidence/002ef2465f6b46c1a63d2ad93c783a02/1708894416-d4f26a75a3124996bf90723c51c501a3-FileArtifactExtractionTask/export/etc/ssh/sshd_config): **SSH configuration allows weak ciphers, root login, password authentication, and empty passwords, posing a high security risk.**

hacktobeer commented 6 months ago

@aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?

hacktobeer commented 6 months ago

For future ideas regarding this analyser:

aarontp commented 6 months ago

> @aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?

Do we have any data about how long it takes to run on a typical input disk? Assuming it doesn't take too long to run, I would generally say it makes sense to include it anywhere we include the other analysis tasks. At the moment those are not in the triage recipes as defined by the triage-* recipes here: https://github.com/google/turbinia/tree/master/turbinia/config/recipes, but we do have them in the disk-related dftimewolf recipes, so we could include it in the Turbinia recipes used by those. (I can't remember if those disk-related dftimewolf recipes are currently just using the default recipe, or if there is a dedicated recipe, but we do have a goal of making every dftimewolf recipe use a corresponding Turbinia recipe this year.)

hacktobeer commented 6 months ago

> @aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?
>
> Do we have any data about how long it takes to run on a typical input disk? Assuming it doesn't take too long to run, generally I would say it makes sense to include it anywhere we are including the other analysis tasks, which at the moment are not in the triage recipes as defined by the triage-* recipes ....

It's fast, faster than Plaso. FileExtraction is fast because the artifact definitions are pretty specific, and the VertexAI calls are fast as well. It will finish before the Plaso task that runs in parallel.

sa3eed3ed commented 6 months ago

> @aarontp - before I merge this can I get your opinion on the inclusion of this analyser in all triage recipes?
>
> Do we have any data about how long it takes to run on a typical input disk? Assuming it doesn't take too long to run, generally I would say it makes sense to include it anywhere we are including the other analysis tasks, which at the moment are not in the triage recipes as defined by the triage-* recipes ....
>
> It's fast, faster than Plaso. FileExtraction is fast because the artifact definitions are pretty specific, and the VertexAI calls are fast as well. It will finish before the Plaso task that runs in parallel.

Removed from the triage recipes.

hacktobeer commented 6 months ago

Ran local tests and it looks good. One final nit: can you add the below to the configuration template (turbinia/config/turbinia_config_tmpl.py)?

```python
}, {
    'job': 'LLMAnalysisJob',
    'programs': [],
    'docker_image': None,
    'timeout': 600
}, {
    'job': 'LLMArtifactsExtractionJob',
    'programs': [],
    'docker_image': None,
    'timeout': 600
}
```

After that I'll do a final check if the e2e tests run fine and will approve/merge

sa3eed3ed commented 6 months ago

> Ran local tests and it looks good. One final nit: can you add the below to the configuration template (turbinia/config/turbinia_config_tmpl.py)?
>
> ```python
> }, {
>     'job': 'LLMAnalysisJob',
>     'programs': [],
>     'docker_image': None,
>     'timeout': 600
> }, {
>     'job': 'LLMArtifactsExtractionJob',
>     'programs': [],
>     'docker_image': None,
>     'timeout': 600
> }
> ```
>
> After that I'll do a final check if the e2e tests run fine and will approve/merge

Done. I made the timeout 3600 to match the default: https://github.com/google/turbinia/blob/d8c7377e53bff0d88512a660f3dcbdde52b3cb71/turbinia/job_utils.py#L34. I don't expect it to take an hour, but there seem to be many other jobs with longer timeouts; if you think this might be problematic, feel free to amend.

hacktobeer commented 6 months ago

Local e2e tests (with the API key added) ran fine. I am going to approve and merge; we can tune based on real-world usage. @sa3eed3ed Thank you very much for this awesome contribution. I am looking forward to tuning this based on the results!