muchdogesec / txt2stix

txt2stix is a Python script that is designed to identify and extract IoCs and TTPs from text files, identify the relationships between them, convert them to STIX 2.1 objects, and output as a STIX 2.1 bundle.
https://www.dogesec.com/
Apache License 2.0

Add support for LLM validation pipeline #34

Open himynamesdave opened 2 weeks ago

himynamesdave commented 2 weeks ago

We currently support OpenAI.

We also only wait for one response and use that.

We can try and tune out hallucinations, but the reality is we really need to get the AIs to "check each other's work".

We should also add support for other providers/models (e.g. Google Gemini).

The user can set the API keys in the env file, as they do for OpenAI now.

We should also remove the model from the env file, and let the user pass this as a flag, `--ai_models`.

This accepts a comma-separated list of provider:model pairs, e.g. `openai:gpt-4o, openai:gpt-4o-mini, gemini-1_5-flash` ...

User must input at least one. However, they can add as many as they want.

The more they add, the more accurate the output should become, as it will be an amalgamation of data returned from multiple models.
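
A rough sketch of how that flag could be parsed (only the flag name comes from this issue; the env variable name and helper are assumptions, not the real txt2stix CLI):

```python
# Hypothetical sketch of parsing an --ai_models flag of provider:model pairs.
import argparse
import os

def parse_ai_models(value: str) -> list[tuple[str, str]]:
    """Split 'openai:gpt-4o,openai:gpt-4o-mini,...' into (provider, model) pairs."""
    pairs = []
    for item in value.split(","):
        provider, _, model = item.strip().partition(":")
        if not model:
            raise argparse.ArgumentTypeError(f"expected provider:model, got {item!r}")
        pairs.append((provider, model))
    return pairs

parser = argparse.ArgumentParser()
parser.add_argument("--ai_models", type=parse_ai_models, required=True,
                    help="comma-separated provider:model pairs (at least one)")
args = parser.parse_args(["--ai_models", "openai:gpt-4o,openai:gpt-4o-mini"])
print(args.ai_models)  # [('openai', 'gpt-4o'), ('openai', 'gpt-4o-mini')]

# API keys would stay in the env file as today, e.g. (variable name assumed):
openai_key = os.environ.get("OPENAI_API_KEY")
```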

The pipeline should work like this

We should also consider enhancements released since we built this, such as OpenAI structured outputs:

https://openai.com/index/introducing-structured-outputs-in-the-api/
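
For example, a minimal sketch using the openai Python SDK's structured-output parse helper to force each model to return a fixed relationship schema (the schema fields and prompt below are illustrative, not the txt2stix format):

```python
from openai import OpenAI
from pydantic import BaseModel

class Relationship(BaseModel):
    source: str
    target: str
    relationship_type: str

class Relationships(BaseModel):
    relationships: list[Relationship]

report_text = open("report.txt").read()  # the input text file

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract relationships between the indicators in the text."},
        {"role": "user", "content": report_text},
    ],
    response_format=Relationships,
)
relationships = completion.choices[0].message.parsed  # a Relationships instance
```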

fqrious commented 1 day ago

txt2stix compares the relationships (JSON files) from all models; it only considers relationships found in >= 2 models (if >= 2 models are specified) or in 1 model (if only 1 model is specified).
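
A minimal sketch of that consensus rule (the JSON layout and key names here are assumptions, not the real txt2stix output):

```python
from collections import Counter

def consensus(per_model: list[list[dict]]) -> list[dict]:
    """Keep relationships reported by >= 2 models (or 1 if only 1 model ran)."""
    threshold = 2 if len(per_model) >= 2 else 1
    counts, first_seen = Counter(), {}
    for model_output in per_model:
        seen = set()
        for rel in model_output:
            key = (rel["source"], rel["relationship_type"], rel["target"])
            if key in seen:        # count each relationship once per model
                continue
            seen.add(key)
            counts[key] += 1
            first_seen.setdefault(key, rel)
    return [first_seen[k] for k, n in counts.items() if n >= threshold]
```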

What happens if 2 models return a pair but with different relationship_type?

himynamesdave commented 1 day ago

Good question.

We don't want this process to be too rigid.

Maybe we need to introduce more steps to check doubt.

E.g. where there is no consensus, go back and check with all models:

e.g.

I have another analyst arguing that <XXX> is actually a <XXX> relationship_type.

I am not sure who is correct.

Can you please review the text again, and confirm your choice.

The above could be good for extractions too, e.g. when only one model reports an extraction, you can check whether the other models have missed it.
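
A rough sketch of how that re-check prompt could be built (the wording, helper name, and example values are hypothetical):

```python
def recheck_prompt(source: str, target: str, my_type: str, other_types: list[str]) -> str:
    """Turn a disagreement between models into a follow-up prompt."""
    others = " or ".join(f"'{t}'" for t in other_types)
    return (
        f"You said the relationship between {source} and {target} is '{my_type}'. "
        f"I have another analyst arguing that it is actually {others}. "
        "I am not sure who is correct. "
        "Can you please review the text again, and confirm your choice."
    )

print(recheck_prompt("host.exe", "1.2.3.4", "communicates-with", ["downloads"]))
```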

himynamesdave commented 1 day ago

Do you have any suggestions @fqrious ?

fqrious commented 1 day ago

You know every subsequent call to the API uses an extra used_tokens + current_prompt_tokens tokens.

When you reply to ChatGPT (or any LLM), you're actually sending your entire chat history back to it.

What this means is that if there are 100 of these, it will, at the very least, use 100x as many tokens as if we just queried once.
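
A toy illustration of that growth (numbers are made up; because the full history is resent on every turn, total usage grows much faster than linearly):

```python
def total_tokens(turns: int, prompt_tokens: int, reply_tokens: int) -> int:
    total, history = 0, 0
    for _ in range(turns):
        total += history + prompt_tokens         # input = full history + new prompt
        total += reply_tokens                     # output
        history += prompt_tokens + reply_tokens   # history grows every turn
    return total

print(total_tokens(1, 2000, 500))    # 2500
print(total_tokens(100, 2000, 500))  # 12625000 -- roughly 5000x one call, not 100x
```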

himynamesdave commented 1 day ago

Oh, I didn't realise that! Maybe we cap it at a maximum of 4 calls: the first prompt for extractions (then a check), then the first prompt for relationships (then a check)?

We could also try to expose some parameters (e.g. temperature) for the user to tweak the responses.
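
A rough sketch of that capped flow, with temperature exposed as a user-tunable parameter (the llm_call helper and prompts are placeholders, not real txt2stix functions):

```python
def llm_call(prompt: str, temperature: float) -> str:
    return "..."  # placeholder: send the prompt to the selected model here

def run_pipeline(text: str, temperature: float = 0.0) -> tuple[str, str]:
    """Four LLM calls per model: extract -> check -> relate -> check."""
    extractions = llm_call(f"Extract the IoCs and TTPs from:\n{text}", temperature)                    # call 1
    extractions = llm_call(f"Review and correct these extractions:\n{extractions}", temperature)       # call 2
    relationships = llm_call(f"Describe relationships between:\n{extractions}", temperature)           # call 3
    relationships = llm_call(f"Review and correct these relationships:\n{relationships}", temperature) # call 4
    return extractions, relationships
```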

This is a nice article in the security domain that we might be able to take some learnings from:

https://medium.com/@dylanhwilliams/utilizing-generative-ai-and-llms-to-automate-detection-writing-5e4ea074072e