muchdogesec / txt2stix

txt2stix is a Python script that is designed to identify and extract IoCs and TTPs from text files, identify the relationships between them, convert them to STIX 2.1 objects, and output as a STIX 2.1 bundle.
https://www.dogesec.com/
Apache License 2.0

Add support for LLM validation pipeline #34

Open himynamesdave opened 2 weeks ago

himynamesdave commented 2 weeks ago

We currently support OpenAI.

We also only wait for one response and use that.

We can try and tune out hallucinations, but the reality is we really need to get the AIs to "check each other's work".

We should also add support for other providers/models (e.g. Google Gemini).

The user can set the API keys in the env file, as they do for OpenAI now.

We should also remove the model from the env file, and let the user pass this as a flag, `--ai_models`.

This accepts a comma-separated list of provider:model pairs, e.g. `openai:gpt-4o, openai:gpt-4o-mini, gemini-1_5-flash` ...

User must input at least one. However, they can add as many as they want.

The more they add, the more accurate the output should become, as it will be an amalgamation of data returned from multiple models.
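
A rough sketch of how that flag could be parsed (only the flag name comes from this issue; the env variable name and helper are assumptions, not the real txt2stix CLI):

```python
# Hypothetical sketch of parsing an --ai_models flag of provider:model pairs.
import argparse
import os

def parse_ai_models(value: str) -> list[tuple[str, str]]:
    """Split 'openai:gpt-4o,openai:gpt-4o-mini,...' into (provider, model) pairs."""
    pairs = []
    for item in value.split(","):
        provider, _, model = item.strip().partition(":")
        if not model:
            raise argparse.ArgumentTypeError(f"expected provider:model, got {item!r}")
        pairs.append((provider, model))
    return pairs

parser = argparse.ArgumentParser()
parser.add_argument("--ai_models", type=parse_ai_models, required=True,
                    help="comma-separated provider:model pairs (at least one)")
args = parser.parse_args(["--ai_models", "openai:gpt-4o,openai:gpt-4o-mini"])
print(args.ai_models)  # [('openai', 'gpt-4o'), ('openai', 'gpt-4o-mini')]

# API keys would stay in the env file as today, e.g. (variable name assumed):
openai_key = os.environ.get("OPENAI_API_KEY")
```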

The pipeline should work like this

We should also consider enhancements released since we built this, such as OpenAI structured outputs:

https://openai.com/index/introducing-structured-outputs-in-the-api/
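
For example, a minimal sketch using the openai Python SDK's structured-output parse helper to force each model to return a fixed relationship schema (the schema fields and prompt below are illustrative, not the txt2stix format):

```python
from openai import OpenAI
from pydantic import BaseModel

class Relationship(BaseModel):
    source: str
    target: str
    relationship_type: str

class Relationships(BaseModel):
    relationships: list[Relationship]

report_text = open("report.txt").read()  # the input text file

client = OpenAI()  # reads OPENAI_API_KEY from the environment
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract relationships between the indicators in the text."},
        {"role": "user", "content": report_text},
    ],
    response_format=Relationships,
)
relationships = completion.choices[0].message.parsed  # a Relationships instance
```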

fqrious commented 1 day ago

txt2stix compares the relationships (JSON files) from all models; it only considers relationships found in >= 2 models (if >= 2 models are specified) or in 1 model (if only 1 model is specified).
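
A minimal sketch of that consensus rule (the JSON layout and key names here are assumptions, not the real txt2stix output):

```python
from collections import Counter

def consensus(per_model: list[list[dict]]) -> list[dict]:
    """Keep relationships reported by >= 2 models (or 1 if only 1 model ran)."""
    threshold = 2 if len(per_model) >= 2 else 1
    counts, first_seen = Counter(), {}
    for model_output in per_model:
        seen = set()
        for rel in model_output:
            key = (rel["source"], rel["relationship_type"], rel["target"])
            if key in seen:        # count each relationship once per model
                continue
            seen.add(key)
            counts[key] += 1
            first_seen.setdefault(key, rel)
    return [first_seen[k] for k, n in counts.items() if n >= threshold]
```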

What happens if 2 models return a pair but with different relationship_type?

himynamesdave commented 1 day ago

Good question.

We don't want this process to be too rigid.

Maybe we need to introduce more steps to check doubt.

E.g. where there is no consensus, go back and check with all models:

e.g.

I have another analyst arguing that <XXX> is actually a <XXX> relationship_type.

I am not sure who is correct.

Can you please review the text again, and confirm your choice.

The above could be good for extractions too, e.g. when only one model reports an extraction, you can check whether the other models have missed it.
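
A rough sketch of how that re-check prompt could be built (the wording, helper name, and example values are hypothetical):

```python
def recheck_prompt(source: str, target: str, my_type: str, other_types: list[str]) -> str:
    """Turn a disagreement between models into a follow-up prompt."""
    others = " or ".join(f"'{t}'" for t in other_types)
    return (
        f"You said the relationship between {source} and {target} is '{my_type}'. "
        f"I have another analyst arguing that it is actually {others}. "
        "I am not sure who is correct. "
        "Can you please review the text again, and confirm your choice."
    )

print(recheck_prompt("host.exe", "1.2.3.4", "communicates-with", ["downloads"]))
```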

himynamesdave commented 1 day ago

Do you have any suggestions @fqrious ?

fqrious commented 1 day ago

You know every subsequent call to the API uses an extra used_tokens + current_prompt_tokens tokens.

When you reply to ChatGPT (or any LLM), you're actually sending your entire chat history back to it.

What this means is that if there are 100 of these, it will, at the very least, use 100x as many tokens as if we just queried once.
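
A toy illustration of that growth (numbers are made up; because the full history is resent on every turn, total usage grows much faster than linearly):

```python
def total_tokens(turns: int, prompt_tokens: int, reply_tokens: int) -> int:
    total, history = 0, 0
    for _ in range(turns):
        total += history + prompt_tokens         # input = full history + new prompt
        total += reply_tokens                     # output
        history += prompt_tokens + reply_tokens   # history grows every turn
    return total

print(total_tokens(1, 2000, 500))    # 2500
print(total_tokens(100, 2000, 500))  # 12625000 -- roughly 5000x one call, not 100x
```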

himynamesdave commented 1 day ago

Oh, I didn't realise that! Maybe we cap it at a maximum of 4 calls: the first prompt for extractions (then a check), then the first prompt for relationships (then a check)?

We could also try to expose some parameters (e.g. temperature) for the user to tweak the responses.
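
A rough sketch of that capped flow, with temperature exposed as a user-tunable parameter (the llm_call helper and prompts are placeholders, not real txt2stix functions):

```python
def llm_call(prompt: str, temperature: float) -> str:
    return "..."  # placeholder: send the prompt to the selected model here

def run_pipeline(text: str, temperature: float = 0.0) -> tuple[str, str]:
    """Four LLM calls per model: extract -> check -> relate -> check."""
    extractions = llm_call(f"Extract the IoCs and TTPs from:\n{text}", temperature)                    # call 1
    extractions = llm_call(f"Review and correct these extractions:\n{extractions}", temperature)       # call 2
    relationships = llm_call(f"Describe relationships between:\n{extractions}", temperature)           # call 3
    relationships = llm_call(f"Review and correct these relationships:\n{relationships}", temperature) # call 4
    return extractions, relationships
```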

This is a nice article in the security domain that we might be able to take some learnings from:

https://medium.com/@dylanhwilliams/utilizing-generative-ai-and-llms-to-automate-detection-writing-5e4ea074072e