Open moodymudskipper opened 1 month ago
expect_correct(conversation, statement)
expect_synonymous(conversation, answer)
expect_synonymous_snapshot(x)
These would use LLMs for the assessment. That might be slow, but at least we'd have tests. They could be run only on CI. It seems we can give API keys to GitHub, which stores them as secrets and makes them available to GitHub Actions, so it might "just work".
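
For what it's worth, here is a rough sketch of what the CI side could look like. It assumes this is an R package checked with the r-lib actions. The workflow name and the secret name `LLM_API_KEY` are made up for illustration and would need to match whatever the LLM client actually reads.

```yaml
# Sketch only: LLM_API_KEY is a hypothetical secret name, added under
# the repository's settings for Actions secrets.
name: llm-tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    env:
      # Exposed to the R session as an environment variable.
      LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
          needs: check
      - uses: r-lib/actions/check-r-package@v2
```

GitHub Actions sets the `CI` environment variable to `true`, so the expectations could skip themselves whenever it is absent, for example with `testthat::skip_if_not()`. One caveat: secrets are not passed to workflows triggered by pull requests from forks, so the LLM-based tests would have to skip there as well.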