chains/benchmarks, other LLMs

(closing #28 in favor of this)

At minimum before we merge this branch:

[x] generalize the llm.py op to route to different specific LLM sources
[x] generic bigbench binary classification script
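
For the routing item above, here is a minimal sketch of dispatching on an LLM ID. It is not the code in llm.py: the `provider:model` ID format, the `register_backend` decorator, and the `echo` backend are all hypothetical, chosen only to show the shape of the idea.

```python
from typing import Callable, Dict

# Hypothetical registry: provider prefix -> function(model, prompt) -> completion text.
_BACKENDS: Dict[str, Callable[[str, str], str]] = {}


def register_backend(prefix: str):
    """Decorator that registers a completion function under a provider prefix."""
    def decorator(fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        _BACKENDS[prefix] = fn
        return fn
    return decorator


@register_backend("echo")
def _echo_backend(model: str, prompt: str) -> str:
    # Stand-in backend that returns the prompt unchanged.
    # A real backend would call a provider API here.
    return prompt


def complete(llm_id: str, prompt: str) -> str:
    """Route a prompt to the backend named by the part of llm_id before ':'."""
    prefix, sep, model = llm_id.partition(":")
    if not sep or prefix not in _BACKENDS:
        raise ValueError(f"unknown LLM source in {llm_id!r}")
    return _BACKENDS[prefix](model, prompt)
```

With this shape, adding a new LLM source means registering one more function, and callers only ever pass an ID string.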

For the binary classification script:

[x] accepts LLM ID and task name as input
[x] calculates at least one accuracy metric
[x] puts results somewhere useful (JSON file?)
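
The three items above can be sketched roughly as follows. This is an illustration under assumptions, not the script on this branch: the example format (a list of `{"input", "target"}` dicts), the `ask` callback, the argument names, and the `results.json` default are all hypothetical.

```python
import argparse
import json
from typing import Callable, Dict, Iterable, List, Tuple


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, expected) pairs that match, ignoring case and outer whitespace."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(p.strip().lower() == e.strip().lower() for p, e in pairs)
    return hits / len(pairs)


def run_task(llm_id: str, task_name: str, examples: List[Dict[str, str]],
             ask: Callable[[str, str], str]) -> Dict[str, object]:
    """Query the model once per example and summarize the results."""
    pairs = [(ask(llm_id, ex["input"]), ex["target"]) for ex in examples]
    return {"llm": llm_id, "task": task_name, "n": len(pairs),
            "accuracy": accuracy(pairs)}


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="binary classification benchmark (sketch)")
    parser.add_argument("llm_id")
    parser.add_argument("task_name")
    parser.add_argument("--examples", required=True,
                        help="path to a JSON list of {input, target} objects (assumed format)")
    parser.add_argument("--out", default="results.json")
    args = parser.parse_args(argv)

    with open(args.examples) as f:
        examples = json.load(f)

    # Placeholder model call that always answers "yes"; swap in the real LLM op.
    result = run_task(args.llm_id, args.task_name, examples,
                      ask=lambda llm, prompt: "yes")

    with open(args.out, "w") as f:
        json.dump(result, f, indent=2)
```

Calling `main(["some-llm", "some-task", "--examples", "examples.json"])` would write a small JSON summary containing the model ID, task name, example count, and accuracy.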

These are nice-to-haves; they would let us use the bigbench tasks more as intended:

[ ] (blocked) need to be able to extract multiple key/value pairs from JSON
[ ] (blocked) use "preferred_metric" from the task.json spec to decide metric
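
Once unblocked, metric selection might look something like the sketch below. The `METRICS` table and `pick_metric` helper are hypothetical. The key name `preferred_metric` follows the wording above; my recollection is that BIG-bench JSON tasks name this field `preferred_score`, so the sketch checks both rather than assuming either.

```python
from typing import Callable, Dict, Sequence


def exact_str_match(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that equal their target exactly."""
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical table of the metrics this script knows how to compute.
METRICS: Dict[str, Callable[[Sequence[str], Sequence[str]], float]] = {
    "exact_str_match": exact_str_match,
}


def pick_metric(task_spec: dict, default: str = "exact_str_match") -> str:
    """Return the task's preferred metric name, or the default if it is missing or unsupported."""
    # Assumption: the key may be spelled either way; check both.
    name = task_spec.get("preferred_metric") or task_spec.get("preferred_score")
    return name if name in METRICS else default
```

Falling back to a default keeps the script usable on tasks whose preferred metric has not been implemented yet.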

If #27 is resolved, we could compute multiple accuracy metrics. We could also explore adding another script that runs the entire slate of tasks on a set of models.

saulpw/aipl#30