Adding the `st24-gsmarena`, `st24-ice_hockey`, `st24-openweather`, and `st24-owid` datasets, i.e. their inputs and outputs for the dev splits.

Tasks / Content below:
[x] Add the data for the dev splits
[x] Add the outputs for the dev splits
[x] Subclass the existing datasets with an `st24`-prefixed name so they load
[x] Describe what was necessary to add the datasets for previewing them, and add screenshots
[x] Describe how the evaluation script can be run on the shared task datasets
Not included:
[ ] Describe how the outputs were generated in quintd
How can you add a dataset if its format is already supported? (This is what we did in this PR.)
1. Add input data to `factgenie/data/DATASET_NAME`.
2. Add outputs from your model_X to `factgenie/outputs/DATASET_NAME/SPLIT_NAME/model_X.json`.
3. Subclass the dataloader: e.g. we created `factgenie/loaders/practicald2t_st24.py`, where we subclassed four dataset loader classes.
4. Register the classes in `DATASET_CLASSES` in `factgenie/loaders/__init__.py`.
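The subclass-and-register pattern in steps 3–4 is small enough to sketch. The snippet below is a self-contained illustration, not factgenie's actual code: the base class and registry here are simplified stand-ins, and the real loader classes live in `factgenie/loaders`.

```python
# Illustrative sketch of the pattern from steps 3-4.
# `OpenWeather` and `DATASET_CLASSES` below are stand-ins; check
# factgenie/loaders/practicald2t_st24.py and factgenie/loaders/__init__.py
# for the real classes and registry.

class OpenWeather:
    """Stand-in for an existing factgenie dataset loader."""

    def __init__(self, name="openweather"):
        # The name decides which factgenie/data/<name> directory is loaded.
        self.name = name


class ST24OpenWeather(OpenWeather):
    """Same loading logic, pointed at the st24 copy of the data."""

    def __init__(self):
        super().__init__(name="st24-openweather")


# Stand-in for the DATASET_CLASSES registry in factgenie/loaders/__init__.py.
DATASET_CLASSES = {
    "st24-openweather": ST24OpenWeather,
}

print(DATASET_CLASSES["st24-openweather"]().name)  # st24-openweather
```

Once a class is in the registry, the dataset shows up under its key in `factgenie list-datasets` and in the web UI.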
Assuming you have followed the installation and startup steps in the main README.md, you are ready to see the new datasets in factgenie.
How to evaluate the existing outputs?
In the section above, we loaded inputs and outputs for the `st24*` datasets. Below, you will see how to obtain the annotations by running the `factgenie run-llm-eval` command and how factgenie visualizes them. Let's look at the arguments for the command:

We will name our campaign `st24-demo-openweather-dev-llama3`. We will choose the `st24-openweather` dataset and its `dev` split. Our baseline model was `zephyr`, and we will use the llama3 config `factgenie/llm-eval/ollama-llama3.yaml`.
```
$ factgenie list-datasets  # use this command to list all registered datasets
ice_hockey
gsmarena
openweather
owid
wikidata
logicnlg
dummy
st24-ice_hockey
st24-gsmarena
st24-openweather
st24-owid
```
Alternatively, use the web browser UI to create the LLM eval campaign instead of running `factgenie run-llm-eval` from the CLI. You can start and stop the evaluation and adjust the prompt interactively in the config.
Putting it all together, we will run the `factgenie run-llm-eval` command.

To run the command, two more steps are needed:

- Set `api_url` in the config to the URL where your ollama server is running: https://github.com/kasnerz/factgenie/blob/7faf6c75ccc8e57c74a5b1b42922b431f4dacd06/factgenie/llm-eval/ollama-llama3.yaml#L3

Once the command is finished, all your examples will have annotations; you can see them in the browser. I committed the annotations, so feel free to visit https://quest.ms.mff.cuni.cz/namuddis/factgenie/browse?dataset=st24-openweather&split=dev&example_idx=0 to see them 😉
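The linked line is the `api_url` key. As a sketch of the relevant fragment, assuming a local ollama server on its default port (the exact endpoint path is an assumption; check the linked config for the authoritative value):

```yaml
# Excerpt from factgenie/llm-eval/ollama-llama3.yaml (other keys omitted).
# The URL assumes ollama's default local port 11434; adjust to your server.
api_url: http://localhost:11434/api/generate/
```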
Note: I forgot to add `factgenie/loaders/practicald2t_st24.py` in the PR, sorry for the inconvenience.
Debugging tips

Set `DEBUG` in the config if you are developing prompts or you want to monitor annotations: https://github.com/kasnerz/factgenie/blob/7faf6c75ccc8e57c74a5b1b42922b431f4dacd06/factgenie/config.yml#L15
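The `DEBUG` flag is a single key in `factgenie/config.yml` (see the link above). A sketch of the change, where the key name and value format are assumptions taken from the linked file rather than verified here:

```yaml
# factgenie/config.yml (excerpt) -- key name per the link above
debug: True   # enables verbose logging, handy when iterating on prompts
```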