stanford-futuredata / ARES

https://ares-ai.vercel.app/
Apache License 2.0

Documentation and code are so broken! #23

Closed elsatch closed 2 months ago

elsatch commented 2 months ago

Hi,

I have tried to reproduce the paper, or more specifically, to follow the step-by-step instructions, and unfortunately nothing works.

These are the issues I've detected so far in the Python script version:

1. The current requirements.txt can't be installed as instructed by the README.md file because of conflicting library versions.
2. The sample document_filepath.tsv file in example_files has only 6 examples and a single "Documents" column.
3. The synthetic generation example code fails because the number of documents available to sample is less than the given --documents_sampled 10000 (see the sketch below).
4. If you change documents_sampled to 5 so that it doesn't fail there, it fails later anyway, because the step that generates the negative alternatives requires at least 100 samples.

So with the given documents in the example_files folder, it's impossible to generate a synthetic dataset.
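
As a rough illustration of points 3 and 4 above, here is a minimal Python sketch (my own check, not ARES code) that compares the rows actually available in the example file with the sample size requested in the README command:

    import pandas as pd

    # Illustrative check, not ARES's own validation logic: compare how many
    # documents the example file actually provides against the sample size
    # requested on the command line.
    documents = pd.read_csv("example_files/document_filepath.tsv", sep="\t")

    requested = 10000           # value passed as --documents_sampled in the README example
    available = len(documents)  # only 6 rows in the shipped example file

    if available < requested:
        print(f"Only {available} documents available, cannot sample {requested}.")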

Following the new Vercel documentation at https://ares-ai.vercel.app/synth_gen/ is absolute hit-and-miss because of the copy-pasted regions, for example on that very page.

But to make things even worse, the Python code in the ares-ai library is different from the Python scripts, so if you try to run the code using example_files/document_filepath.tsv it fails too!! With the original file you only needed to pass a "Document" column for ARES to generate the synthetic dataset, but now Query and Answer columns are also required. Otherwise you get the following error:

Error: The DataFrame is missing the following required column(s): Query, Answer.
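
For what it's worth, a possible workaround (untested beyond the column check; the column names come straight from the error message and the output filename is just an example) would be to pad the document-only TSV with placeholder columns:

    import pandas as pd

    # Hypothetical workaround: the error above lists "Query" and "Answer" as
    # required columns, so add empty placeholders to the document-only file.
    # Whether ARES accepts empty values further down the pipeline is unverified.
    df = pd.read_csv("example_files/document_filepath.tsv", sep="\t")

    for column in ("Query", "Answer"):
        if column not in df.columns:
            df[column] = ""

    df.to_csv("example_files/document_filepath_padded.tsv", sep="\t", index=False)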

So it seems like the requirements for ARES are quite a bit more complex than expected. The README file contains the following information:

"The ARES training pipeline is three steps:​

Generate synthetic queries and answers from in-domain passages"

Then:

"A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples but several hundred examples is ideal."

But to generate the synthetic dataset, it requires query, document, and answer triples instead of an in-domain passages file as described.
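
For reference, this is roughly what such a triple file would have to look like if the column names in the error message above are anything to go by (a sketch only; the exact columns and any annotation-label columns ARES expects are not clearly documented, and the output path is just an example):

    import pandas as pd

    # Sketch of a query/document/answer triple file, based only on the README
    # wording and the error message quoted above. Columns beyond Query,
    # Document and Answer (e.g. annotation labels) are not confirmed here.
    validation_set = pd.DataFrame(
        {
            "Query": ["What license is ARES released under?"],
            "Document": ["ARES is released under the Apache License 2.0."],
            "Answer": ["Apache License 2.0."],
        }
    )

    validation_set.to_csv("human_preference_validation.tsv", sep="\t", index=False)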

There are tons of other inconsistencies, but given your code and documentation it's impossible to reproduce even the most basic examples.

elsatch commented 2 months ago

In the training classifier page, the full example code doesn't work.

elsatch commented 2 months ago

Just to clarify the situation, the documentation at the Vercel site relates to the new-dev branch. The legacy documentation in the README.md relates to the scripts only.

I am finding my way through the new documentation, trying to fix typos and routes on the new-dev branch.

ViceSilva commented 2 months ago

Hello,

I've also encountered issues trying to reproduce the results of the paper using the code from this repository's main branch. Do you think the new-dev branch is better suited for this purpose?

elsatch commented 2 months ago

After reviewing the codebase, it seems like the new-dev branch creates a new abstraction layer on top of the existing scripts, so I would say that the new-dev branch is the way to move forward. This is the existing relationship:

ares.py
│
├── synthetic_generator.py
│   └── LLM_as_a_Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py
│       ├── LLM_as_a_Judge_Adaptation/LLM_Generation_Functions.py
│       └── LLM_as_a_Judge_Adaptation/Filter_Synthetic_Queries.py
│
├── binary_classifier.py
│   └── LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py
│
├── rag_scoring.py
│   └── RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py
│       ├── RAG_Automatic_Evaluation/Evaluation_Functions.py
│       └── RAG_Automatic_Evaluation/ppi.py
│
└── ues_idp.py
    └── RAG_Automatic_Evaluation/Evaluation_Functions.py
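
If anyone wants to double-check this mapping, a quick script like the following (my own helper, not part of ARES; the ares/ package path is an assumption) lists what each wrapper module imports:

    import ast
    from pathlib import Path

    # Helper sketch, not part of ARES: print the modules each wrapper file
    # imports, to confirm the relationships drawn in the tree above.
    def list_imports(path: Path) -> list[str]:
        tree = ast.parse(path.read_text())
        names = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.extend(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module)
        return names

    # Assumes the wrapper modules live in an ares/ package at the repo root.
    for wrapper in ("synthetic_generator.py", "binary_classifier.py", "rag_scoring.py", "ues_idp.py"):
        path = Path("ares") / wrapper
        if path.exists():
            print(wrapper, "->", list_imports(path))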

elsatch commented 2 months ago

After the latest update, it makes no sense to keep fixing issues in the legacy docs. Let's review the latest changes to see how many of these issues remain relevant!