stanford-futuredata / ARES

https://ares-ai.vercel.app/
Apache License 2.0

Documentation and code are so broken! #23

Closed elsatch closed 2 months ago

elsatch commented 2 months ago

Hi,

I have tried to reproduce the paper, or more specifically, to follow the step-by-step instructions, and unfortunately nothing works.

These are the issues I've detected so far in the Python script version:

1. The current requirements.txt can't be installed as instructed by the README.md file because of conflicting library versions.
2. The sample document_filepath.tsv file in example_files has only 6 examples and a single "Documents" column.
3. The synthetic generation example code fails because the number of documents available to sample is less than the given --documents_sampled 10000 (see the sketch below).
4. If you change documents_sampled to 5 so that it doesn't fail there, it fails later anyway, because the step that generates the negative alternatives requires at least 100 samples.

So with the given documents in the example_files folder, it's impossible to generate a synthetic dataset.
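
As a rough illustration of points 3 and 4 above, here is a minimal Python sketch (my own check, not ARES code) that compares the rows actually available in the example file with the sample size requested in the README command:

    import pandas as pd

    # Illustrative check, not ARES's own validation logic: compare how many
    # documents the example file actually provides against the sample size
    # requested on the command line.
    documents = pd.read_csv("example_files/document_filepath.tsv", sep="\t")

    requested = 10000           # value passed as --documents_sampled in the README example
    available = len(documents)  # only 6 rows in the shipped example file

    if available < requested:
        print(f"Only {available} documents available, cannot sample {requested}.")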

Following the new Vercel documentation at https://ares-ai.vercel.app/synth_gen/ is absolute hit-and-miss because of the copy-pasted regions, for example on that very page.

But to make things even worse, the Python code in the ares-ai library is different from the Python scripts, so if you try to run the code using example_files/document_filepath.tsv it fails too!! With the original file you only needed to pass a "Document" column for ARES to generate the synthetic dataset, but now Query and Answer columns are also required. Otherwise you get the following error:

Error: The DataFrame is missing the following required column(s): Query, Answer.
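
For what it's worth, a possible workaround (untested beyond the column check; the column names come straight from the error message and the output filename is just an example) would be to pad the document-only TSV with placeholder columns:

    import pandas as pd

    # Hypothetical workaround: the error above lists "Query" and "Answer" as
    # required columns, so add empty placeholders to the document-only file.
    # Whether ARES accepts empty values further down the pipeline is unverified.
    df = pd.read_csv("example_files/document_filepath.tsv", sep="\t")

    for column in ("Query", "Answer"):
        if column not in df.columns:
            df[column] = ""

    df.to_csv("example_files/document_filepath_padded.tsv", sep="\t", index=False)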

So it seems like the requirements for ARES are quite a bit more complex than expected. The README file contains the following information:

"The ARES training pipeline is three steps:​

Generate synthetic queries and answers from in-domain passages"

Then:

"A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples but several hundred examples is ideal."

But to generate the synthetic dataset, it requires query, document, and answer triples instead of an in-domain passages file as described.
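
For reference, this is roughly what such a triple file would have to look like if the column names in the error message above are anything to go by (a sketch only; the exact columns and any annotation-label columns ARES expects are not clearly documented, and the output path is just an example):

    import pandas as pd

    # Sketch of a query/document/answer triple file, based only on the README
    # wording and the error message quoted above. Columns beyond Query,
    # Document and Answer (e.g. annotation labels) are not confirmed here.
    validation_set = pd.DataFrame(
        {
            "Query": ["What license is ARES released under?"],
            "Document": ["ARES is released under the Apache License 2.0."],
            "Answer": ["Apache License 2.0."],
        }
    )

    validation_set.to_csv("human_preference_validation.tsv", sep="\t", index=False)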

There are tons of other inconsistencies, but given your code and documentation it's impossible to reproduce even the most basic examples.

elsatch commented 2 months ago

In the training classifier page, the full example code doesn't work.

elsatch commented 2 months ago

Just to clarify the situation, the documentation at the Vercel site relates to the new-dev branch. The legacy documentation in the README.md relates to the scripts only.

I am finding my way through the new documentation, trying to fix typos and routes on the new-dev branch.

ViceSilva commented 2 months ago

Hello,

I've also encountered issues trying to reproduce the results of the paper using the code from this repository's main branch. Do you think the new-dev branch is better suited for this purpose?

elsatch commented 2 months ago

After reviewing the codebase, it seems like the new-dev branch creates a new abstraction layer on top of the existing scripts, so I would say that the new-dev branch is the way to move forward. This is the existing relationship:

ares.py
│
├── synthetic_generator.py
│   └── LLM_as_a_Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py
│       ├── LLM_as_a_Judge_Adaptation/LLM_Generation_Functions.py
│       └── LLM_as_a_Judge_Adaptation/Filter_Synthetic_Queries.py
│
├── binary_classifier.py
│   └── LLM_as_a_Judge_Adaptation/General_Binary_Classifier.py
│
├── rag_scoring.py
│   └── RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py
│       ├── RAG_Automatic_Evaluation/Evaluation_Functions.py
│       └── RAG_Automatic_Evaluation/ppi.py
│
└── ues_idp.py
    └── RAG_Automatic_Evaluation/Evaluation_Functions.py
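
If anyone wants to double-check this mapping, a quick script like the following (my own helper, not part of ARES; the ares/ package path is an assumption) lists what each wrapper module imports:

    import ast
    from pathlib import Path

    # Helper sketch, not part of ARES: print the modules each wrapper file
    # imports, to confirm the relationships drawn in the tree above.
    def list_imports(path: Path) -> list[str]:
        tree = ast.parse(path.read_text())
        names = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.extend(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module)
        return names

    # Assumes the wrapper modules live in an ares/ package at the repo root.
    for wrapper in ("synthetic_generator.py", "binary_classifier.py", "rag_scoring.py", "ues_idp.py"):
        path = Path("ares") / wrapper
        if path.exists():
            print(wrapper, "->", list_imports(path))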

elsatch commented 2 months ago

After the latest update, it makes no sense to keep fixing issues in the legacy docs. Let's review the latest changes to see how many of these issues remain relevant!