Complete eval - GitHub Issues

Jonathan-Adly commented 6 days ago

@HalemoGPA - as discussed, here are the goals of this task. We want to run the full eval via a simple command:

Evaluation:

python evaluate.py --api-key="my-key" should run the full Vidore evaluations.
python evaluate.py --api-key="my-key" --collection_name arxivqa_collection should run just the arxivqa portion. The same should work for the other benchmarks.
For now, we only accept specific eval keys and return an error if someone tries to run an evaluation with a non-eval key. We don't want someone getting billed by accident.
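
A minimal sketch of how the two flags and the eval-key check could fit together, not the actual evaluate.py. `EVAL_KEYS`, `COLLECTIONS`, `select_collections`, and the `docvqa_collection` name are hypothetical placeholders; the thread does not say how eval keys are recognized or which collections exist beyond arxivqa_collection.

```python
import argparse

# Hypothetical allow-list: how eval keys are actually identified is not
# specified in this thread.
EVAL_KEYS = {"my-key"}

# Hypothetical collection names; only arxivqa_collection appears in the thread.
COLLECTIONS = ["arxivqa_collection", "docvqa_collection"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the Vidore evaluations.")
    parser.add_argument("--api-key", required=True, help="Eval-only API key.")
    parser.add_argument(
        "--collection_name",
        default=None,
        help="Evaluate a single collection; omit to run all of them.",
    )
    return parser


def select_collections(api_key, collection_name=None):
    """Return the collections to evaluate, refusing non-eval keys up front."""
    if api_key not in EVAL_KEYS:
        # Fail before any request is made so nobody gets billed by accident.
        raise PermissionError("Evaluation only accepts eval API keys.")
    if collection_name is None:
        return list(COLLECTIONS)
    if collection_name not in COLLECTIONS:
        raise ValueError(f"Unknown collection: {collection_name}")
    return [collection_name]


def main(argv=None):
    args = build_parser().parse_args(argv)
    for name in select_collections(args.api_key, args.collection_name):
        # Placeholder for the real per-collection evaluation.
        print(f"evaluating {name}")
```

Calling `main(["--api-key=my-key", "--collection_name", "arxivqa_collection"])` would evaluate only that collection, while a key outside the allow-list raises before any work starts.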

Upsertion:

Upsertion is a one-time task, so it shouldn't be in evaluate.py. It should be a one-time script: once things are upserted, we never have to worry about it again.
Example: python upsert.py --api-key="my-key" will upsert the Vidore documents in the background.
python upsert.py --api-key="my-key" --collection_name arxivqa_collection will upsert only the arxivqa_collection.
I wouldn't worry about parallelization or anything fancy here. Keep it super simple.

Report:

We want a table similar to the one in the ColPali paper with the current leader in the Vidore leaderboard, the original ColPali numbers/comparisons, and ours.
We want an average query latency benchmark. It should state the number of documents, since more documents means higher latency.
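
One possible shape for the latency benchmark, as a sketch only. `run_query` stands in for whatever function actually sends a query to the service, and `latency_row` is a hypothetical helper that keeps the document count next to the average so the two are always reported together.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_query_latency(
    run_query: Callable[[str], object], queries: Sequence[str]
) -> float:
    """Return the mean wall-clock time per query, in seconds."""
    if not queries:
        raise ValueError("At least one query is needed to measure latency.")
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    return mean(latencies)


def latency_row(
    collection_name: str,
    num_documents: int,
    run_query: Callable[[str], object],
    queries: Sequence[str],
) -> dict:
    """Build one report row that pairs the average latency with corpus size."""
    return {
        "collection": collection_name,
        "documents": num_documents,
        "queries": len(queries),
        "avg_latency_s": average_query_latency(run_query, queries),
    }
```

Running `latency_row` once per collection gives one row per corpus size, which makes the relationship between document count and latency visible in the report.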

HalemoGPA commented 3 days ago

@Jonathan-Adly

Evaluation:

~~python evaluate.py --api-key="my-key" should run the full Vidore evaluations.~~
~~python evaluate.py --api-key="my-key" --collection_name arxivqa_collection should run just the arxivqa portion. The same should work for the other benchmarks.~~
~~For now, we only accept specific eval keys and return an error if someone tries to run an evaluation with a non-eval key. We don't want someone getting billed by accident.~~

Upsertion: The code is done in main.py. To-do list:

Make an upsert.py file with the same code.
Add a retry mechanism to the upsertion function.
Build code to upsert the documents missed in the main upsertion function.
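
A rough sketch of the last two to-do items, assuming nothing about the code in main.py. `upload` is a hypothetical callable that upserts one document and raises on failure; the attempt count and backoff values are illustrative defaults, not settings from the project.

```python
import time
from typing import Callable, Iterable, List


def upsert_with_retry(
    upload: Callable[[dict], None],
    document: dict,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> bool:
    """Try to upsert one document, backing off exponentially between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            upload(document)
            return True
        except Exception:
            # Sketch only: real code would catch the client's specific errors.
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    return False


def upsert_all(
    upload: Callable[[dict], None],
    documents: Iterable[dict],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> List[dict]:
    """Upsert documents one at a time and return the ones that never succeeded."""
    missed = []
    for document in documents:
        if not upsert_with_retry(upload, document, max_attempts, base_delay):
            missed.append(document)
    return missed
```

The list returned by `upsert_all` can be passed straight back into `upsert_all` on a later run, which covers the missed documents without any parallelization.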

Report:

We want a table similar to the one in the ColPali paper with the current leader in the Vidore leaderboard, the original ColPali numbers/comparisons, and ours. (Currently pursuing)
~~We want an average query latency benchmark. It should state the number of documents, since more documents means higher latency.~~

We are almost there!

HalemoGPA commented 3 days ago

Report: - We want a table similar to the one in the ColPali paper with the current leader in the Vidore leaderboard, the original ColPali numbers/comparisons, and ours. (Currently pursuing)

Done

We still have the upsertion edits to make.

tjmlabs / ColiVara-Eval

Complete eval #3