ababaian opened this issue 1 month ago
The table is too big to fit into one comment, so I will attach a file with an excerpt, the full table, and citations: Hypothesis Evaluation Methods.pdf
Implemented the PyTrecEval library (the PyTrecEval repo can be found here), which uses `.topic` and `.qrels` files to evaluate retrieval results, where:

- The `.topic` file contains lines in the format `query_id query`, where:
  - `query_id`: an identifier that will be used to reference the query in the `.qrels` file
  - `query`: a `str` containing the query whose retrieval results will be evaluated
- The `.qrels` file contains lines in the format `query_id 0 file_id score`, where:
  - `query_id`: the ID of the corresponding query in the `.topic` file
  - `0`: a placeholder value required by PyTrecEval's formatting
  - `file_id`: the identifier of the SRA run or BioProject
  - `score`: a binary value denoting whether the SRA run ID or BioProject is relevant to the query; this can be changed to continuous or discrete values to better capture graded relevance

Given the expected SRA runs or BioProjects and the complementary predicted output from an LLM for each query, we can create these two files and feed them to a function using the PyTrecEval library, which outputs a calculation of multiple retrieval metrics (a sketch of that function follows the example output below).
Example output:

```
map        all  0.5602
P_5        all  0.5143
P_10       all  0.4286
recall_5   all  0.4478
recall_10  all  0.4939
ndcg       all  0.6199
```

The first column denotes the evaluation metric, the second is the query_id (in this case `all`, the average across all queries), and the third is the score for that metric, which is between 0 and 1.
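A minimal sketch of what that evaluation function could look like, assuming the standard `pytrec_eval` Python API; the file name, the run dictionary, and the chosen measure set are illustrative, not the final pipeline:

```python
import pytrec_eval

def load_qrels(path):
    """Parse a .qrels file (lines of: query_id 0 file_id score) into the nested dict pytrec_eval expects."""
    qrels = {}
    with open(path) as f:
        for line in f:
            query_id, _, file_id, score = line.split()
            qrels.setdefault(query_id, {})[file_id] = int(score)
    return qrels

def evaluate_run(qrels, run, measures=None):
    """run maps query_id -> {file_id: retrieval score} built from the LLM's predicted SRA runs / BioProjects."""
    # 'P' and 'recall' expand to trec_eval's default cutoffs (P_5, P_10, ..., recall_5, recall_10, ...).
    measures = measures or {'map', 'ndcg', 'P', 'recall'}
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, measures)
    per_query = evaluator.evaluate(run)
    # Average each measure over all evaluated queries, i.e. the "all" row in the example output above.
    measure_names = next(iter(per_query.values())).keys()
    return {m: sum(scores[m] for scores in per_query.values()) / len(per_query)
            for m in measure_names}

# Illustrative usage: higher run scores mean the LLM ranked that accession earlier.
qrels = load_qrels("hypothesis.qrels")
run = {"q1": {"SRR000001": 3.0, "PRJNA000002": 2.0, "SRR000003": 1.0}}
print(evaluate_run(qrels, run))
```

Per-query scores are also available from `per_query` if we want to see which individual queries the retrieval fails on rather than just the average.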
Currently the `.qrels` file is manually formatted. This can be optimized into a pipeline where, given a Cypher query, the formatted file is returned automatically. The `.topic` file can likely be automated in the same way: given a user query, automatically update the `.topic` file (see the sketch at the end of this comment).

Another quick test we'll want to run is comparing GPT-4o vs. o1. From what I've read, GPT-4o handles large context dumps much better than o1. Both could be used in combination with something like Anthropic's contextual retrieval preprocessing.
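Returning to the file-generation step above, a sketch of what that automation could look like, assuming the expected SRA run / BioProject IDs per query have already been pulled back from the Cypher query (helper names and file paths here are hypothetical):

```python
def write_qrels(expected_ids_by_query, path):
    """Write TREC-style qrels lines: query_id 0 file_id relevance."""
    with open(path, "w") as out:
        for query_id, file_ids in expected_ids_by_query.items():
            for file_id in file_ids:
                out.write(f"{query_id} 0 {file_id} 1\n")  # binary relevance for now

def write_topics(queries_by_id, path):
    """Write .topic lines: query_id followed by the natural-language query."""
    with open(path, "w") as out:
        for query_id, query in queries_by_id.items():
            out.write(f"{query_id}\t{query}\n")

# Illustrative: expected_ids_by_query would come from the Cypher query results,
# queries_by_id from the user-supplied natural-language queries.
write_qrels({"q1": ["SRR000001", "PRJNA000002"]}, "hypothesis.qrels")
write_topics({"q1": "example natural-language query"}, "hypothesis.topic")
```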
Task 3: Define a falsifiable, measurable hypothesis.
There is no unit of measurement provided here; as such, there are no criteria by which the hypothesis can be shown to fail as an outcome of the proposed work. Leading into the hypothesis, you have to establish what the evaluation metrics are.
What are the strengths and weaknesses of each of these methods, and how will you establish this measurement?
Which embeddings? These are not introduced yet.
A sub-task is evident here: create a table with each evaluation criterion you could possibly use for your task. You list many; for each one, what are the:
Outcomes