serratus-bio / open-virome

monorepo for data explorer UI and APIs
http://openvirome.com/
GNU Affero General Public License v3.0

[LLM] Define a falsifiable, measurable hypothesis #142

Open ababaian opened 1 month ago

ababaian commented 1 month ago

Task 3: Define a falsifiable, measurable hypothesis.

Our first hypothesis questions the validity of using an AI model for querying a database at all, and whether an LLM can effectively retrieve and interpret sample metadata and research abstracts from a given local cluster of nodes in the graph.

There is no unit of measurement provided here; as such, there are no criteria by which the hypothesis can be shown to fail as an outcome of the proposed work. Leading into the hypothesis, you have to establish what the evaluation metrics are.

In terms of general LLM performance, metrics like precision, recall, F1 score, and ROUGE-n [5] can be used to measure output.

What are the strengths and weaknesses of each of these methods, and how will you establish this measurement?
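As a concrete starting point, here is a minimal sketch of how precision/recall/F1 could be computed over retrieved accessions, and ROUGE-n over free-text answers. The accessions, example texts, and the choice of the rouge-score package are placeholders, not decisions made in this thread:

```python
# Sketch only: set-based precision/recall/F1 over predicted vs. gold accessions,
# plus ROUGE-1/2 over free-text answers. All inputs below are placeholders.
from rouge_score import rouge_scorer


def precision_recall_f1(retrieved: set, relevant: set):
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


gold = {"SRR0000001", "SRR0000002", "SRR0000003"}        # hand-labelled expected runs
predicted = {"SRR0000001", "SRR0000003", "SRR0000009"}   # runs named by the LLM
print(precision_recall_f1(predicted, gold))              # (0.667, 0.667, 0.667)

# ROUGE-n for free-text answers, e.g. an LLM summary vs. a reference abstract.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
print(scorer.score("reference abstract text here", "llm-generated summary text here"))
```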

Tools like cosine similarity and BERTScore can measure contextual similarities of embeddings.

Which embeddings? These are not introduced yet.
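For illustration only, one way this could look once an embedding model is chosen; the sentence-transformers model below is an arbitrary placeholder, not one specified in this issue:

```python
# Sketch only: which embedding model to use is still an open question above;
# all-MiniLM-L6-v2 is an arbitrary placeholder choice.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

model = SentenceTransformer("all-MiniLM-L6-v2")
candidate = "LLM-generated summary of the sample metadata"
reference = "Reference abstract or curated metadata description"

emb = model.encode([candidate, reference])
cosine = util.cos_sim(emb[0], emb[1]).item()   # scalar similarity in [-1, 1]

# BERTScore compares token-level contextual embeddings of the two texts.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(cosine, F1.item())
```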

If the evaluation scores of the LLM outputs are deemed sufficient... You must define the threshold of "sufficient" as an exact number, and this value must be defined prior to starting any experiment.

A sub-task is evident here. Create a table with each evaluation criterion you could possibly use for your task. You list many; for each one, what are the:

Outcomes

Stufedpanda commented 1 month ago

The table is too big to fit into one comment, so I will attach a file with an excerpt, the table, and citations: Hypothesis Evaluation Methods.pdf

Stufedpanda commented 1 week ago

As of November 13th, 2024

Implemented the PyTrecEval library (the PyTrecEval repo can be found here)

Given the expected SRA runs or BioProjects, and a complementary predicted output from the LLM for each query, we can create these two files and feed them to a function that uses the PyTrecEval library to compute multiple retrieval metrics.

Example output:

map                      all     0.5602
P_5                      all     0.5143
P_10                     all     0.4286
recall_5                 all     0.4478
recall_10                all     0.4939
ndcg                     all     0.6199

The first column denotes the evaluation metric, the second is the query_id (in this case, it's the average across all queries), and the third is the score for that metric, which is between 0 and 1.
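For reference, a minimal sketch of how the two inputs could be expressed and scored with pytrec_eval directly in Python; the query IDs, accessions, and scores are placeholders, and the repo's actual wrapper around this may differ:

```python
# Sketch only: qrels hold the expected SRA runs / BioProjects per query,
# the run holds the LLM's predicted accessions with confidence/rank scores.
import statistics
import pytrec_eval

qrels = {
    "q1": {"SRR0000001": 1, "SRR0000002": 1},
    "q2": {"PRJNA000001": 1},
}
run = {
    "q1": {"SRR0000001": 0.9, "SRR0000009": 0.4},
    "q2": {"PRJNA000001": 0.8, "PRJNA000002": 0.3},
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "P", "recall", "ndcg"})
per_query = evaluator.evaluate(run)   # {query_id: {metric: score}}

# Average across queries to reproduce the "all" rows shown above.
for metric in ["map", "P_5", "P_10", "recall_5", "recall_10", "ndcg"]:
    mean = statistics.mean(scores[metric] for scores in per_query.values())
    print(f"{metric:<24s} all     {mean:.4f}")
```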

Improvements

lukepereira commented 1 week ago

Another quick test we'll want to run is comparing GPT-4o vs. o1. From what I've read, GPT-4o handles large context dumps much better than o1. Both can be used in combination with something like Anthropic's contextual retrieval preprocessing.
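A rough sketch of what that head-to-head could look like (model identifiers, prompt, and default parameters are assumptions; o1-class models may restrict some request parameters):

```python
# Sketch only: send the same retrieval query to both models and keep the
# answers for scoring with the same pytrec_eval pipeline described above.
from openai import OpenAI

client = OpenAI()
prompt = (
    "Given the attached cluster metadata and abstracts, list the SRA run "
    "accessions most relevant to the query, one per line."
)

answers = {}
for model_name in ["gpt-4o", "o1-preview"]:   # placeholder model names
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    answers[model_name] = resp.choices[0].message.content

print(answers)
```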