serratus-bio / open-virome

monorepo for data explorer UI and APIs
http://openvirome.com/
GNU Affero General Public License v3.0

[LLM] Define a falsifiable, measurable hypothesis #142

Open ababaian opened 1 month ago

ababaian commented 1 month ago

Task 3: Define a falsifiable, measurable hypothesis.

Our first hypothesis questions the validity of using an AI model for querying a database at all, and whether an LLM can effectively retrieve and interpret sample metadata and research abstracts from a given local cluster of nodes in the graph.

There is no unit of measurement provided here; as such, there are no criteria by which the hypothesis can be shown to fail as an outcome of the proposed work. Leading into the hypothesis, you have to establish what the evaluation metrics are.

In terms of general LLM performance, metrics like precision, recall, F1 score, and ROUGE-n [5] can be used to measure output.

What are the strengths and weaknesses of each of these methods, and how will you establish this measurement?
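As a concrete starting point, here is a minimal sketch of how precision/recall/F1 could be computed over retrieved accessions, and ROUGE-n over free-text answers. The accessions, example texts, and the choice of the rouge-score package are placeholders, not decisions made in this thread:

```python
# Sketch only: set-based precision/recall/F1 over predicted vs. gold accessions,
# plus ROUGE-1/2 over free-text answers. All inputs below are placeholders.
from rouge_score import rouge_scorer


def precision_recall_f1(retrieved: set, relevant: set):
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


gold = {"SRR0000001", "SRR0000002", "SRR0000003"}        # hand-labelled expected runs
predicted = {"SRR0000001", "SRR0000003", "SRR0000009"}   # runs named by the LLM
print(precision_recall_f1(predicted, gold))              # (0.667, 0.667, 0.667)

# ROUGE-n for free-text answers, e.g. an LLM summary vs. a reference abstract.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
print(scorer.score("reference abstract text here", "llm-generated summary text here"))
```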

Tools like cosine similarity and BERTScore can measure contextual similarities of embeddings.

Which embeddings? These are not introduced yet.
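For illustration only, one way this could look once an embedding model is chosen; the sentence-transformers model below is an arbitrary placeholder, not one specified in this issue:

```python
# Sketch only: which embedding model to use is still an open question above;
# all-MiniLM-L6-v2 is an arbitrary placeholder choice.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

model = SentenceTransformer("all-MiniLM-L6-v2")
candidate = "LLM-generated summary of the sample metadata"
reference = "Reference abstract or curated metadata description"

emb = model.encode([candidate, reference])
cosine = util.cos_sim(emb[0], emb[1]).item()   # scalar similarity in [-1, 1]

# BERTScore compares token-level contextual embeddings of the two texts.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(cosine, F1.item())
```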

If the evaluation scores of the LLM outputs are deemed sufficient... You must define the threshold of "sufficient" as an exact number, and this value must be defined prior to starting any experiment.

A sub-task is evident here. Create a table with each evaluation criterion you could possibly use for your task. You list many; for each one, what are the:

Outcomes

Stufedpanda commented 1 month ago

The table is too big to fit into one comment, so I will attach a file with an excerpt, the table, and citations: Hypothesis Evaluation Methods.pdf

Stufedpanda commented 1 week ago

As of November 13th, 2024

Implemented the PyTrecEval library (the PyTrecEval repo can be found here)

Given the expected SRA runs or BioProjects, and a complementary predicted output from the LLM for each query, we can create these two files and feed them to a function that uses the PyTrecEval library to compute multiple retrieval metrics.

Example output:

map                      all     0.5602
P_5                      all     0.5143
P_10                     all     0.4286
recall_5                 all     0.4478
recall_10                all     0.4939
ndcg                     all     0.6199

The first column denotes the evaluation metric, the second is the query_id (in this case, it's the average across all queries), and the third is the score for that metric, which is between 0 and 1.
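For reference, a minimal sketch of how the two inputs could be expressed and scored with pytrec_eval directly in Python; the query IDs, accessions, and scores are placeholders, and the repo's actual wrapper around this may differ:

```python
# Sketch only: qrels hold the expected SRA runs / BioProjects per query,
# the run holds the LLM's predicted accessions with confidence/rank scores.
import statistics
import pytrec_eval

qrels = {
    "q1": {"SRR0000001": 1, "SRR0000002": 1},
    "q2": {"PRJNA000001": 1},
}
run = {
    "q1": {"SRR0000001": 0.9, "SRR0000009": 0.4},
    "q2": {"PRJNA000001": 0.8, "PRJNA000002": 0.3},
}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "P", "recall", "ndcg"})
per_query = evaluator.evaluate(run)   # {query_id: {metric: score}}

# Average across queries to reproduce the "all" rows shown above.
for metric in ["map", "P_5", "P_10", "recall_5", "recall_10", "ndcg"]:
    mean = statistics.mean(scores[metric] for scores in per_query.values())
    print(f"{metric:<24s} all     {mean:.4f}")
```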

Improvements

lukepereira commented 1 week ago

Another quick test we'll want to run is comparing GPT-4o vs. o1. From what I've read, GPT-4o handles large context dumps much better than o1. Both can be used in combination with something like Anthropic's contextual retrieval preprocessing.
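A rough sketch of what that head-to-head could look like (model identifiers, prompt, and default parameters are assumptions; o1-class models may restrict some request parameters):

```python
# Sketch only: send the same retrieval query to both models and keep the
# answers for scoring with the same pytrec_eval pipeline described above.
from openai import OpenAI

client = OpenAI()
prompt = (
    "Given the attached cluster metadata and abstracts, list the SRA run "
    "accessions most relevant to the query, one per line."
)

answers = {}
for model_name in ["gpt-4o", "o1-preview"]:   # placeholder model names
    resp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
    )
    answers[model_name] = resp.choices[0].message.content

print(answers)
```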