Overview
Experiments added:

LongEmbed Examples against chunk size (nDCG@10 and mAP@10)
Similarly to run_chunked_eval.py, run_chunked_eval_with_macro_chunking.py can just be run on the command line with a --task-name argument. To reproduce everything easily, I recommend the bash file below

#!/bin/bash
# The LongEmbed task names to evaluate
names=(LEMBWikimQARetrievalChunked LEMBQMSumRetrievalChunked LEMBNarrativeQARetrievalChunked LEMBSummScreenFDRetrievalChunked)
# Run the chunked evaluation once per task
for name in "${names[@]}"; do
    echo "$name"
    python3 run_chunked_eval_with_macro_chunks.py --task-name "$name"
done
to run them all at once. The results can then be displayed graphically in a matplotlib plot by running plot_chunk_size_experiments.py.
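As a rough illustration of what that plot looks like, here is a minimal matplotlib sketch in the same spirit. The task names match the bash file above, but the chunk sizes and scores are made-up placeholders, not outputs of the actual experiments:

import matplotlib.pyplot as plt

# Hypothetical results: task name -> (chunk sizes, nDCG@10 per size).
# These values are placeholders, not real experiment outputs.
results = {
    "LEMBWikimQARetrievalChunked": ([128, 256, 512, 1024], [0.55, 0.58, 0.60, 0.57]),
    "LEMBQMSumRetrievalChunked": ([128, 256, 512, 1024], [0.30, 0.33, 0.35, 0.34]),
}

fig, ax = plt.subplots()
for task, (sizes, scores) in results.items():
    ax.plot(sizes, scores, marker="o", label=task)
ax.set_xlabel("Chunk size (tokens)")
ax.set_ylabel("nDCG@10")
ax.set_title("Retrieval quality vs. chunk size")
ax.legend()
plt.show()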
Macro chunking approach vs 'hard' boundary approach with 0 overlap
Similar to the above: this compares macro chunking to non-macro chunking, with experiment file run_macro_chunking_experiments.py and plot file plot_macro_chunking_experiments.py. A sketch contrasting the two splitting strategies follows.
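To make the distinction concrete, here is a small sketch of the two splitting strategies, assuming "macro chunking" means large overlapping windows (so context survives across window boundaries) while the "hard" approach cuts at fixed boundaries with 0 overlap. The function names and sizes are mine for illustration, not taken from the experiment file:

from typing import List

def hard_chunks(tokens: List[str], size: int) -> List[List[str]]:
    # 'Hard' boundaries with 0 overlap: each token lands in exactly one chunk.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def macro_chunks(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    # Overlapping macro chunks: tokens near a boundary also appear in the
    # neighbouring window, so surrounding context is preserved.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(10)]
print(hard_chunks(tokens, 4))      # [['t0'..'t3'], ['t4'..'t7'], ['t8', 't9']]
print(macro_chunks(tokens, 4, 2))  # consecutive windows share 2 tokens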
Example with Anthropic's contextual retrieval
You can run explanatory_contextual_retrieval.py to see a comparison between Anthropic's contextual retrieval (which manually adds context to each chunk), late chunking, and naive chunking. The comparison runs on a generated document that deliberately has context missing from its later sentences ('Its' in place of the company name), and is scored via cosine similarities between the chunk embeddings produced by jina-embeddings-v2-base-en.
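For a feel of that comparison, here is a minimal sketch of the naive vs. contextual half of it (late chunking is left out for brevity). The document, the company name AcmeCorp, and the query are invented stand-ins for whatever the script actually generates:

import numpy as np
from transformers import AutoModel

# jina-embeddings-v2-base-en ships an encode() helper via its custom model code.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

# Invented document: the second sentence drops the company name ('Its'),
# mimicking the generated document described above.
chunks = [
    "AcmeCorp reported strong growth in 2023.",
    "Its revenue increased by 20 percent.",
]
# Contextual retrieval: manually prepend the missing context to each chunk.
contextual = [f"About AcmeCorp: {c}" for c in chunks]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_emb = model.encode(["How did AcmeCorp's revenue change?"])[0]
for naive_emb, ctx_emb in zip(model.encode(chunks), model.encode(contextual)):
    print(f"naive: {cosine(query_emb, naive_emb):.3f}   "
          f"contextual: {cosine(query_emb, ctx_emb):.3f}")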