jina-ai / late-chunking

Code for explaining and evaluating late chunking (chunked pooling)
Apache License 2.0

Error when running testing #18

Closed tyu008 closed 1 month ago

tyu008 commented 1 month ago

I ran the command "python3 run_chunked_eval.py --task-name TRECCOVIDChunked" and got the error below. Any suggestions?

ERROR:mteb.evaluation.MTEB:Error while evaluating TRECCOVIDChunked: Currently truncation is only implemented for documents without titles
Traceback (most recent call last):
  File "/raid/tanyu/late-chunking/run_chunked_eval.py", line 167, in <module>
    main()
  File "/home/tanyu/.local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/tanyu/.local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/tanyu/.local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tanyu/.local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/raid/tanyu/late-chunking/run_chunked_eval.py", line 129, in main
    evaluation.run(
  File "/home/tanyu/.local/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 422, in run
    raise e
  File "/home/tanyu/.local/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 383, in run
    results, tick, tock = self._run_eval(
  File "/home/tanyu/.local/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 260, in _run_eval
    results = task.evaluate(
  File "/raid/tanyu/late-chunking/chunked_pooling/mteb_chunked_eval.py", line 95, in evaluate
    scores[hf_subset] = self._evaluate_monolingual(
  File "/raid/tanyu/late-chunking/chunked_pooling/mteb_chunked_eval.py", line 162, in _evaluate_monolingual
    corpus = self._truncate_documents(corpus)
  File "/raid/tanyu/late-chunking/chunked_pooling/mteb_chunked_eval.py", line 110, in _truncate_documents
    raise NotImplementedError(
NotImplementedError: Currently truncation is only implemented for documents without titles

guenthermi commented 1 month ago

Sorry, the last commit enabled truncation by default. However, the current code doesn't support truncation for benchmark datasets (like TRECCOVID) where the documents have both a title and a text. For the experiments in the paper, we ran TRECCOVID without truncation. Nevertheless, it should work when you disable truncation by running:

python3 run_chunked_eval.py --task-name TRECCOVIDChunked --truncate-max-length 0

I will also make a PR ready to fix it.
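
To illustrate the limitation described above, here is a minimal sketch, not the repository's actual code. It assumes a corpus shaped as a dict of documents with "title" and "text" keys, uses whitespace-separated words as a stand-in for real tokenizer tokens, and uses a hypothetical function name, `truncate_documents`. It only shows why a titled document triggers the error while an untitled one is truncated normally.

```python
from typing import Dict, Optional

Doc = Dict[str, str]


def truncate_documents(
    corpus: Dict[str, Doc], max_tokens: Optional[int]
) -> Dict[str, Doc]:
    """Hypothetical helper: truncate each document's text to max_tokens words.

    Whitespace-separated words stand in for tokenizer tokens (an assumption).
    """
    if max_tokens is None:
        # Truncation disabled: return the corpus untouched.
        return corpus

    truncated: Dict[str, Doc] = {}
    for doc_id, doc in corpus.items():
        if doc.get("title"):
            # Mirrors the error message from the traceback in this issue.
            raise NotImplementedError(
                "Currently truncation is only implemented for documents without titles"
            )
        words = doc["text"].split()
        truncated[doc_id] = {**doc, "text": " ".join(words[:max_tokens])}
    return truncated
```

With this sketch, passing `None` as `max_tokens` skips truncation entirely, which corresponds to disabling truncation on the command line.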

guenthermi commented 1 month ago

OK, we merged https://github.com/jina-ai/late-chunking/pull/19, so it should now also work without the additional --truncate-max-length argument.

tyu008 commented 1 month ago

Thanks for your quick response! It works now @guenthermi

tyu008 commented 1 month ago

Another issue comes up when I run "python3 run_chunked_eval.py --task-name SciFactChunked --truncate-max-length 0":

  File "/late-chunking/chunked_pooling/mteb_chunked_eval.py", line 254, in _evaluate_monolingual
    max_k = int(max(k_values) / max_chunks)
ZeroDivisionError: division by zero

@guenthermi could you help check it? Thanks

guenthermi commented 1 month ago

OK, without --truncate-max-length 0 it should have worked fine on the latest version. I have now made a small commit that makes truncate_max_length=0 equivalent to truncate_max_length=None. It seems that in some cases it really truncated to 0 tokens instead of disabling truncation.
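
The idea of treating 0 like None can be sketched as follows. This is only an illustration under assumptions, not the actual commit: the helper names `normalize_truncate_max_length` and `truncate_tokens` are hypothetical. It shows that a literal limit of 0 slices a token list down to nothing, which presumably leaves no chunks and leads to the division by zero reported above, whereas a normalized limit leaves the tokens untouched.

```python
from typing import List, Optional


def normalize_truncate_max_length(value: Optional[int]) -> Optional[int]:
    """Hypothetical helper: map 0 (or None) to None, meaning 'no truncation'."""
    if value is None or value == 0:
        return None
    if value < 0:
        raise ValueError("truncate_max_length must be non-negative")
    return value


def truncate_tokens(tokens: List[int], max_length: Optional[int]) -> List[int]:
    """Keep at most max_length tokens; None disables truncation."""
    if max_length is None:
        return tokens
    return tokens[:max_length]


tokens = [101, 2023, 2003, 102]

# Without normalization, a limit of 0 removes every token.
assert truncate_tokens(tokens, 0) == []

# With normalization, 0 behaves like None and nothing is removed.
assert truncate_tokens(tokens, normalize_truncate_max_length(0)) == tokens
```

Normalizing once, right where the command-line value is read, keeps the rest of the code free of special cases for 0.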