Open bhattg opened 4 days ago
Hi!
I've been trying to download kilt_wikipedia; however, every time I try, the download gets stuck at exactly 3.39G on the progress bar. Any suggestions?
I am using transformers v4.46.3 and torch v2.3.0+cu121.
Hi! Can you just try the HF datasets command? (I don't see any lines from Bergen in the error lines.)
dataset = datasets.load_dataset('kilt_wikipedia')
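For a check closer to what Bergen does internally, the loader in modules/dataset_processor.py also passes num_proc, so a sketch like the following exercises the same code path (num_proc=40 just mirrors the processing_num_proc setting and is only illustrative):

import datasets

# Plain HF datasets call outside Bergen; num_proc mirrors Bergen's
# processing_num_proc setting and is only illustrative.
dataset = datasets.load_dataset('kilt_wikipedia', num_proc=40)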
Thank you for the fast response. The issue still persists when the dataset is loaded directly with the HF datasets command. Following is the full stack trace:
python3 bergen.py retriever=splade-v3 reranker=debertav3 dataset=popqa generator=vllm_SOLAR-107B
[2024-11-25 07:42:40,480][datasets][INFO] - PyTorch version 2.3.0 available.
[2024-11-25 07:42:40,481][datasets][INFO] - TensorFlow version 2.8.0 available.
Unfinished experiment_folder: experiments/tmp_460be916ccb7601b
experiment_folder experiments/460be916ccb7601b
run_name: null
dataset_folder: datasets/
index_folder: indexes/
runs_folder: runs/
generated_query_folder: generated_queries/
processed_context_folder: processed_contexts/
experiments_folder: experiments/
retrieve_top_k: 50
rerank_top_k: 50
generation_top_k: 5
pyserini_num_threads: 20
processing_num_proc: 40
retriever:
  init_args:
    _target_: models.retrievers.splade.Splade
    model_name: naver/splade-v3
    max_len: 128
    batch_size: 64
    batch_size_sim: 512
reranker:
  init_args:
    _target_: models.rerankers.crossencoder.CrossEncoder
    model_name: naver/trecdl22-crossencoder-debertav3
    max_len: 256
    batch_size: 64
generator:
  init_args:
    _target_: models.generators.vllm.VLLM
    model_name: Upstage/SOLAR-10.7B-Instruct-v1.0
    max_new_tokens: 128
    max_length: 4096
    batch_size: 256
dataset:
  train:
    doc: null
    query: null
  dev:
    doc:
      init_args:
        _target_: modules.dataset_processor.KILT100w
        split: full
    query:
      init_args:
        _target_: modules.processors.qa_dataset_processor.POPQA
        split: test
  test:
    doc: null
    query: null
prompt:
  system: You are a helpful assistant. Your task is to extract relevant information
    from provided documents and to answer to questions as briefly as possible.
  user: f"Background:\n{docs}\n\nQuestion:\ {question}"
  system_without_docs: You are a helpful assistant. Answer the questions as briefly
    as possible.
  user_without_docs: f"Question:\ {question}"
Processing dataset kilt-100w in full split
Downloading data: 9%|███████████████▌ | 3.38G/37.3G [05:00<49:22, 11.5MB/s]
Error executing job with overrides: ['retriever=splade-v3', 'reranker=debertav3', 'dataset=popqa', 'generator=vllm_SOLAR-107B']
Traceback (most recent call last):
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 344, in _wait
await waiter
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/implementations/http.py", line 262, in _get_file
chunk = await r.content.read(chunk_size)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 425, in read
await self._wait("read")
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 343, in _wait
with self._timer:
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/helpers.py", line 671, in __exit__
raise asyncio.TimeoutError from exc_val
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gbhatt2/bergen/bergen.py", line 24, in main
rag = RAG(**config, config=config)
File "/home/gbhatt2/bergen/modules/rag.py", line 159, in __init__
self.datasets = ProcessDatasets.process(
File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 624, in process
dataset = processor.get_dataset()
File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 92, in get_dataset
dataset = self.process()
File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 282, in process
dataset = datasets.load_dataset(hf_name, num_proc=self.num_proc)[self.split]
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
builder_instance.download_and_prepare(
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
self._download_and_prepare(
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 1648, in _download_and_prepare
super()._download_and_prepare(
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 978, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/data/gbhatt2/HF_HOME/modules/datasets_modules/datasets/kilt_wikipedia/2538d1b7191d2e7570a1e928e50d7d7751d24f2b2292f0e91ee566af5ebf0183/kilt_wikipedia.py", line 129, in _split_generators
downloaded_path = dl_manager.download_and_extract(
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 326, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 159, in download
downloaded_path_or_paths = map_nested(
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 484, in map_nested
mapped = function(data_struct)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 219, in _download_batched
return [
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 220, in <listcomp>
self._download_single(url_or_filename, download_config=download_config)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 229, in _download_single
out = cached_path(url_or_filename, download_config=download_config)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 205, in cached_path
output_path = get_from_cache(
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 411, in get_from_cache
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 330, in fsspec_get
fs.get_file(path, temp_file.name, callback=callback)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 101, in sync
raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I can download the .json file independently by performing:
wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
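Since wget can fetch the file, the stall looks like a client-side read timeout (the trace ends in asyncio.TimeoutError raised inside aiohttp, wrapped as fsspec.exceptions.FSTimeoutError) rather than a problem with the mirror itself. If so, a commonly suggested workaround is to pass a longer timeout through load_dataset's storage_options, which gets forwarded to fsspec's HTTP client. An untested sketch, with the 4-hour total being an arbitrary choice:

import aiohttp
import datasets

# storage_options is forwarded to fsspec's HTTPFileSystem, which passes
# client_kwargs on to aiohttp.ClientSession; a larger total timeout keeps
# a slow 37 GB download from being cancelled mid-stream (4 h is a guess).
dataset = datasets.load_dataset(
    'kilt_wikipedia',
    storage_options={'client_kwargs': {'timeout': aiohttp.ClientTimeout(total=4 * 3600)}},
)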
Can you provide the version of the datasets library? I'll try to see if I can replicate this with exactly that version.
My datasets version is v3.1.0.
I was facing the same issue as you, but now I can download the dataset. Network issue? Can you try again?
Tried today with datasets version 2.19.1 and it worked. @bhattg any update?
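If downgrading helps, pinning is a one-liner (assuming pip in the same environment; 2.19.1 is the version that worked here):

pip install datasets==2.19.1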
Thanks for the fast responses. I am still unable to download the dataset with 3.1.0 and get stuck at the exact same spot. I switched to 2.19.1 as per the recommendation, and it no longer gets stuck there. I am still downloading the full dataset and will post again if there are any issues.
Edit: I was able to get it working with 2.19.1, but it still throws an error on the latest version, 3.1.0.
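For anyone comparing environments, the installed datasets version can be printed with a standard one-liner:

python -c "import datasets; print(datasets.__version__)"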