naver / bergen

Benchmarking library for RAG

Unable to download kilt_wikipedia dataset #35

Open bhattg opened 4 days ago

bhattg commented 4 days ago

Hi!

I've been trying to download kilt_wikipedia; however, every time I try, the download stalls at exactly 3.39G on the progress bar. Any suggestions?

Traceback (most recent call last):                                                                                                                                                                                                     
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 344, in _wait                                                                                                                           
    await waiter                                                                                                                                                                                                                      
asyncio.exceptions.CancelledError                                                                                                                                                                                                     

The above exception was the direct cause of the following exception:                                                                                                                                                                  

Traceback (most recent call last):                                                                                                                                                                                                    
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner                                                                                                                              
    result[0] = await coro                                                                                                                                                                                                            
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/implementations/http.py", line 262, in _get_file                                                                                                           
    chunk = await r.content.read(chunk_size)                                                                                                                                                                                          
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 425, in read                                                                                                                            
    await self._wait("read")                                                                                                                                                                                                          
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 343, in _wait                                                                                                                           
    with self._timer:                                                                                                                                                                                                                 
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/helpers.py", line 671, in __exit__                                                                                                                        
    raise asyncio.TimeoutError from exc_val                                                                                                                                                                                           
asyncio.exceptions.TimeoutError                                                                                                                                                                                                       

The above exception was the direct cause of the following exception:                                                                                                                                                                  

Traceback (most recent call last):                                                                                                                                                                                                    
  File "<stdin>", line 1, in <module>                                                                                                                                                                                                 
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset                                                                                                                     
    builder_instance.download_and_prepare(                                                                                                                                                                                            
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare                                                                                                           
    self._download_and_prepare(                                                                                                                                                                                                       
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 1648, in _download_and_prepare                                                                                                         
    super()._download_and_prepare(                                                                                                                                                                                                    
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 978, in _download_and_prepare                                                                                                          
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)                               
  File "/home/gbhatt2/.cache/huggingface/modules/datasets_modules/datasets/kilt_wikipedia/2538d1b7191d2e7570a1e928e50d7d7751d24f2b2292f0e91ee566af5ebf0183/kilt_wikipedia.py", line 129, in _split_generators
    downloaded_path = dl_manager.download_and_extract(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 326, in download_and_extract                                                                                         
    return self.extract(self.download(url_or_urls))
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 159, in download                                                                                                     
    downloaded_path_or_paths = map_nested(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 484, in map_nested                                                                                                              
    mapped = function(data_struct)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 219, in _download_batched                                                                                            
    return [
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 220, in <listcomp>                                                                                                   
    self._download_single(url_or_filename, download_config=download_config)                                        
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 229, in _download_single                                                                                             
    out = cached_path(url_or_filename, download_config=download_config)                                            
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 205, in cached_path                                                                                                           
    output_path = get_from_cache(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 411, in get_from_cache                                                                                                        
    fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)     
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 330, in fsspec_get                                                                                                            
    fs.get_file(path, temp_file.name, callback=callback)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper          
    return sync(self.loop, func, *args, **kwargs)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 101, in sync             
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError

I am using transformers v4.46.3 and torch v2.3.0+cu121.
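For what it's worth, the trace bottoms out in fsspec's FSTimeoutError, i.e. aiohttp's read timeout firing mid-download. A possible workaround I'm looking at (a sketch only; it assumes load_dataset forwards storage_options to fsspec's HTTP filesystem) is to widen that timeout:

import aiohttp
import datasets

# Sketch: widen aiohttp's read timeout for the underlying HTTP download.
# Assumption: datasets passes these storage_options through to fsspec's
# HTTPFileSystem / aiohttp ClientSession; exact forwarding may depend on
# the datasets version.
dataset = datasets.load_dataset(
    "kilt_wikipedia",
    storage_options={
        "client_kwargs": {
            "timeout": aiohttp.ClientTimeout(total=None, sock_read=600)
        }
    },
)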

DRRV commented 4 days ago

Hi, can you try the plain HF datasets command? (I don't see any lines from Bergen in the stack trace.)

import datasets

dataset = datasets.load_dataset('kilt_wikipedia')

bhattg commented 4 days ago

Thank you for the fast response. The issue persists when loading with the HF datasets command directly. Below is the full stack trace:

python3 bergen.py  retriever=splade-v3 reranker=debertav3 dataset=popqa  generator=vllm_SOLAR-107B                                                                                                   

[2024-11-25 07:42:40,480][datasets][INFO] - PyTorch version 2.3.0 available.                                                                                                                                                          
[2024-11-25 07:42:40,481][datasets][INFO] - TensorFlow version 2.8.0 available.                                                                                                                                                       
Unfinished experiment_folder: experiments/tmp_460be916ccb7601b                                                                                                                                                                        
experiment_folder experiments/460be916ccb7601b                                                                                                                                                                                        
run_name: null                                                                                                                                                                                                                        
dataset_folder: datasets/                                                                                                                                                                                                             
index_folder: indexes/                                                                                                                                                                                                                
runs_folder: runs/                                                                                                                                                                                                                    
generated_query_folder: generated_queries/                                                                                                                                                                                            
processed_context_folder: processed_contexts/                                                                                                                                                                                         
experiments_folder: experiments/                                                                                                                                                                                                      
retrieve_top_k: 50                                                                                                                                                                                                                    
rerank_top_k: 50                                                                                                                                                                                                                      
generation_top_k: 5                                                                                                                                                                                                                   
pyserini_num_threads: 20                                                                                                                                                                                                              
processing_num_proc: 40                                                                                                                                                                                                               
retriever:                                                                                                                                                                                                                            
  init_args:                                                                                                                                                                                                                          
    _target_: models.retrievers.splade.Splade                                                                                                                                                                                         
    model_name: naver/splade-v3                                                                                                                                                                                                       
    max_len: 128                                                                                                                                                                                                                      
  batch_size: 64                                                                                                                                                                                                                      
  batch_size_sim: 512                                                                                                                                                                                                                 
reranker:                                                                                                                                                                                                                             
  init_args:                                                                                                                                                                                                                          
    _target_: models.rerankers.crossencoder.CrossEncoder                                                                                                                                                                              
    model_name: naver/trecdl22-crossencoder-debertav3                                                                                                                                                                                 
    max_len: 256                                                                                                                                                                                                                      
  batch_size: 64                                                                                                                                                                                                                      
generator:                                                                                                                                                                                                                            
  init_args:                                                                                                                                                                                                                          
    _target_: models.generators.vllm.VLLM       
    model_name: Upstage/SOLAR-10.7B-Instruct-v1.0
    max_new_tokens: 128
    max_length: 4096
    batch_size: 256
dataset:
  train:
    doc: null
    query: null
  dev:
    doc:
      init_args:
        _target_: modules.dataset_processor.KILT100w
        split: full
    query:
      init_args:
        _target_: modules.processors.qa_dataset_processor.POPQA
        split: test
  test:
    doc: null
    query: null
prompt:
  system: You are a helpful assistant. Your task is to extract relevant information
    from provided documents and to answer to questions as briefly as possible.
  user: f"Background:\n{docs}\n\nQuestion:\ {question}"
  system_without_docs: You are a helpful assistant. Answer the questions as briefly
    as possible.
  user_without_docs: f"Question:\ {question}"

Processing dataset kilt-100w in full split 
Downloading data:   9%|███████████████▌                                   | 3.38G/37.3G [05:00<49:22, 11.5MB/s]
Error executing job with overrides: ['retriever=splade-v3', 'reranker=debertav3', 'dataset=popqa', 'generator=vllm_SOLAR-107B']
Traceback (most recent call last):
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 344, in _wait
    await waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/implementations/http.py", line 262, in _get_file
    chunk = await r.content.read(chunk_size)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 425, in read
    await self._wait("read")
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/streams.py", line 343, in _wait
    with self._timer:
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/aiohttp/helpers.py", line 671, in __exit__
    raise asyncio.TimeoutError from exc_val
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gbhatt2/bergen/bergen.py", line 24, in main 
    rag = RAG(**config, config=config)
  File "/home/gbhatt2/bergen/modules/rag.py", line 159, in __init__
    self.datasets = ProcessDatasets.process(
  File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 624, in process
    dataset = processor.get_dataset()
  File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 92, in get_dataset
    dataset = self.process()
  File "/home/gbhatt2/bergen/modules/dataset_processor.py", line 282, in process
    dataset = datasets.load_dataset(hf_name, num_proc=self.num_proc)[self.split]
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 1648, in _download_and_prepare
    super()._download_and_prepare(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/builder.py", line 978, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/data/gbhatt2/HF_HOME/modules/datasets_modules/datasets/kilt_wikipedia/2538d1b7191d2e7570a1e928e50d7d7751d24f2b2292f0e91ee566af5ebf0183/kilt_wikipedia.py", line 129, in _split_generators
    downloaded_path = dl_manager.download_and_extract(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 326, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 159, in download
    downloaded_path_or_paths = map_nested(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 484, in map_nested
    mapped = function(data_struct)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 219, in _download_batched
    return [
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 220, in <listcomp>
    self._download_single(url_or_filename, download_config=download_config)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/download/download_manager.py", line 229, in _download_single
    out = cached_path(url_or_filename, download_config=download_config)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 205, in cached_path
    output_path = get_from_cache(
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 411, in get_from_cache
    fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/datasets/utils/file_utils.py", line 330, in fsspec_get
    fs.get_file(path, temp_file.name, callback=callback)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/data/gbhatt2/.conda/envs/pyv2/lib/python3.10/site-packages/fsspec/asyn.py", line 101, in sync
    raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I can download the .json file independently by running:

wget http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json
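Since wget succeeds (and wget -c can resume the partial file), the stall looks specific to the aiohttp path. Here is a sketch that bypasses the datasets download manager and fetches the dump through fsspec with a long read timeout, to isolate which layer is at fault:

import aiohttp
import fsspec

# Sketch: download the KILT dump via fsspec's HTTP filesystem with a long
# read timeout, bypassing the datasets download manager entirely. If this
# also stalls, the problem is in the fsspec/aiohttp layer, not in datasets.
fs = fsspec.filesystem(
    "http",
    client_kwargs={"timeout": aiohttp.ClientTimeout(total=None, sock_read=600)},
)
fs.get_file(
    "http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json",
    "kilt_knowledgesource.json",
)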

Could you share which version of the datasets library you're using? I'd like to check whether I can replicate this with exactly that version. Mine is v3.1.0.

DRRV commented 4 days ago

I was facing the same issue as you, but now I can download the dataset. Perhaps a network issue? Can you try again?

sclincha commented 4 days ago

Tried today with datasets version 2.19.1 and it worked. @bhattg, any update?

bhattg commented 4 days ago

Thanks for the fast responses. I am still unable to download the dataset with datasets 3.1.0; it gets stuck at the exact same spot. I switched to 2.19.1 as recommended, and it no longer stalls there. I am still downloading the full dataset and will post here if anything else comes up.

Edit:

I was able to get it working with 2.19.1, but the latest version, 3.1.0, still throws the error.
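So for anyone else hitting this, the working configuration from this thread is to pin the datasets library below 3.x:

pip install "datasets==2.19.1"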