princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Missing index.json in dataset shared on drive #40

Closed AnonNoNameAccount closed 10 months ago

AnonNoNameAccount commented 10 months ago

Hello,

Thank you for sharing the dataset used for pruning. However, trying to use it (by setting DATA_DIR in pruning.sh) results in the error below:

Building train loader...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                          /LLM-Shearing/llmshearing/train.py:317 in <module>                      │
│                                                                                                  │
│   314 │   os.makedirs(save_dir, exist_ok=True)                                                   │
│   315 │   torch.save(cfg, save_dir + "/config.pt")                                               │
│   316 │                                                                                          │
│ ❱ 317 │   main(cfg)                                                                              │
│   318                                                                                            │
│   319                                                                                            │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/train.py:201 in main                          │
│                                                                                                  │
│   198 │                                                                                          │
│   199 │   # Dataloaders                                                                          │
│   200 │   print('Building train loader...')                                                      │
│ ❱ 201 │   train_loader = build_text_dataloader(cfg.train_loader,                                 │
│   202 │   │   │   │   │   │   │   │   │   │    cfg.device_train_batch_size,                      │
│   203 │   │   │   │   │   │   │   │   │   │    cfg.callbacks.data_loading.dynamic,               │
│   204 │   │   │   │   │   │   │   │   │   │    cfg.callbacks.data_loading.set_names,             │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/datasets/load_text_dataloader.py:36 in        │
│ build_text_dataloader                                                                            │
│                                                                                                  │
│    33 │   """                                                                                    │
│    34 │                                                                                          │
│    35 │   if dynamic:                                                                            │
│ ❱  36 │   │   dataset = TextDynamicStreamingDataset(local=cfg.dataset.local,                     │
│    37 │   │   │   │   │   │   │   │   │   │   │     max_seq_len=cfg.dataset.max_seq_len,         │
│    38 │   │   │   │   │   │   │   │   │   │   │     batch_size=device_batch_size,                │
│    39 │   │   │   │   │   │   │   │   │   │   │     shuffle=cfg.dataset.get(                     │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/datasets/streaming_dataset.py:415 in __init__ │
│                                                                                                  │
│   412 │   │   │   │    is_uint16: bool = False):                                                 │
│   413 │   │                                                                                      │
│   414 │   │   # Build Dataset                                                                    │
│ ❱ 415 │   │   super().__init__(local=local,                                                      │
│   416 │   │   │   │   │   │    shuffle=shuffle,                                                  │
│   417 │   │   │   │   │   │    shuffle_seed=shuffle_seed,                                        │
│   418 │   │   │   │   │   │    num_canonical_nodes=num_canonical_nodes,                          │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/datasets/streaming_dataset.py:114 in __init__ │
│                                                                                                  │
│   111 │   │   │   │    proportion: List[float] = None) -> None:                                  │
│   112 │   │                                                                                      │
│   113 │   │   streams = [Stream(local=local, split=set_name, repeat=1.0) for set_name in set_n   │
│ ❱ 114 │   │   super().__init__(streams=streams,                                                  │
│   115 │   │   │   │   │   │    split=None,                                                       │
│   116 │   │   │   │   │   │    num_canonical_nodes=num_canonical_nodes,                          │
│   117 │   │   │   │   │   │    batch_size=batch_size,                                            │
│                                                                                                  │
│                          /lib/python3.10/site-packages/streaming/base/dataset.py:443 in    │
│ __init__                                                                                         │
│                                                                                                  │
│    440 │   │   self.sample_offset_per_stream = np.zeros(self.num_streams, np.int64)              │
│    441 │   │   self.samples_per_stream = np.zeros(self.num_streams, np.int64)                    │
│    442 │   │   for stream_id, stream in enumerate(self.streams):                                 │
│ ❱  443 │   │   │   stream_shards = stream.get_shards(world)                                      │
│    444 │   │   │   num_stream_samples = sum(map(len, stream_shards))                             │
│    445 │   │   │   if not num_stream_samples:                                                    │
│    446 │   │   │   │   index_filename = os.path.join(stream.local, stream.split, get_index_base  │
│                                                                                                  │
│                                /lib/python3.10/site-packages/streaming/base/stream.py:437 in     │
│ get_shards                                                                                       │
│                                                                                                  │
│   434 │   │   │   │   │   os.rename(tmp_filename, filename)                                      │
│   435 │   │   │   │   else:                                                                      │
│   436 │   │   │   │   │   if not os.path.exists(filename):                                       │
│ ❱ 437 │   │   │   │   │   │   raise RuntimeError(f'No `remote` provided, but local file {filen   │
│   438 │   │   │   │   │   │   │   │   │   │      'does not exist either')                        │
│   439 │   │   │   else:                                                                          │
│   440 │   │   │   │   wait_for_file_to_exist(                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: No `remote` provided, but local file /for_prune/cc/index.json does not exist either

If this is not the correct way to use the shared dataset, could you let me know the recommended way to use it to reproduce the results? Or perhaps could you upload the dataset including the index.json files?
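For reference, the traceback suggests the loader builds Stream(local=DATA_DIR, split=<domain>), which expects DATA_DIR/<domain>/index.json for every domain folder. A quick sanity check I ran looks like the sketch below (the split names are only my guess at the RedPajama domains, not taken from the repo config):

```python
# Sanity check: each split should have DATA_DIR/<split>/index.json.
import os

DATA_DIR = "/for_prune"  # value passed as DATA_DIR in pruning.sh
# Guessed RedPajama domain folders (assumption, adjust to your config):
SPLITS = ["cc", "github", "book", "stackexchange", "wiki", "arxiv", "c4-rp"]

missing = [s for s in SPLITS
           if not os.path.isfile(os.path.join(DATA_DIR, s, "index.json"))]
print("splits missing index.json:", missing or "none")
```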

AnonNoNameAccount commented 10 months ago

The files exist on Google Drive, but they weren't downloading correctly. Closing the issue.
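In case anyone else hits this: re-downloading fixed it. One way to confirm a split downloaded completely is to cross-check the shards listed in its index.json against the files on disk. A minimal sketch, assuming the standard MDS index.json layout and using an example path:

```python
# Verify that every MDS shard listed in a split's index.json is on disk.
import json
import os

split_dir = "/for_prune/cc"  # example split directory
with open(os.path.join(split_dir, "index.json")) as f:
    index = json.load(f)

missing = []
for shard in index["shards"]:
    basename = shard["raw_data"]["basename"]  # e.g. shard.00000.mds
    if not os.path.isfile(os.path.join(split_dir, basename)):
        missing.append(basename)

print("missing shards:", missing or "none")
```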