princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Missing index.json in dataset shared on drive #40

Closed AnonNoNameAccount closed 10 months ago

AnonNoNameAccount commented 10 months ago

Hello,

Thank you for sharing the dataset used for pruning. However, trying to use it (by setting DATA_DIR in pruning.sh) results in the error below:

Building train loader...
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                          /LLM-Shearing/llmshearing/train.py:317 in <module>                      │
│                                                                                                  │
│   314 │   os.makedirs(save_dir, exist_ok=True)                                                   │
│   315 │   torch.save(cfg, save_dir + "/config.pt")                                               │
│   316 │                                                                                          │
│ ❱ 317 │   main(cfg)                                                                              │
│   318                                                                                            │
│   319                                                                                            │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/train.py:201 in main                          │
│                                                                                                  │
│   198 │                                                                                          │
│   199 │   # Dataloaders                                                                          │
│   200 │   print('Building train loader...')                                                      │
│ ❱ 201 │   train_loader = build_text_dataloader(cfg.train_loader,                                 │
│   202 │   │   │   │   │   │   │   │   │   │    cfg.device_train_batch_size,                      │
│   203 │   │   │   │   │   │   │   │   │   │    cfg.callbacks.data_loading.dynamic,               │
│   204 │   │   │   │   │   │   │   │   │   │    cfg.callbacks.data_loading.set_names,             │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/datasets/load_text_dataloader.py:36 in        │
│ build_text_dataloader                                                                            │
│                                                                                                  │
│    33 │   """                                                                                    │
│    34 │                                                                                          │
│    35 │   if dynamic:                                                                            │
│ ❱  36 │   │   dataset = TextDynamicStreamingDataset(local=cfg.dataset.local,                     │
│    37 │   │   │   │   │   │   │   │   │   │   │     max_seq_len=cfg.dataset.max_seq_len,         │
│    38 │   │   │   │   │   │   │   │   │   │   │     batch_size=device_batch_size,                │
│    39 │   │   │   │   │   │   │   │   │   │   │     shuffle=cfg.dataset.get(                     │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/datasets/streaming_dataset.py:415 in __init__ │
│                                                                                                  │
│   412 │   │   │   │    is_uint16: bool = False):                                                 │
│   413 │   │                                                                                      │
│   414 │   │   # Build Dataset                                                                    │
│ ❱ 415 │   │   super().__init__(local=local,                                                      │
│   416 │   │   │   │   │   │    shuffle=shuffle,                                                  │
│   417 │   │   │   │   │   │    shuffle_seed=shuffle_seed,                                        │
│   418 │   │   │   │   │   │    num_canonical_nodes=num_canonical_nodes,                          │
│                                                                                                  │
│                          /LLM-Shearing/llmshearing/datasets/streaming_dataset.py:114 in __init__ │
│                                                                                                  │
│   111 │   │   │   │    proportion: List[float] = None) -> None:                                  │
│   112 │   │                                                                                      │
│   113 │   │   streams = [Stream(local=local, split=set_name, repeat=1.0) for set_name in set_n   │
│ ❱ 114 │   │   super().__init__(streams=streams,                                                  │
│   115 │   │   │   │   │   │    split=None,                                                       │
│   116 │   │   │   │   │   │    num_canonical_nodes=num_canonical_nodes,                          │
│   117 │   │   │   │   │   │    batch_size=batch_size,                                            │
│                                                                                                  │
│                          /lib/python3.10/site-packages/streaming/base/dataset.py:443 in    │
│ __init__                                                                                         │
│                                                                                                  │
│    440 │   │   self.sample_offset_per_stream = np.zeros(self.num_streams, np.int64)              │
│    441 │   │   self.samples_per_stream = np.zeros(self.num_streams, np.int64)                    │
│    442 │   │   for stream_id, stream in enumerate(self.streams):                                 │
│ ❱  443 │   │   │   stream_shards = stream.get_shards(world)                                      │
│    444 │   │   │   num_stream_samples = sum(map(len, stream_shards))                             │
│    445 │   │   │   if not num_stream_samples:                                                    │
│    446 │   │   │   │   index_filename = os.path.join(stream.local, stream.split, get_index_base  │
│                                                                                                  │
│                                /lib/python3.10/site-packages/streaming/base/stream.py:437 in     │
│ get_shards                                                                                       │
│                                                                                                  │
│   434 │   │   │   │   │   os.rename(tmp_filename, filename)                                      │
│   435 │   │   │   │   else:                                                                      │
│   436 │   │   │   │   │   if not os.path.exists(filename):                                       │
│ ❱ 437 │   │   │   │   │   │   raise RuntimeError(f'No `remote` provided, but local file {filen   │
│   438 │   │   │   │   │   │   │   │   │   │      'does not exist either')                        │
│   439 │   │   │   else:                                                                          │
│   440 │   │   │   │   wait_for_file_to_exist(                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: No `remote` provided, but local file /for_prune/cc/index.json does not exist either

If this is not the correct way to use the shared dataset, could you let me know the recommended way to use it to reproduce the results? Or perhaps could you upload the dataset including the index.json files?
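For reference, the traceback suggests the loader builds Stream(local=DATA_DIR, split=<domain>), which expects DATA_DIR/<domain>/index.json for every domain folder. A quick sanity check I ran looks like the sketch below (the split names are only my guess at the RedPajama domains, not taken from the repo config):

```python
# Sanity check: each split should have DATA_DIR/<split>/index.json.
import os

DATA_DIR = "/for_prune"  # value passed as DATA_DIR in pruning.sh
# Guessed RedPajama domain folders (assumption, adjust to your config):
SPLITS = ["cc", "github", "book", "stackexchange", "wiki", "arxiv", "c4-rp"]

missing = [s for s in SPLITS
           if not os.path.isfile(os.path.join(DATA_DIR, s, "index.json"))]
print("splits missing index.json:", missing or "none")
```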

AnonNoNameAccount commented 10 months ago

The files exist on Google Drive, but they weren't downloading correctly. Closing the issue.
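In case anyone else hits this: re-downloading fixed it. One way to confirm a split downloaded completely is to cross-check the shards listed in its index.json against the files on disk. A minimal sketch, assuming the standard MDS index.json layout and using an example path:

```python
# Verify that every MDS shard listed in a split's index.json is on disk.
import json
import os

split_dir = "/for_prune/cc"  # example split directory
with open(os.path.join(split_dir, "index.json")) as f:
    index = json.load(f)

missing = []
for shard in index["shards"]:
    basename = shard["raw_data"]["basename"]  # e.g. shard.00000.mds
    if not os.path.isfile(os.path.join(split_dir, basename)):
        missing.append(basename)

print("missing shards:", missing or "none")
```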