princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

cannot reshape array of size 4 into shape (1,newaxis,8) #38

Closed rzr002 closed 8 months ago

rzr002 commented 8 months ago

Traceback (most recent call last):
  File "/mnt/workspace/workgroup/qianqin.rzr/LLM-Shearing/llmshearing/train.py", line 317, in <module>
    main(cfg)
  File "/mnt/workspace/workgroup/qianqin.rzr/LLM-Shearing/llmshearing/train.py", line 301, in main
    trainer.fit()
  File "/mnt/workspace/workgroup/qianli.myf/anaconda3/envs/rzr_llmshearing/lib/python3.9/site-packages/composer/trainer/trainer.py", line 1876, in fit
    self._train_loop()
  File "/mnt/workspace/workgroup/qianli.myf/anaconda3/envs/rzr_llmshearing/lib/python3.9/site-packages/composer/trainer/trainer.py", line 2018, in _train_loop
    for batch_idx, self.state.batch in enumerate(self._iter_dataloader(TrainerMode.TRAIN)):
  File "/mnt/workspace/workgroup/qianli.myf/anaconda3/envs/rzr_llmshearing/lib/python3.9/site-packages/composer/trainer/trainer.py", line 3024, in _iter_dataloader
    batch = next(dataloader_iter)
  File "/mnt/workspace/workgroup/qianli.myf/anaconda3/envs/rzr_llmshearing/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/mnt/workspace/workgroup/qianli.myf/anaconda3/envs/rzr_llmshearing/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/mnt/workspace/workgroup/qianli.myf/anaconda3/envs/rzr_llmshearing/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/mnt/workspace/workgroup/qianqin.rzr/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 384, in __iter__
    sample_ids_per_stream = self._get_work(world, epoch, used_sample_ids)
  File "/mnt/workspace/workgroup/qianqin.rzr/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 338, in _get_work
    sample_ids_per_stream = generate_work(self, world, epoch, used_domain_ids)
  File "/mnt/workspace/workgroup/qianqin.rzr/LLM-Shearing/llmshearing/datasets/streaming_dataset.py", line 78, in generate_work
    stream_partition = get_partitions_orig(samples_in_stream, dataset.num_canonical_nodes, world.num_nodes, world.ranks_per_node, world.workers_per_rank, 0, used_stream_ids)
  File "/mnt/workspace/workgroup/qianqin.rzr/LLM-Shearing/llmshearing/datasets/partition.py", line 116, in get_partitions_orig
    ids = ids.reshape(num_physical_nodes, -1, ranks_per_node)
ValueError: cannot reshape array of size 4 into shape (1,newaxis,8)

Has anyone run into this problem? Asking for help.

xiamengzhou commented 8 months ago

It seems that your data size is too small. What is your dataset size?
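For context, the call that fails is `ids = ids.reshape(num_physical_nodes, -1, ranks_per_node)` in `llmshearing/datasets/partition.py`: if the number of sample ids a domain contributes is not a multiple of `num_physical_nodes * ranks_per_node` (1 * 8 = 8 in this run, but only 4 ids were available), the reshape cannot succeed. A minimal numpy sketch of the failure mode, using hypothetical numbers taken from the error message:

```python
import numpy as np

# Hypothetical values matching the error message: 1 node, 8 ranks per node,
# but only 4 sample ids in the stream/domain being partitioned.
num_physical_nodes = 1
ranks_per_node = 8
ids = np.arange(4)

try:
    # Same reshape pattern as partition.py: size 4 is not a multiple of 1 * 8.
    ids.reshape(num_physical_nodes, -1, ranks_per_node)
except ValueError as e:
    print(e)  # cannot reshape array of size 4 into shape (1,newaxis,8)

# With at least one sample id per rank (>= 8 here), the reshape succeeds.
ids = np.arange(8)
print(ids.reshape(num_physical_nodes, -1, ranks_per_node).shape)  # (1, 1, 8)
```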

rzr002 commented 8 months ago

I'm not sure why this error occurred; perhaps, as you said, it was due to insufficient data. Previously I used the 6 domains from SlimPajama together with 5 domains of Chinese data I collected myself, with about 0.2B of data per domain. After restarting the environment, dropping the Chinese data, and reprocessing only the English data from the 6 domains, training is running normally again. If this error reoccurs, I will come back with an update.


xiamengzhou commented 8 months ago

Feel free to reopen it if more issues occur!

18140663659 commented 6 months ago

I used sample_redpajama as the data and ran into the same issue. Is this because the data is too small?

xiamengzhou commented 6 months ago

Yes, sample_redpajama is only a showcase of how to do the data processing. You need more data when actually running the experiments to avoid reshaping or loading issues.
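One way to sanity-check this before launching training is to count how many samples each domain actually contains. A small sketch, assuming the standard mosaicml-streaming MDS layout where every domain directory holds an index.json listing its shards (the DATA_ROOT path below is hypothetical):

```python
import json
from pathlib import Path

# Hypothetical root directory containing one MDS sub-directory per domain.
DATA_ROOT = Path("data/for_prune")

for index_file in sorted(DATA_ROOT.glob("*/index.json")):
    index = json.loads(index_file.read_text())
    # Each shard entry in an MDS index.json records how many samples it holds.
    num_samples = sum(shard["samples"] for shard in index["shards"])
    print(f"{index_file.parent.name}: {num_samples} samples")
```

If any domain reports only a handful of samples, the per-stream partition can end up smaller than nodes * ranks_per_node and trigger the reshape error above.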