itberrios / 3D

3D projects

Get Datasets and Dataloaders, num_samples should be a positive integer value, but got num_samples=0 #1

Closed: nimafo closed this issue 11 months ago

nimafo commented 11 months ago

Hi, thanks for the nice tutorial. I am following the steps, and in the notebook cell under "Get Datasets and Dataloaders" that builds the train, validation, and test dataloaders, I get the following error:

ValueError                                Traceback (most recent call last)
Cell In[6], line 10
      7 s3dis_test = S3DIS(ROOT, area_nums='6', split='test', npoints=NUM_TEST_POINTS)
      9 # get dataloaders
---> 10 train_dataloader = DataLoader(s3dis_train, batch_size=BATCH_SIZE, shuffle=True)
     11 valid_dataloader = DataLoader(s3dis_valid, batch_size=BATCH_SIZE, shuffle=True)
     12 test_dataloader = DataLoader(s3dis_test, batch_size=BATCH_SIZE, shuffle=False)

File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:351, in DataLoader.__init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context, generator, prefetch_factor, persistent_workers, pin_memory_device)
    349 else:  # map-style
    350     if shuffle:
--> 351         sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
    352     else:
    353         sampler = SequentialSampler(dataset)  # type: ignore[arg-type]

File ~/.local/lib/python3.10/site-packages/torch/utils/data/sampler.py:107, in RandomSampler.__init__(self, data_source, replacement, num_samples, generator)
    103     raise TypeError("replacement should be a boolean value, but got "
    104                     "replacement={}".format(self.replacement))
    106 if not isinstance(self.num_samples, int) or self.num_samples <= 0:
--> 107     raise ValueError("num_samples should be a positive integer "
    108                      "value, but got num_samples={}".format(self.num_samples))

ValueError: num_samples should be a positive integer value, but got num_samples=0

Do you know what the reason might be?

itberrios commented 11 months ago

Thanks, I will take a look into this when I get a chance. Just want to confirm, did you create the partitioned dataset with this notebook?

nimafo commented 11 months ago

> Thanks, I will take a look into this when I get a chance. Just want to confirm, did you create the partitioned dataset with this notebook?

Hi Isaac, yes.

itberrios commented 11 months ago

I took an initial look at it and did not receive any error.

However, I believe I was able to reproduce the error by forcing the dataset to have a zero length. Can you check whether len(s3dis_train) returns 0? I think this might be a path issue where the dataset class (s3dis_dataset.py) is unable to find the data. I have updated s3dis_dataset.py to exit if the paths are invalid. It's not great, but it at least lets you know where the problem is (if that is even the problem you're having). Also feel free to make suggestions about how to handle this if you'd like.
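
Something like this quick check would confirm it (a sketch: it assumes the notebook's s3dis_train object, and data_paths is my guess at the attribute holding the file list, so check s3dis_dataset.py for the real name):

# a map-style dataset that reports length 0 triggers exactly this
# RandomSampler error when shuffle=True
print(len(s3dis_train))            # 0 means the dataset found no files

# if it is 0, inspect the paths the dataset collected
# (data_paths is an assumed attribute name; verify it in s3dis_dataset.py)
print(s3dis_train.data_paths[:5])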

Please check the file paths and let me know if this helps. If not I'll keep looking into it.

nimafo commented 11 months ago

The problem was simply the path separators. I am using Linux, so changing \\ to / in the S3DIS class solved it.
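
For portability it might be cleaner to avoid hard-coding a separator at all; here is a minimal sketch using os.path.join, with a made-up root and an illustrative directory layout:

import os

ROOT = 'path/to/s3dis'  # placeholder for the real dataset root

# os.path.join inserts the separator for the current OS, so the same
# code runs unchanged on Windows and Linux
h5_path = os.path.join(ROOT, 'Area_6', 'conferenceRoom_1.hdf5')  # illustrative layout
print(h5_path)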

nimafo commented 11 months ago

However, I still have another issue in the next cells of the same notebook, with this code:

# collect every target label from the training dataloader
total_train_targets = []
for (_, targets) in train_dataloader:
    total_train_targets += targets.reshape(-1).numpy().tolist()

total_train_targets = np.array(total_train_targets)

I get the following error (I did not open a new issue so as not to spam, but tell me if you'd prefer one and I will move it there):

KeyError                                  Traceback (most recent call last)
/SemanticSegmentation/3D/point_net/pointnet_seg.ipynb Cell 13 line 2
      1 total_train_targets = []
----> 2 for (_, targets) in train_dataloader:
      3     print(1)
      4     #total_train_targets += targets.reshape(-1).numpy().tolist()
      5
      6 #total_train_targets = np.array(total_train_targets)

File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:633, in _BaseDataLoaderIter.__next__(self)
    630 if self._sampler_iter is None:
    631     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    632     self._reset()  # type: ignore[call-arg]
--> 633 data = self._next_data()
    634 self._num_yielded += 1
    635 if self._dataset_kind == _DatasetKind.Iterable and \
    636         self._IterableDataset_len_called is not None and \
    637         self._num_yielded > self._IterableDataset_len_called:

File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:677, in _SingleProcessDataLoaderIter._next_data(self)
    675 def _next_data(self):
    676     index = self._next_index()  # may raise StopIteration
--> 677     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    678     if self._pin_memory:
    679         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:51, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     49     data = self.dataset.__getitems__(possibly_batched_index)
     50 else:
---> 51     data = [self.dataset[idx] for idx in possibly_batched_index]
     52 else:
     53     data = self.dataset[possibly_batched_index]

File ~/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:51, in <listcomp>(.0)
     49     data = self.dataset.__getitems__(possibly_batched_index)
     50 else:
---> 51     data = [self.dataset[idx] for idx in possibly_batched_index]
     52 else:
     53     data = self.dataset[possibly_batched_index]

File /SemanticSegmentation/3D/point_net/s3dis_dataset.py:53, in S3DIS.__getitem__(self, idx)
     51 def __getitem__(self, idx):
     52     # read data from hdf5
---> 53     space_data = pd.read_hdf(self.data_paths[idx], key='space_slice').to_numpy()
     54     points = space_data[:, :3]   # xyz points
     55     targets = space_data[:, 3]   # integer categories

File ~/.local/lib/python3.10/site-packages/pandas/io/pytables.py:446, in read_hdf(path_or_buf, key, mode, errors, where, start, stop, columns, iterator, chunksize, **kwargs)
    441         raise ValueError(
    442             "key must be provided when HDF5 "
    443             "file contains multiple datasets."
    444         )
    445     key = candidate_only_group._v_pathname
--> 446     return store.select(
    447         key,
    448         where=where,
    449         start=start,
    450         stop=stop,
    451         columns=columns,
    452         iterator=iterator,
    453         chunksize=chunksize,
    454         auto_close=auto_close,
    455     )
    456 except (ValueError, TypeError, KeyError):
    457     if not isinstance(path_or_buf, HDFStore):
    458         # if there is an error, close the store if we opened it.

File ~/.local/lib/python3.10/site-packages/pandas/io/pytables.py:841, in HDFStore.select(self, key, where, start, stop, columns, iterator, chunksize, auto_close)
    839 group = self.get_node(key)
    840 if group is None:
--> 841     raise KeyError(f"No object named {key} in the file")
    843 # create the storer and axes
    844 where = _ensure_term(where, scope_level=1)

KeyError: 'No object named space_slice in the file'

itberrios commented 11 months ago

Hello,

Glad you were able to resolve the previous issue. As for this new one, it looks like the h5 files might not have been saved with the expected key. Each key is expected to be "space_slice", since each file is a slice/partition of the original space. Try running the snippet below on a valid h5 file:

import h5py

# list the top-level keys stored in the HDF5 file
with h5py.File(h5_path, 'r') as f:
    print(f.keys())

I get: <KeysViewHDF5 ['space_slice']>
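
For reference, here is a minimal sketch of writing and reading a partition with that key (the file name and DataFrame contents below are made up for illustration):

import numpy as np
import pandas as pd

# hypothetical slice: xyz coordinates plus an integer category per point
df_slice = pd.DataFrame(np.random.rand(100, 4), columns=['x', 'y', 'z', 'target'])

# write with the key that s3dis_dataset.py expects
df_slice.to_hdf('demo_slice.hdf5', key='space_slice', mode='w')

# reading back with the same key now succeeds instead of raising KeyError
space_data = pd.read_hdf('demo_slice.hdf5', key='space_slice').to_numpy()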

Please let me know if you figure it out from this. If not, please open this in a new issue and we'll work it from there. Don't worry about spamming with new issues; opening them separately helps me track them, which in turn helps me learn to code better.