sjunhongshen / ORCA

Official implementation of ORCA proposed in the paper "Cross-Modal Fine-Tuning: Align then Refine"

cifar100yaml failed to run #2

Closed JingxuanKang closed 11 months ago

JingxuanKang commented 1 year ago

When I run cifar100 with Docker, it stalls after printing:

```
src feat shape torch.Size([5000, 128]) torch.Size([5000])
num classes 100
```

(It does not actually stop, but it runs for a very long time at this point without finishing.) I read the code and found that the problem is in creating the subsets:

```python
def load_by_class(loader, num_classes):
    if len(train_set.__getitem__(0)) == 3:
        try:
            subsets = {target: torch.utils.data.Subset(train_set, [i for i, (x, y, _) in enumerate(train_set) if y == target]) for target in range(num_classes)}
        except:
            subsets = {target: torch.utils.data.Subset(train_set, [i for i, (x, y, _) in enumerate(train_set) if y.item() == target]) for target in range(num_classes)}
    else:
        try:
            subsets = {target: torch.utils.data.Subset(train_set, [i for i, (x, y) in enumerate(train_set) if y == target]) for target in range(num_classes)}
```

Do you know what causes this? I changed the implementation of this part and was able to get past it, but then I hit the following:

```
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
```

Another piece of information: it also prints `ot.gpu not found - coupling computation will be in cpu`.
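A possible shortcut for the slow subset creation quoted above, sketched under the assumption that `train_set` is the standard torchvision CIFAR-100 dataset object exposing a `.targets` list of integer labels, is to build the per-class indices from the labels alone, so that `enumerate(train_set)` never has to decode and transform every image just to read its label:

```python
import torch
from torchvision import datasets, transforms

# Hypothetical stand-in for the CIFAR-100 train_set used in the snippet above.
train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
num_classes = 100

# .targets holds the integer labels, so no images are decoded or transformed here.
indices = [[] for _ in range(num_classes)]
for i, y in enumerate(train_set.targets):
    indices[y].append(i)

subsets = {target: torch.utils.data.Subset(train_set, indices[target])
           for target in range(num_classes)}
```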

sjunhongshen commented 1 year ago

Hi Jingxuan, could you please provide more information about the GPU/CPU you are using? I suspect it's a memory issue. Also, can you print `len(train_set)` to see the size of the dataset? Thanks!

JingxuanKang commented 1 year ago

Sorry, I accidentally closed the issue.

```
src feat shape torch.Size([5000, 128]) torch.Size([5000])
num classes 100
```

`len(train_set)` = 50000

The GPU I use is an RTX 3090 (24 GB of GPU memory). Below is the memory status:

```
KiB Mem : 32600308 total,  1031116 free,  9692632 used, 21876560 buff/cache
KiB Swap: 33553912 total, 33316856 free,   237056 used. 20834040 avail Mem

  PID USER PR NI    VIRT    RES    SHR S %CPU %MEM   TIME+ COMMAND
 5083 root 25  5 22.254g 7.240g 1.240g S 70.9 23.3 0:30.31 python3
```

sjunhongshen commented 1 year ago

Thanks for the info! I think one way to debug is to expand the for loop `for i, (x, y, _) in enumerate(train_set)`, print out each `i`, and see what's taking so long. For CIFAR-100, this line of code basically loops through all the data and generates 100 data loaders, one for each class. You can see how long each iteration takes as i goes from 0 to 99.
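A minimal sketch of that timing check (the names `train_set` and `num_classes`, and the 3-tuple unpacking, follow the snippet quoted earlier in the thread; this only illustrates the debugging idea and is not code from the repo):

```python
import time
import torch

# For each class, time how long collecting its indices from the dataset takes.
for target in range(num_classes):
    start = time.time()
    indices = [i for i, (x, y, _) in enumerate(train_set) if y == target]
    subset = torch.utils.data.Subset(train_set, indices)
    print(f"class {target}: {len(indices)} samples in {time.time() - start:.1f}s")
```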

JingxuanKang commented 1 year ago

I did the following to solve the slow loading problem (it achieves the same functionality as your code):

```python
try:
    indices = [[] for _ in range(num_classes)]
    for i, (_, y) in enumerate(train_set):
        indices[y].append(i)
    subsets = {target: torch.utils.data.Subset(train_set, indices[target])
               for target in tqdm(range(num_classes), desc="Creating subsets")}
```

The main question is still what causes the broken pipe problem...

JingxuanKang commented 1 year ago

Regarding `for i, (x, y, _) in enumerate(train_set)`: I'm not quite sure what you mean here. With `enumerate(train_set)`, `i` should go from 0 to 50000. Maybe you want me to try something like the code below? It is indeed very slow here: it takes about 10 seconds to move on to the next class.

```python
for target in range(num_classes):
    print(target)
    indices = []
    for i, (x, y) in enumerate(train_set):
        if y == target:
            indices.append(i)
    subset = torch.utils.data.Subset(train_set, indices)
    subsets[target] = subset
```
sjunhongshen commented 1 year ago

I haven't encountered a broken pipe error before. Did you try using only 1 worker for the data loaders? I.e., set `num_workers=1` for all `torch.utils.data.DataLoader` calls in `load_cifar` (data_loaders.py).
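For reference, a hedged sketch of what that change could look like (the actual loader arguments in `load_cifar` may differ; the dataset construction here is only a placeholder):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder dataset standing in for the CIFAR-100 train_set built in load_cifar.
train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())

# num_workers=1 keeps a single worker process; num_workers=0 would load data in the
# main process and avoid worker pipes altogether, which can help isolate the
# BrokenPipeError.
train_loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=1)
```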

JingxuanKang commented 1 year ago

I have tried this, but unfortunately the problem is still there. Thanks for your help.