pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
https://pytorch.org/examples
BSD 3-Clause "New" or "Revised" License

TypeError: can't pickle Environment objects #526

Open · dbrivio opened this issue 5 years ago

dbrivio commented 5 years ago

Hello,

I'm trying to run the dcgan/main.py file to train a GAN. I'm using a Windows 7 system with Python 3.7 (Anaconda).

I ran the following command:

    %run main.py --dataset lsun --dataroot bedroom_train_lmdb/ --niter 1

and got the following output:

    Namespace(batchSize=64, beta1=0.5, cuda=False, dataroot='bedroom_train_lmdb/', dataset='lsun', imageSize=64, lr=0.0002, manualSeed=None, ndf=64, netD='', netG='', ngf=64, ngpu=1, niter=1, nz=100, outf='.', workers=2)
    Random Seed: 482
    Generator(
      (main): Sequential(
        (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU(inplace)
        (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (5): ReLU(inplace)
        (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (8): ReLU(inplace)
        (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (11): ReLU(inplace)
        (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (13): Tanh()
      )
    )
    Discriminator(
      (main): Sequential(
        (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (1): LeakyReLU(negative_slope=0.2, inplace)
        (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (4): LeakyReLU(negative_slope=0.2, inplace)
        (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (7): LeakyReLU(negative_slope=0.2, inplace)
        (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
        (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (10): LeakyReLU(negative_slope=0.2, inplace)
        (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
        (12): Sigmoid()
      )
    )
    Traceback (most recent call last):
      File "Y:\Research\Davide\ML\GAN\lsun-master\main.py", line 210, in <module>
        for i, data in enumerate(dataloader, 0):
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __iter__
        return _DataLoaderIter(self)
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 560, in __init__
        w.start()
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\multiprocessing\process.py", line 112, in start
        self._popen = self._Popen(self)
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
        reduction.dump(process_obj, to_child)
      File "C:\Users\db396\AppData\Local\Continuum\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    TypeError: can't pickle Environment objects

It must be something related to Windows. Any suggestions on how to solve this issue? Thanks!

soumith commented 5 years ago

Add the following lines at the end of the imports section, right after import torchvision.utils as vutils:

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')

soumith commented 5 years ago

Ideally the script should be refactored to push everything into a main() function (that's the root of the problem).
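The shape of that refactor can be sketched with the standard library alone (no PyTorch needed for the illustration; square() and main() are made-up names, but the structure is what a script creating a DataLoader with num_workers > 0 needs on Windows):

```python
# On Windows, worker processes are created with spawn(), which re-imports
# this module in each child. Any code not guarded by __main__ would run
# again in every worker, so all work must live behind main().
import multiprocessing as mp

def square(i):
    # Stand-in for the per-sample work a DataLoader worker would do.
    return i * i

def main():
    # Everything that starts workers goes inside main(), not at top level.
    with mp.get_context("spawn").Pool(2) as pool:
        results = pool.map(square, range(5))
    print(results)  # prints [0, 1, 4, 9, 16]

if __name__ == "__main__":
    main()
```

Without the guard, each spawned child would re-execute the pool creation on import, which is the same failure class as running the unguarded dcgan/main.py on Windows.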

dave7895 commented 4 years ago

I have the same error with a different file, but it's essentially the same situation.

yptheangel commented 4 years ago

I am seeing this when running it on Windows 10, it is solved when I set num_workers=0 for the DataLoader()

sydney0zq commented 4 years ago

@soumith Hello, I have the same issue. I have tried setting the multiprocessing start method to spawn, but it makes no difference and the error still occurs.

Could you please tell me another way to solve it?

rsqai commented 4 years ago

> I am seeing this when running it on Windows 10, it is solved when I set num_workers=0 for the DataLoader()

Perfect solution, but what is the specific reason?

jerinphilip commented 4 years ago

@soumith Can you elaborate on the issue here? The common factor in my code with this code for me is LMDB, and it produces the exact same error. Does this have something to do with trouble pickling the lmdb instance?

jgoodson commented 3 years ago

The issue is that LMDB Environment objects cannot be pickled. Setting num_workers=0 avoids the need to pickle anything, since the original object in the main process handles retrieving the data.

The real solution is to store the Environment in a class with custom __getstate__() and __setstate__() methods that delete the LMDB Environment from the state dictionary before pickling and regenerate it when the object is loaded in the worker.
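A minimal sketch of that pattern, using only the standard library: an open file handle stands in for the LMDB Environment (a real dataset would call something like lmdb.open(self.path, readonly=True, lock=False) in _open_env() instead; the class and method names here are illustrative, not from the original repo):

```python
class LmdbStyleDataset:
    """Holds an unpicklable handle, made picklable via the
    __getstate__/__setstate__ pattern described above."""

    def __init__(self, path):
        self.path = path
        self.env = None  # opened lazily, once per process

    def _open_env(self):
        # Stand-in for lmdb.open(self.path, readonly=True, lock=False)
        return open(self.path, "rb")

    def __getstate__(self):
        # Copy our attributes and drop the unpicklable handle, so a
        # spawn()-based DataLoader worker can receive a clean copy.
        state = self.__dict__.copy()
        state["env"] = None
        return state

    def __setstate__(self, state):
        # Runs in the worker after unpickling; the handle is regenerated
        # on first use rather than here, keeping unpickling cheap.
        self.__dict__.update(state)

    def read_head(self):
        # Reopen the handle on first access in this process.
        if self.env is None:
            self.env = self._open_env()
        self.env.seek(0)
        return self.env.read(4)
```

With these two methods in place, pickle.dumps() on the dataset succeeds because the handle never crosses the process boundary; each spawned worker reopens it on first access.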

pritamqu commented 3 years ago

> I am seeing this when running it on Windows 10, it is solved when I set num_workers=0 for the DataLoader()

You saved me, man!! Thanks.

shoutOutYangJie commented 2 years ago

I found some GitHub repos that use both LMDB and num_workers > 0 and nevertheless work successfully, but I don't know why. You can find the examples here: stylegan2 dataset

clipBert dataset

@jgoodson @neillbyrne @ruotianluo @airsplay the solution

HelloWorld-1017 commented 3 months ago

> I am seeing this when running it on Windows 10, it is solved when I set num_workers=0 for the DataLoader()

> Perfect solution, but what is the specific reason?

As the PyTorch documentation explains (torch.utils.data: Platform-specific behaviors — PyTorch 2.3 documentation), Python multiprocessing relies on different start methods on different platforms: fork() on Unix, but spawn() on Windows (and macOS since Python 3.8). With spawn(), the DataLoader must pickle the dataset object and send it to each worker process, so when num_workers is not zero the unpicklable LMDB Environment inside the dataset triggers this error on Windows; fork() on Unix shares the object with the workers without pickling, which is why the same code runs there. Setting num_workers=0 keeps everything in the main process and avoids pickling altogether. Independently, on Windows the code that iterates the dataloader must be wrapped under if __name__ == '__main__': so that spawned workers can safely re-import the module.
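The failure mode is easy to reproduce without PyTorch or LMDB: any unpicklable handle triggers the same class of error when pickled, which is exactly what spawn()-based worker startup does. Here a thread lock stands in for the LMDB Environment:

```python
import pickle
import threading

class Dataset:
    """Toy dataset whose 'env' attribute, like an LMDB Environment,
    cannot be pickled (a thread lock stands in for it here)."""
    def __init__(self):
        self.env = threading.Lock()

ds = Dataset()
try:
    # spawn()-based worker startup must do this to send the dataset to
    # each child process; fork()-based startup on Unix does not.
    pickle.dumps(ds)
except TypeError as e:
    print("pickling failed:", e)
```

Defining __getstate__/__setstate__ to drop and regenerate the handle, as described earlier in the thread, is what makes the dumps() call succeed.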
