pytorch / cpuinfo

CPU INFOrmation library (x86/x86-64/ARM/ARM64, Linux/Windows/Android/macOS/iOS)
BSD 2-Clause "Simplified" License

error cpuinfo with pytorch on heroku #20

Closed parthi2929 closed 5 years ago

parthi2929 commented 5 years ago

Hi

I am trying to create a Python service that uses a PyTorch model (via fastai). It runs perfectly fine locally, but on Heroku it gives the error below (my local machine runs Windows 10; Heroku runs Linux).

2019-01-11T12:39:53.041964+00:00 app[web.1]: Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /app/models/resnet34-333f7ec4.pth
2019-01-11T12:39:55.637199+00:00 app[web.1]: 
2019-01-11T12:39:56.543386+00:00 app[web.1]: Error in cpuinfo: failed to parse the list of possible procesors in /sys/devices/system/cpu/possible
2019-01-11T12:39:56.543453+00:00 app[web.1]: Error in cpuinfo: failed to parse the list of present procesors in /sys/devices/system/cpu/present
2019-01-11T12:39:58.976469+00:00 app[web.1]: Traceback (most recent call last):
2019-01-11T12:39:58.976494+00:00 app[web.1]:   File "pytest.py", line 5, in <module>
2019-01-11T12:39:58.976582+00:00 app[web.1]:     model = FastaiImageClassifier()
2019-01-11T12:39:58.976587+00:00 app[web.1]:   File "/app/pymodel.py", line 22, in __init__
2019-01-11T12:39:58.976761+00:00 app[web.1]:     self.learner = self.setup_model(PATH_TO_MODELS_DIR, NAME_OF_PTH_FILE, YOUR_CLASSES_HERE)
2019-01-11T12:39:58.976766+00:00 app[web.1]:   File "/app/pymodel.py", line 37, in setup_model
2019-01-11T12:39:58.976933+00:00 app[web.1]:     learner = create_cnn(data, models.resnet34).load(learner_name_to_load)        
2019-01-11T12:39:58.976935+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.6/site-packages/fastai/basic_train.py", line 213, in load
2019-01-11T12:39:58.977237+00:00 app[web.1]:     state = torch.load(self.path/self.model_dir/f'{name}.pth', map_location=device)
2019-01-11T12:39:58.977240+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.6/site-packages/torch/serialization.py", line 367, in load
2019-01-11T12:39:58.977548+00:00 app[web.1]:     return _load(f, map_location, pickle_module)
2019-01-11T12:39:58.977550+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.6/site-packages/torch/serialization.py", line 528, in _load
2019-01-11T12:39:58.977966+00:00 app[web.1]:     magic_number = pickle_module.load(f)
2019-01-11T12:39:58.977969+00:00 app[web.1]: _pickle.UnpicklingError: invalid load key, 'v'.

As I understand it, the pickle error occurred because the download of the pickle file did not succeed, and that was probably caused by these errors:

2019-01-11T12:39:56.543386+00:00 app[web.1]: Error in cpuinfo: failed to parse the list of possible procesors in /sys/devices/system/cpu/possible
2019-01-11T12:39:56.543453+00:00 app[web.1]: Error in cpuinfo: failed to parse the list of present procesors in /sys/devices/system/cpu/present

My app repo is here. Note that I have disabled all other components of the app and am testing only the Python model-handling part because of this error, so my Procfile just runs python pytest.py.

As requested in the related issue for AWS (14), here is the dump of /proc/cpuinfo from Heroku. Kindly help. I tried both the stable build and the preview (nightly) build, with the same result.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 4
initial apicid  : 4
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 6
initial apicid  : 6
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 4
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 5
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 1
cpu cores   : 4
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 6
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 2
cpu cores   : 4
apicid      : 5
initial apicid  : 5
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 62
model name  : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
stepping    : 4
microcode   : 0x42c
cpu MHz     : 2494.048
cache size  : 25600 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm kaiser fsgsbase smep erms xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 4988.09
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Maratyszcza commented 5 years ago

@parthi2929 Are you using the PyTorch 1.0 release, a nightly build, or a build from the master branch of pytorch/pytorch?

Maratyszcza commented 5 years ago

The cpuinfo issue was fixed about 20 days ago, so it shouldn't show up on a nightly or master build of PyTorch (although I haven't yet received confirmation from the people who reported it). In any case, it is very unlikely that the failed download is related to cpuinfo.

parthi2929 commented 5 years ago

Sorry, I completely forgot to include that important information. Most recently I tried this (contents of requirements.txt):

numpy 
torchvision_nightly
-f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
fastai
zerorpc

Earlier I also tried the one below, with the same result.

https://download.pytorch.org/whl/cpu/torch-1.0.0-cp36-cp36m-linux_x86_64.whl
fastai
zerorpc

soumith commented 5 years ago

I think the cpuinfo "Error" here is just a warning if you use the nightly. The real error is the pickling error, but that just points to a corrupted pickle file (maybe the download failed).
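
One way to test the "corrupted pickle file" theory without unpickling anything is to look at the first bytes of the file. The `'v'` in `invalid load key, 'v'` is the byte the unpickler rejected, and since that read happens at the start of the file, it is very likely the file's first byte. The sketch below is not part of PyTorch; `sniff_checkpoint` is a hypothetical helper, and the categories are assumptions about what such a file might turn out to be. Binary pickles (protocol 2 and later) start with byte `0x80`, zip archives start with `PK`, and a Git LFS pointer file is plain text starting with `version https://git-lfs`. Whether any of these applies to this app is unverified.

```python
from pathlib import Path


def sniff_checkpoint(path):
    """Classify a file by its leading bytes without unpickling it.

    Hypothetical helper for illustration; the categories are guesses
    about what a broken checkpoint file might actually contain.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as f:
        head = f.read(64)
    if not head:
        return "empty"
    if head.startswith(b"PK\x03\x04"):
        return "zip archive"        # zip-based checkpoint format
    if head.startswith(b"\x80"):
        return "binary pickle"      # pickle protocol 2+ marker
    if head.startswith(b"version https://git-lfs"):
        return "git-lfs pointer"    # text stub, not the real weights
    if head.lstrip().startswith(b"<"):
        return "html or xml text"   # e.g. an error page saved as the file
    return "unknown"
```

Running `print(sniff_checkpoint("/app/models/resnet34-333f7ec4.pth"))` on the dyno (the path is taken from the log above) would show whether the file exists and what it appears to contain.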

parthi2929 commented 5 years ago

The repo works locally on my system without any problem. I am now testing not with the nightly but with this build. The build succeeded on Heroku.

Is there a way to verify whether, and why, the download of that model failed? No download error is reported, but the resnet .pth file is not present at that location. My understanding was that the download failed because of cpuinfo, but from what you say that might not be the case. In any event, the .pth file that should have been downloaded is not in the app/models folder on Heroku (checked using the Heroku CLI).

Also, as a workaround, is there a way to load a resnet .pth file myself from a custom location in the repo (downloaded separately and stored in the repo) and have PyTorch use that instead of trying to download it every time? Because Heroku restarts the dyno, PyTorch may try to download the file again and again, leading to big latency issues.

I have set TORCH_MODEL_ZOO = /app/models on Heroku and had a resnet .pth file there earlier, but PyTorch did not pick up that local file anyway, so I have removed it from the repo for now (it's another 80+ MB).
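
A minimal sketch of the "download once, then reuse" idea from this comment, using only the standard library. It relies on the PyTorch model-zoo naming convention, in which the hex suffix of a name like `resnet34-333f7ec4.pth` is the leading digits of the file's SHA-256 hash, so a cached copy can be checked before it is trusted. The names `sha256_prefix_ok` and `fetch_once` are hypothetical and are not PyTorch or fastai APIs; the regular expression is an assumption about the file name shape.

```python
import hashlib
import re
import urllib.request
from pathlib import Path


def sha256_prefix_ok(path):
    """Return True if the hex suffix in the file name matches the file's SHA-256."""
    p = Path(path)
    # Assumed name shape: <name>-<8 or more hex digits>.<ext>
    match = re.search(r"-([0-9a-f]{8,})\.[^.]+$", p.name)
    if match is None or not p.is_file():
        return False
    digest = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(match.group(1))


def fetch_once(url, dest, download=urllib.request.urlretrieve):
    """Download url to dest only if dest is missing or fails the hash check.

    Returns True if a download happened, False if the cached copy was kept.
    """
    dest = Path(dest)
    if sha256_prefix_ok(dest):
        return False  # cached copy looks intact; nothing downloaded
    dest.parent.mkdir(parents=True, exist_ok=True)
    download(url, str(dest))
    if not sha256_prefix_ok(dest):
        raise IOError(f"{dest} does not match the hash in its file name")
    return True
```

Called at startup with the URL and destination from the log above, this would skip the download whenever an intact copy is already on disk and fail loudly when the downloaded bytes are not the expected weights. It does not by itself make the file survive a dyno restart.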

Maratyszcza commented 5 years ago

I checked how cpuinfo works without sysfs and found that it works just fine. Commit 9fa0a0520cb8bedd0fe47168f8479f603dc93ddc removes the error logging when the sysfs files can't be read on x86. Thus, the file download issue is unrelated to cpuinfo. I have never worked with Heroku, but perhaps you need special proxy settings for downloading?
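
For context on the two log lines, `/sys/devices/system/cpu/possible` and `/sys/devices/system/cpu/present` normally hold a CPU list such as `0-7`: comma-separated CPU numbers and inclusive ranges. The sketch below shows how such a list can be parsed, with a fallback when the file is unreadable. It is an illustration in Python, not cpuinfo's actual C implementation, and `parse_cpu_list` and `read_cpu_list` are hypothetical names.

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a list of CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a stray comma
        first, _, last = part.partition("-")
        low = int(first)
        high = int(last) if last else low
        cpus.extend(range(low, high + 1))
    return cpus


def read_cpu_list(path):
    """Return the parsed CPU list, or None if the file is missing or malformed."""
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except (OSError, ValueError):
        return None
```

A caller could treat a `None` result as "sysfs unavailable" and fall back to another source such as `os.cpu_count()` instead of reporting an error.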