Improve binary release for PyTorch domain library

zhangguanheng66 commented 5 years ago

🚀 Feature

Based on the retro meeting following PyTorch 1.2.0 release, the team agreed to improve binary release process across PyTorch domain libraries:

torchvision @fmassa
torchtext @zhangguanheng66
torchaudio @vincentqb @jamarshon

A few general points:

Clear timeline for release to avoid the last minute stress.
Unified release scripts for domain libraries. Some common scripts could stay in the repo and are checked out for release branches.
Ship nightlies in the future?
Standardize the binary testing with a test suite
A general guideline for testing quality

Binary release for Windows

Better binary support for torchvision. Support Windows for torchtext, torchaudio
The Windows binary is actually under high demand, based on the feedback from torchvision
Nightlies?

CC @soumith @cpuhrsch @ezyang @peterjc123

cc @ezyang

zhangguanheng66 commented 5 years ago

A general guideline for testing quality:

Break free from the classical roles and responsibilities
More exhaustive regression testing, in addition to the Travis CI tests
Performance tests
Cover the relevant environment from users’ side

ezyang commented 5 years ago

Ship nightlies in the future?

Nightlies are already being generated for torchvision and torchaudio. For example, see https://anaconda.org/pytorch-nightly/torchvision/files

zhangguanheng66 commented 5 years ago

Can we also have the nightlies for torchtext? :)

ezyang commented 5 years ago

Sure, just copy paste the code accordingly ;)

peterjc123 commented 5 years ago

@ezyang Isn't torchtext a Python-only package?

peterjc123 commented 5 years ago

I tried to trigger a nightly job for torchvision on Windows. The conda jobs passed, while the wheels jobs are currently blocked by https://github.com/pytorch/pytorch.github.io/pull/244/files#r316548499. However, the biggest question is that on which machines are we going to build the binaries with. Are we going to rely on the hosted agents of the online CI or our own agents? The latter way is currently used when we build nightlies for PyTorch.

zhangguanheng66 commented 5 years ago

@ezyang Isn't torchtext a Python-only package?

@peterjc123 torchtext is a python-only package now. However, there is a PR for C++ extension (basic_english_normalize function), and we plan to have a C++ dictionary this half (depending on the results of sentencepiece binding)

ezyang commented 5 years ago

However, the biggest question is that on which machines are we going to build the binaries with. Are we going to rely on the hosted agents of the online CI or our own agents? The latter way is currently used when we build nightlies for PyTorch.

Well, in Linux, we rely on hosted CircleCI for the binaries, and this is probably going to continue to be the case. I'm not too sure about Windows though; I think we should whatever you, @peterjc123, thinks makes the most sense.

peterjc123 commented 5 years ago

@ezyang Could we have some tests that trying to build binaries on WS 2016 and use it on Win7 or WS 2012 R2? If it works, then we can start to build window containers instead of configuring environments before every build. also cc @yf225

ezyang commented 5 years ago

Yes that SGTM. Is there something specific you would like me to do to try to make this happen? One thing that seems possible is to resurrect the WS 2016 Windows AMI and try to shift the CI over to it (since we now know that switching to ninja fixes the build failures.)

peterjc123 commented 5 years ago

I guess I will need two EC2 machines, one with WS 2008 r2 and one with WS2016/2019.

ezyang commented 5 years ago

Assigning myself for Windows EC2 machines. Do you need GPUs on these too?

peterjc123 commented 5 years ago

@ezyang Yes, I just want to tests the CUDA binary compatibility between these OSes.

ezyang commented 5 years ago

WS 2008 may not be so easy; I literally cannot get Packer to log into the WS 2008 base image (I'm using

      "source_ami_filter": {
        "filters": {
          "name": "Amazon/Windows_Server-2008-R2_SP3-English-64Bit-Base-*"
        },
        "owners": ["956863127205"],
        "most_recent": true
      },

ezyang commented 5 years ago

@peterjc123 Are machines booted from the stock images acceptable?

peterjc123 commented 5 years ago

@ezyang Sure.

peterjc123 commented 5 years ago

However, the biggest question is that on which machines are we going to build the binaries with. Are we going to rely on the hosted agents of the online CI or our own agents? The latter way is currently used when we build nightlies for PyTorch.

Well, in Linux, we rely on hosted CircleCI for the binaries, and this is probably going to continue to be the case. I'm not too sure about Windows though; I think we should whatever you, @peterjc123, thinks makes the most sense.

Oh, I just found out that I missed that post. I think it will be better if we could have our own nightly build machines. Because the parallelism of Azure Pipelines is only 10x. It will block the CI tests of the main repo during the building process of nightlies. Currently, we are depending on the three build machines provided by Microsoft, which we could not directly take control of.

ezyang commented 5 years ago

Because the parallelism of Azure Pipelines is only 10x.

Our own nightly build machines are possible. However, we might be able to increase the parallelism of Azure Pipelines. Let me talk to the relevant people.

ezyang commented 5 years ago

@peterjc123 has got his Windows machines, unassigning myself.

peterjc123 commented 5 years ago

@ezyang I managed to run some simple smoke tests using the CUDA binary on WS 2008 R2 that is generated on WS 2016. But when I tried to run test_cuda.py, there are many unspecified launch errors, but this is the same for our current 1.2.0 binaries.

ezyang commented 5 years ago

Hmm... I wonder if a build from source on WS 2008 would be OK. But that doesn't sound promising :(

peterjc123 commented 5 years ago

@ezyang No need to test this, it's probably related to the TDR settings. Also, this is too old, even not listed in the supported OS in the CUDA 10 document.

peterjc123 commented 5 years ago

@ezyang Is it possible to get a Win7 AMI on EC2?

ezyang commented 5 years ago

https://www.quora.com/Can-we-launch-a-Windows-7-instance-on-AWS-If-so-what%E2%80%99s-the-whole-process seems to imply it's not possible. You'll probably have to VirtualBox it or something :/

peterjc123 commented 5 years ago

@ezyang Okay, I just tested the binary on WS 2012 R2, it seems it is working fine and the CUDA tests passed. ~It makes wondering what is the difference between the CUDA_win10_setup.exe and CUDA_win_setup.exe. Someone mentioned that there's something related to WDDM, but if that's true, how can we use these libraries on a different OS.~

ezyang commented 5 years ago

Nice! So that means we can use CircleCI for binary builds? (Also, did you see their message that they have beta GPU support on Windows now)

peterjc123 commented 5 years ago

@ezyang Yes, we could use WS 2019 for building binaries. BTW, could you please apply https://github.com/pytorch/audio/pull/219 to pytorch/vision so that I can learn how to use CircleCI on Windows quickly?

pytorch / pytorch

Improve binary release for PyTorch domain library #24859

🚀 Feature