ryendu opened this issue 3 years ago
Hey! Thank you for your interest in contributing to this project. I really appreciate it.
The biggest missing thing in DL4S is GPU support. The experimental ArrayFire implementation that I built is far from ideal, initial performance benchmarks are not great and ArrayFire uses OpenCL, which doesn't run on iOS and isn't well maintained on macOS.
Ideally this library would have a way to compile dynamic compute graphs using a combination of MetalPerformanceShaderGraph or MLCompute operations and manual Metal code generation (so basically a neural network compiler operating on dynamic graphs).
Besides that, there are a few missing features, and the convolution and pooling operations would need some performance tuning.
Missing features besides GPU support that come to mind right now are:
- An equivalent of PyTorch's `with torch.no_grad():` context for evaluating tensors without gradient tracking (see the sketch below)
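A minimal sketch of what such a gradient-free scope could look like in Swift (hypothetical names, not an existing DL4S API; a real implementation would also need the tensor operations to consult the flag and handle threading):

```swift
// Hypothetical sketch of a torch.no_grad()-style scope. `GradientContext` and
// `withoutGradient` are made-up names used only for illustration.
enum GradientContext {
    // Assumption: tensor operations would check this flag before recording
    // themselves into the backward graph.
    static var isRecording = true
}

func withoutGradient<Result>(_ body: () throws -> Result) rethrows -> Result {
    let previous = GradientContext.isRecording
    GradientContext.isRecording = false
    defer { GradientContext.isRecording = previous }
    return try body()
}

// Usage: evaluate a model without building a backward graph.
// let prediction = withoutGradient { model(input) }
```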
There's really a lot of stuff that can be added to DL4S. Let me know if you want to work on any one of these or if you find anything else missing that you want to add.
I think I'll first start small with 1D convolutions and dilated convolutions. Also, in the future, should I make an issue with a proposal of the changes I plan to make, make the changes, and then open a pull request, or should I just open a pull request with the changes?
Great! You can directly create a PR, no need to create an issue.
You can use the existing code that I wrote for 2D convolutions (im2col/col2im) and introduce the dilation factor. Please make sure to add a few test cases to ensure that your code behaves as expected and that performance is good.
The THNN library behind PyTorch has an implementation with dilation here.
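For reference, a small sketch (not the existing DL4S im2col code) of where the dilation factor enters the 1D convolution index math:

```swift
// A dilated kernel spans (kernel - 1) * dilation + 1 input positions, which is
// the only change needed in the output-size formula.
func conv1DOutputLength(input: Int, kernel: Int, stride: Int = 1,
                        padding: Int = 0, dilation: Int = 1) -> Int {
    let effectiveKernel = (kernel - 1) * dilation + 1
    return (input + 2 * padding - effectiveKernel) / stride + 1
}

// In an im2col-style gather, dilation simply scales the kernel offset:
//   sourceIndex = outputIndex * stride - padding + kernelOffset * dilation
```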
Awesome! Thanks for the tip, I'll get started working on it!
I'm very happy to see that people still want to push Swift forward in writing deep learning algorithms (even after Google stopped the S4TF project). 🙂
I definitely want to help! Check out ARHeadsetKit - it makes heavy use of GPGPU and is entirely in Swift. I have an aspiration to revive Swift for TensorFlow, and I had no idea this repo existed!
I'm currently working on tutorials and publicity for ARHeadsetKit, but several months down the road, I think this would be an epic way to delve into machine learning!
Lack of (real) enthusiasm and the inability to use S4TF for everyday tasks were the reasons it failed. I'm sure we can change its fate and get something like Tensor<Half, GPU> running on an iPhone!
MTLLayeredBuffer will be the key to making this possible!
Thanks for the interest. As my time is pretty limited at the moment, I am actually quite happy to accept additional maintainers for this project. Let me know if you are interested. Metal support was always on my list, though my experimental Metal implementation (feature/metal) isn't up to the task, and I unfortunately had to shift my focus to other projects.
There is definitely room to improve on your Metal support. Just scrolling through the only shader file, I can see a lot of areas where the code can be cleaned up. I have a thousand hours of experience using Metal for GPGPU, and I may be the only person who can pull this off.
Again, this is months in the future. However, I might just need a break from AR some day and start exploring your repo.
Also, do you have Google Cardboard?
If you need anything, let me know.
And yes, I do own a few Cardboards.
You could be the first person outside my home to ever experience ARHeadsetKit on Google Cardboard! Would you mind downloading AR MultiPendulum?
App Link: https://apps.apple.com/app/ar-multipendulum/id1583322801
Unless I’m mistaken, this would make DL4S the first ML scripting API besides Python TensorFlow 2.5 to have hardware acceleration on a Mac, and the first ever for iOS.
I've been following this project too. This project has already made good progress in terms of the constructs of DL algorithms. Yeah, for the open source world, it will be the first DL library with Metal support on macOS and iOS.
But Apple has a few Metal-accelerated DL frameworks, like Metal Performance Shaders, Metal Performance Shaders Graph, and ML Compute. And I've worked heavily with the Metal & MPS Graph frameworks. Lol.
By the way, about me: I authored the Deep Learning with Swift for TensorFlow book this year. I wanted S4TF to succeed (especially Swift in the DL world). Although I don't have much time to contribute, I'd love to make contributions whenever possible!
One suggestion about API design: unlike S4TF, where you have to specify the generic type during initialization, you can follow a much more elegant NumPy-style data type specification (which provides a dtype argument). The same approach is used in MPS, where the argument name is dataType (an enumeration type).
Regards, Rahul Bhalley
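For concreteness, the two styles being compared look roughly like this (both spellings are illustrative; the dtype-style initializer in particular is hypothetical, not DL4S's API):

```swift
// Hypothetical dtype-style API: the element type is a runtime value.
enum DType { case float32, float16, int32 }

struct DynamicTensor {
    var dtype: DType
    var shape: [Int]
}

// Generic style (as in S4TF and DL4S): the element type is a compile-time parameter.
struct GenericTensor<Element> {
    var shape: [Int]
}

// let a = DynamicTensor(dtype: .float32, shape: [2, 3])
// let b = GenericTensor<Float>(shape: [2, 3])
```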
Unless I’m mistaken, this would make DL4S the first ML scripting API besides Python TensorFlow 2.5 to have hardware acceleration on a Mac, and the first ever for iOS.
There have been other Metal accelerated libraries like Serrano. Though none of the libraries I found have a particularly Swifty API and support the dynamic behaviors that DL4S does.
Also, there are neural network compilers that target Metal, like TVM.
One suggestion about API design: unlike S4TF, where you have to specify the generic type during initialization, you can follow a much more elegant NumPy-style data type specification (which provides a dtype argument). The same approach is used in MPS, where the argument name is dataType (an enumeration type).
Having a generic type solves multiple pain points:
- It eliminates some types of errors at compile time. The type system works to our advantage. Erasing the type wouldn't be Swifty.
- It allows for different overloads to be chosen at compile time. For example, subscripting with a vector of booleans (UInt8) would result in a different method call than subscripting with an index vector (Int32); see the sketch below.
- Having the data type available at compile time allows for generic specialization, which reduces overhead (which is one of the key reasons why DL4S outperforms other libraries on the CPU in certain models).
- The data type of a tensor conveys information about its role to the developer. A float tensor is used in other ways than an integer tensor.
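A toy illustration of the overload point (the names are made up, not DL4S's real subscript API): the element type of the argument picks the implementation at compile time, with no runtime dtype dispatch.

```swift
struct FakeTensor<Element> {
    var elements: [Element]
}

extension FakeTensor {
    // Chosen at compile time when subscripting with a boolean mask (UInt8).
    subscript(selection: FakeTensor<UInt8>) -> FakeTensor {
        FakeTensor(elements: zip(elements, selection.elements).compactMap { $1 != 0 ? $0 : nil })
    }

    // Chosen at compile time when subscripting with an index vector (Int32).
    subscript(selection: FakeTensor<Int32>) -> FakeTensor {
        FakeTensor(elements: selection.elements.map { elements[Int($0)] })
    }
}
```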
 I’ve rarely ever used MPS - just for some rare debugging cases when authoring a shader would take more time than calling an MPSKernel. Depending on the nature of how tensors are allocated, I might end up making DL4S a wrapper over MPS. This is just speculation, though, because I haven’t taken a serious look at the source code yet.
I would also add support for Float16 and UInt64/Int64 generic types for the GPU, even though they might never be used in practice. I also lack knowledge on this, but there might not be MPS support for half-precision NN operations. @RahulBhalley - could you give me some insight into this, since you have worked heavily with MPS before?
Also, I snagged the “s4tf” organization name, if you want to join it just for kicks.
If there is a massive overhead to communicating with a discrete GPU on Intel Macs, would it be acceptable to have something like “iGPU” and “eGPU” device names (in addition to CPU and GPU)?
The reason is that integrated GPUs might be faster than either a discrete GPU or a CPU for some models, and developers would have an easy way to choose either one if available (e.g. Tensor<Float, GPU> -> Tensor<Float, eGPU>). If an eGPU is not available, it would act the same as just saying "GPU".
The main difference would simply be that with iGPU, certain buffers use system shared memory, while with eGPU, they are GPU-private. You might want to mix and match CPU and GPU tensors as well on certain models, which further adds the need for flexibility in workflows.
Alternatively, this might be resolved by setting a static type property on GPU (whether integrated/external is preferred) - although that isn’t thread safe and takes away from the flexibility I mentioned above. I just want to throw around ideas and see what options I have to explore a few months down the road.
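A rough sketch of that idea (not DL4S's current API; the protocol and selection logic are made up, and MTLCopyAllDevices is macOS-only):

```swift
import Metal

protocol MetalDeviceKind {
    static var preferredDevice: MTLDevice? { get }
}

enum iGPU: MetalDeviceKind {
    // Prefer the integrated (low-power, shared-memory) GPU when one exists.
    static var preferredDevice: MTLDevice? {
        MTLCopyAllDevices().first { $0.isLowPower } ?? MTLCreateSystemDefaultDevice()
    }
}

enum eGPU: MetalDeviceKind {
    // Prefer an external or discrete GPU; otherwise behave exactly like "GPU".
    static var preferredDevice: MTLDevice? {
        MTLCopyAllDevices().first { $0.isRemovable }
            ?? MTLCopyAllDevices().first { !$0.isLowPower }
            ?? MTLCreateSystemDefaultDevice()
    }
}
```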
I might end up making DL4S a wrapper over MPS.
That'll be amazing.
I would also add support for Float16 and UInt64/Int64 generic types for the GPU, even though they might never be used in practice. I also lack knowledge on this, but there might not be MPS support for half-precision NN operations. @RahulBhalley - could you give me some insight into this, since you have worked heavily with MPS before?
@philipturner you should be able to add FP16 precision via MPS (see MPSDataType). I haven't used low-precision computation in MPS but you should be able to provide UInt8-based quantized computation like PyTorch.
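As a small illustration of the MPSDataType route (not code from DL4S), describing half-precision data to an MPSMatrix looks roughly like this:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!
let rows = 128, columns = 256
let rowBytes = columns * 2  // 2 bytes per FP16 element

let buffer = device.makeBuffer(length: rows * rowBytes, options: .storageModeShared)!

// MPS takes the element type as a runtime enum rather than a generic parameter.
let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns,
                                     rowBytes: rowBytes, dataType: .float16)
let matrix = MPSMatrix(buffer: buffer, descriptor: descriptor)
```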
Also, I snagged the “s4tf” organization name, if you want to join it just for kicks.
Sure, I would love to join it.
Having a generic type solves multiple pain points:
- It eliminates some types of errors at compile time. The type system works to our advantage. Erasing the type wouldn't be Swifty.
- It allows for different overloads to be chosen at compile time. For example, subscripting with a vector of booleans (UInt8) would result in a different method call than subscripting with an index vector (Int32).
- Having the data type available at compile time allows for generic specialization, which reduces overhead (which is one of the key reasons why DL4S outperforms other libraries on the CPU in certain models)
- The data type of a tensor conveys information about its role to the developer. A float tensor is used in other ways than an integer tensor.
Sure @palle-k.
I was just concerned that I had to write Tensor<Float> every time I initialized a variable in S4TF. So I always used to write typealias TFloat = Tensor<Float>; it just felt cleaner. I think the S4TF team was also talking about improving this syntax.
But never mind, it's fine if generic type specification gives an advantage to compiler-level optimization.
I sent you an invitation to the organization!
Thanks, joined.
@RahulBhalley do you have enough time to try learning my framework ARHeadsetKit? It requires Xcode and I have 3 hours worth of tutorials so far. The people who have already tried the tutorials said they were really fun, and I think you might enjoy them as well.
Before you read this comment, note that it's monstrously large and maybe a bit overboard on detail.
I have some experience emulating double precision on GPUs, in a past attempt to accelerate MultiPendulum. The attempt failed because the dynamic range was the real problem, not precision (in the end, I couldn't even use two parallel CPU cores to accelerate the simulation because of cache incoherence). However, I gained a lot of experience with emulating higher-precision arithmetic.
Metal does not allow double-precision arithmetic, even on GPUs that support it natively. Under emulation, it would take dozens of GPU clock cycles to do a single addition or multiplication. However, devices with native double precision already have a fraction of the throughput of single-precision, so it might actually have comparable performance.
On devices with a weak GPU or workloads that underutilize the GPU, it will be faster to run double-precision arithmetic on the CPU. This would most likely be the case on the Intel Mac Mini and iPhone 8.
However, when the GPU is massive like the M1 Max or AMD 6000 series, running large-scale double-precision computations could be faster than the CPU, even under emulation. There might be some special use cases for DL4S where Tensor<Double, GPU> could benefit performance. For example, it could offload work from the CPU in a mixed CPU-GPU tensor workflow. This means there is a reason to bring Tensor<Double, GPU> to this framework.
Side Note: What are your opinions on mixing CPU and GPU tensors in DL4S? It might require redesigning some code internal to DL4S, which I haven't had the chance to study yet.
A few months down the road, I might make a Metal library for emulating double-precision or quad-precision on iOS and macOS GPUs. The setup for compiling the GPU shaders is very complicated (see the "Xcode Framework" option for ARHeadsetKit's configuration). So, I would post the open-source MSL code in this repository, but compile it locally and upload the .metallib file to DL4S. At runtime, DL4S would load shader functions from the library.
The double-precision shaders would have to be compiled separately anyway. This is because Metal's fast math (-ffast-math) breaks accumulation of rounding errors, which are crucial to emulating double precision. Furthermore, this workflow eliminates the headache of linking an ARHeadsetKit-like Metal library to DL4S.
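The core of the emulation is the classic "double-float" trick, shown here in Swift for readability (on the GPU the same arithmetic would live in a Metal shader):

```swift
// A double is represented as an unevaluated sum of two floats, hi + lo.
struct DoubleFloat {
    var hi: Float
    var lo: Float
}

// Error-free addition of two floats (Knuth's two-sum). It only works if the
// compiler does not reassociate these expressions, which is exactly why fast
// math has to stay off for these kernels.
func twoSum(_ a: Float, _ b: Float) -> DoubleFloat {
    let s = a + b
    let bVirtual = s - a
    let aVirtual = s - bVirtual
    let bError = b - bVirtual
    let aError = a - aVirtual
    return DoubleFloat(hi: s, lo: aError + bError)
}
```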
Having support for 64-bit integers and 64-bit floating point numbers on the GPU could be a unique advantage to DL4S, since not many ML frameworks have it outside of the CPU (if any). Just having that feature might motivate people to try DL4S when they wouldn't otherwise, or make DL4S vital to some rare ML use cases.
With bfloat16, it is proven that the bottleneck for ML is dynamic range, not precision. Emulated double precision will not let you expand past 8 bits of exponent. If developers expect otherwise, they may meet some unfortunate surprises (it broke MultiPendulum).
We would need to at least briefly document this behavior if I add support for Tensor<Double, GPU>. My personal repository for double-precision arithmetic would go into greater detail with edge cases and performance limitations. @palle-k, would this scenario be acceptable for DL4S?
@RahulBhalley could you make your s4tf membership public? I can't change your visibility myself. Here's a picture in case you aren't familiar with the process:
Done.
I may also add DocC documentation to enhance the developer experience in Xcode and help with the documentation hosted on Jazzy. I would also make some super awesome Xcode tutorials for this framework, like the ones for ARHeadsetKit. What do both of you think about this?
I wish that @palle-k would at least respond to say he read my comments. I have thrown several major ideas there and asked for his feedback, but all I have got is silence. I really don’t mean to be rude, but could you at least add a comment to this thread if you have the time?
@philipturner Sorry for not replying that often, I really wish I had more time to invest in this project.
All these ideas sound great. Though I guess the most important part would be getting the basic GPU computation pipeline running to the point where we can get performance comparable to other GPU accelerated frameworks.
I've also listed a bunch of things that need to be done further up this thread. These especially include improved subscripting.
Double precision is rarely used in ML. I don't have any specific use cases that would benefit from this amount of precision. However, if it is something that is important to a specific use case of yours, then feel free to implement it.
As you are familiar with GPU programming I would assume your time brings the most value there.
Thank you so much!
I expect to begin work on this sooner than I anticipated - November or December 2021. @RahulBhalley, generally how busy are you compared to @palle-k? I might benefit from having you routinely validate my commits and contribute ideas.
I’ll prioritize getting it up and running first, postponing double and half precision until a time when I have absolutely nothing else to do. I looked over the introduction a second time, and there is a lot of work we need to do to get GPU up and running.
Everything from logging on DL4S-Tensorboard to the training algorithm needs to be heavily modified for the GPU, although the public API will not change. From Apple’s documentation, it isn’t clear if MPS has good acceleration for RNNs. @RahulBhalley, what is your experience with this? I will still research its performance, but knowing your feedback will help me prioritize the tasks of creating a custom graph compiler and RNN shaders.
RNNs will be especially difficult, but GPGPU is all about thinking outside of the box to maximize ALU utilization. I may look into performing multiple batches simultaneously, pipelining, etc. We may also need to predict the time span of command buffers to split up work correctly. Too short may waste time in kernel calls, while too long could cause Metal to abort a command buffer. I’m not 100% sure, but that might be an iOS-specific problem.
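A sketch of what splitting work across command buffers could look like (illustrative only; the actual kernels to encode are elided):

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let chunkCount = 8  // assumption: the step divides into independent chunks
for _ in 0..<chunkCount {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    // ... encode this chunk's kernels here ...
    encoder.endEncoding()
    commandBuffer.addCompletedHandler { buffer in
        // Measured GPU time could feed back into how much work the next chunk gets.
        _ = buffer.gpuEndTime - buffer.gpuStartTime
    }
    commandBuffer.commit()  // don't wait; the next chunk is encoded while this one runs
}
```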
@RahulBhalley, what do you think about these ideas for implementing Metal in DL4S?
@RahulBhalley, generally how busy are you compared to @palle-k?
I will be busy but can help with commits validation.
I’ll prioritize getting it up and running first, postponing double and half precision until a time when I have absolutely nothing else to do.
Absolutely, I also believe getting float 32-bit precision tensor ops to run on GPU/NPU should be a priority for now.
From Apple’s documentation, it isn’t clear if MPS has good acceleration for RNNs. @RahulBhalley, what is your experience on this? I will still research its performance.
I have experience with convolutional networks using the Metal Performance Shaders Graph framework. I've never used RNNs with it. But I just want to point out: I wrote a UNet in MPS Graph, and the inference speed difference between the CoreML model (of course we can't use this format, it's a different world) and MPS Graph was huge, even though CoreML uses MPS in the backend.
Thanks for the information. That model operates on audio, which is a textbook case of an RNN. From the image you showed, Apple didn't put in any real effort into RNNs with MPS. Python TensorFlow actually called MPS under the hood, so they have similar performance. However, Core ML has an entirely different runtime, which clearly optimizes RNNs for the GPU.
CoreML was also made by Apple, which shows that MPS is only slow because they don't care to accelerate it (typical of how Apple treats Metal developer tools). You need to pipeline the data to fully utilize the GPU, which might be easier in inference mode—what CoreML is designed for.
It also isn't the fault of the GPU itself. As demonstrated by the 5th and 3rd logs, the CPU slows down the model and the NPU has no effect. That means the GPU is faster than the CPU, and CoreML didn't even touch the NPU.
I just got an M1 Max MBP, which has basically infinite computing power for some applications. It runs 8192 32-bit operations per clock cycle, and we need pipelining and concurrent execution of training data within a batch to even make a dent in that number. We will definitely be making a graph compiler from scratch.
That model operates on audio, which is a textbook case of an RNN.
I think you misunderstood (or maybe not). This model (UNet, a convolutional neural network) does operate on audio in this case, but on a spectrogram representation, i.e. an image, not a sequence. See the input tensor dimensions (16, 512, 1024, 2): 16 batches, (512, 1024) height and width, and 2 channels.
From the image you showed, Apple didn't put in any real effort into RNNs with MPS.
So this experiment was not linked to RNNs in any way but only convolution ops.
I did misunderstand. I just looked at it running on audio. I don't really know my terminology when it comes to ML.
Even though it didn't run RNNs, that data highlights inefficiencies in Apple's implementation of CNNs. The reason I was concerned about RNNs is that they are small, which poses problems with the underutilization of GPU ALUs.
Other than that, the only real concern is that CoreML prefers to run them on the CPU, which is a bad sign for GPGPU. That may only be a problem for inference mode, where you aren't running samples in a batch simultaneously to maximize ALU utilization.
I assume you gave me the closest log you had to RNNs that you could find. Is there any other knowledge you have about the topic that might help me?
Hm, I think it could be interesting to have a dynamic device, at least on Apple silicon with its shared memory, where we would select the best device (CPU, GPU, NPU) automatically so that the best performance is achieved for a given model.
Though I guess that could be difficult.
The dynamic device is good, and I was thinking along those lines previously. However, the NPU is a big problem, and it won't work well. You can't access it quickly through any API—the best you can get is to kindly ask CoreML to use it, but you have to compile an MLModel first which has a massive overhead.
Also, the GPU outperforms the NPU on 16-bit (but not 8-bit) math on M1 Pro/Max. And, the NPU doesn't do 32-bit math.
I could hack into CoreML and bypass protections on MLModels, letting DL4S update weights without recompiling it. I would have to be careful about cache synchronization, and it is not a feature Apple would allow in an app that goes on the App Store. Although I've always wanted to try doing that.
However, the NPU is a big problem, and it won't work well. You can't access it quickly through any API
In the ML Compute framework, we can access the Apple Neural Engine (ANE) device (if available) and run the neural network on it directly. See MLCDevice. That's how Apple's TensorFlow fork runs on the GPU/NPU of M series Macs: it's built on top of ML Compute in the backend.
Maybe we can use ML Compute in the backend (instead of Metal Performance Shaders Graph) for GPU/NPU acceleration, but I think this will probably restrict DL4S to ML Compute's pre-defined neural layers and tensor transformation ops (I couldn't create new layers by inheriting from the MLCLayer class), whereas MPS Graph allows extending MPSGraph to add new ops built on top of its pre-defined ops.
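To illustrate that extension pattern, here is a speculative sketch of a new op (a tanh-approximated GELU) composed from pre-defined MPSGraph ops; the Swift method names assume the standard MPSGraph bridging on recent OS versions:

```swift
import MetalPerformanceShadersGraph

extension MPSGraph {
    // gelu(x) ~ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    func gelu(_ x: MPSGraphTensor, name: String? = nil) -> MPSGraphTensor {
        let half  = constant(0.5, dataType: .float32)
        let one   = constant(1.0, dataType: .float32)
        let c     = constant(0.7978845608, dataType: .float32)  // sqrt(2/pi)
        let coeff = constant(0.044715, dataType: .float32)

        let x3    = multiplication(multiplication(x, x, name: nil), x, name: nil)
        let inner = multiplication(c, addition(x, multiplication(coeff, x3, name: nil), name: nil), name: nil)
        let t     = tanh(with: inner, name: nil)
        return multiplication(multiplication(half, x, name: nil),
                              addition(one, t, name: nil), name: name)
    }
}
```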
No way! They just added that in iOS 15! That’s awesome!
However, the ANE’s inability to do 32-bit math means it can’t be used for general-purpose learning. It’s in MLCompute so the framework can be used for inference, as an easier alternative to CoreML.
How does Metal Performance Shaders use ANE? I don’t see that anywhere in the documentation.
How does Metal Performance Shaders use ANE? I don’t see that anywhere in the documentation.
No, ML Compute uses ANE.
Being able to retrieve data from a shared MTLBuffer (compared to private) has multiple benefits:
However, I also have counter-arguments to those reasons:
Shared memory is insanely slow on discrete GPUs, which poses a problem. I might optimize everything for Apple silicon, which means the code also runs well on integrated Intel GPUs. Since I use an Apple silicon Mac, it would be much easier to not worry about supporting discrete GPUs when creating the initial Metal support.
After Metal works for the first time, I would then add support for using private Metal buffers, with a separate execution path optimized for discrete GPUs. Once supported by DL4S, internal code would switch to using the discrete GPU by default. However, there would be a way to use the integrated one instead (e.g. modifying a static property on GPUDevice).
On Intel Macs, the integrated Intel GPU has its own private memory despite being on the same chip as the CPU. So, there should be an option to choose a shared or private MTLBuffer execution path there, with private being the default (once supported). That path should be available on Intel iGPUs, but not on Apple GPUs (DL4S will just ignore it if you select private).
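The two execution paths could look roughly like this (illustrative sketch, not DL4S code):

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let length = 1024 * MemoryLayout<Float>.stride

if device.hasUnifiedMemory {
    // Apple silicon / shared path: one allocation visible to both CPU and GPU.
    let shared = device.makeBuffer(length: length, options: .storageModeShared)!
    _ = shared.contents()  // the CPU can read and write this pointer directly
} else {
    // Discrete GPU path: keep the working copy GPU-private and blit data in.
    let staging = device.makeBuffer(length: length, options: .storageModeShared)!
    let privateBuffer = device.makeBuffer(length: length, options: .storageModePrivate)!
    let commandBuffer = queue.makeCommandBuffer()!
    let blit = commandBuffer.makeBlitCommandEncoder()!
    blit.copy(from: staging, sourceOffset: 0,
              to: privateBuffer, destinationOffset: 0, size: length)
    blit.endEncoding()
    commandBuffer.commit()
}
```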
In summary, the following will likely take place in the implementation of Metal into DL4S:
- Selectable device names, including iGPU or dGPU
What do you all think about this?
Small change to my above plans: you can select dGPU upon release. You might want to compare iGPU vs dGPU performance, or some ops might be mostly ALU intensive or have weights fall into L1/L2 cache on dGPU. However, iGPU will be selected by default since private MTLBuffer execution doesn't work yet.
Another change: I saw a potential use for having a buffer's contents as UnsafeMutableRawPointer: scanning a tensor to see whether any of its elements are NaN. Unless a buffer is extremely large, the cost of sending commands to the GPU outweighs the benefit of GPU acceleration.
The GPU type may be semi-dynamic in some situations, where basic operations like what I described above will clearly run faster on the CPU. The need to offload these kinds of operations from the GPU back to the CPU depends on where they are located in a graph. If they're sandwiched between two other GPU operations, keep them on the GPU.
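A toy version of that NaN check (assumes a shared-storage buffer holding Float elements):

```swift
import Metal

func containsNaN(_ buffer: MTLBuffer, count: Int) -> Bool {
    // Only valid for .storageModeShared (or .managed after synchronization).
    let pointer = buffer.contents().bindMemory(to: Float.self, capacity: count)
    return UnsafeBufferPointer(start: pointer, count: count).contains { $0.isNaN }
}
```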
In my personal fork of DL4S, I'm going to rely on ARHeadsetKit's utility functions for a while. I need to sort out some solution for importing them, whether that's copying and pasting, or separating them into another framework. Relying on it as-is would risk a bad commit to ARHeadsetKit ruining DL4S, and force upgrades to the following:
I have plenty of time to figure this out. Just don't be alarmed when you see that in the Swift package.
@palle-k if none of what I described sounds like an issue to you, then that may save me time when implementing Metal support.
@palle-k, you always put spaces around the range operator in your source code (e.g. 0 ..< 2). That isn't the convention practiced most often in Swift.
May I change all instances of this to 0..<2? If not, I'll use your convention of 0 ..< 2 in the code I write for DL4S. I'd like to have one common convention across the framework.
I'm going to start work on implementing Metal support now. It will take days to fully absorb your code, so you have plenty of time to respond. I expect my first task to be splitting up the Metal shaders into multiple files.
I will reformat the kernel parameters lists to match the style in ARHeadsetKit. The style is much cleaner and groups related parameters near each other. However, I may make the modification of putting parameter lists on the second line of the function declaration, reducing the horizontal size of source code files. Since I'm no longer working on ARHeadsetKit, I can modify my convention a little bit.
@philipturner I am not very attached to either formatting option. Also, I started this project but with more and more community contributions, I am very much open to decisions made by others.
I'll add Swiftlint to this project, so we'll have consistent code style across all contributions.
@palle-k What is your opinion on permanently importing ARHeadsetKit?
@philipturner I am not very attached to either formatting option. Also, I started this project but with more and more community contributions, I am very much open to decisions made by others.
I just modified the range literal style on my branch.
I'll add Swiftlint to this project, so we'll have consistent code style across all contributions.
Hopefully, we can get @ryendu contributing again. I tried reaching out to him, but I haven't gotten a response yet.
@philipturner what would your rationale be for importing ARHeadsetKit? I feel like it should be the other way around, because ARHeadsetKit focuses on a particular type of application, whereas DL4S should be as general purpose as possible.
I saw that ARHeadsetKit has some generally applicable extensions to Metal, which can be useful for DL4S. Wouldn't the best way be to factor this core out of ARHeadsetKit into a separate library and then make both ARHeadsetKit and DL4S depend on it?
Also, any opinions on creating a DL4S organization here on GitHub?
Hey there @palle-k, I really love this project and I think it is really awesome! I'd really like to help out and contribute, but DL4S seems pretty complete to me. Is there anything that I could help with? I'm really familiar with Swift, and I'm a bit new to open source.