ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Why not implement this in Pytorch? #12

Open ToniRV opened 11 months ago

awni commented 11 months ago

This is a question we get asked a lot. Before I get started on why, let me say we like PyTorch a lot, we use it often, and much of our higher-level neural net library was inspired by it.

Having said that, here are a few reasons we decided to do something new.

Apple silicon first

First and foremost, we wanted to design MLX for Apple silicon. This means, among other things, taking advantage of unified memory. When you make an array in MLX you do not have to specify what device it resides on, you just make it. Operations run on devices without needing to copy arrays. For example:

import mlx.core as mx

a = mx.random.normal((100,))
b = mx.random.normal((100,))
mx.add(a, b, stream=mx.cpu)  # runs on the CPU
mx.add(a, b, stream=mx.gpu)  # runs on the GPU, no copy of a or b

There are no copies of either a or b even though the operations are running on different devices. Furthermore, MLX will run those operations in parallel since they can be run on different devices, and there are no dependencies between them. This seemingly small change to the programming model would be very difficult to implement in PyTorch. And we hope that it will open the door to new and interesting algorithms (for machine learning or otherwise).

Alternative Design and API

We do some things differently than PyTorch. For example, our array API mirrors NumPy, which we think is important. Our function transformations are composable, like in JAX. Computations are lazy, which can be really useful. But graphs are still built dynamically, which makes debugging easy and changing the sizes of array arguments to functions trivial.
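
To make that concrete, here is a minimal sketch (the toy loss and shapes are illustrative) of composing a function transformation over a lazy computation:

import mlx.core as mx

def loss(w, x):
    # A toy scalar loss so we can take its gradient
    return mx.mean((x @ w) ** 2)

# Transformations compose like in JAX: grad of a function is itself a function
dloss = mx.grad(loss)

w = mx.random.normal((8, 1))
x = mx.random.normal((16, 8))

g = dloss(w, x)  # lazily builds the graph; nothing is computed yet
mx.eval(g)       # forces evaluation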

These are all features that we wanted to include. We had the luxury of picking the best from all the frameworks we've used and worked on in the past and combining them into something new.

Simple, Flexible, Baggage-Free

Like everything else in this world, machine learning frameworks obey the second law of thermodynamics — they have a tendency towards increasing entropy. Since AI is still changing rapidly, as frameworks age they get more complex and are harder to improve and adapt. We believe that a new framework like ours, which is still (relatively) pretty simple and hackable, might be a good starting point for people looking to explore new ideas.

For example, tracing through the MLX stack to find or change how things work is doable, even for researchers without extensive experience in framework engineering. Similarly, you can easily add custom Metal (GPU) kernels to try out new ideas or optimizations.
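
As an illustration, here is roughly what a custom elementwise kernel looks like with mx.fast.metal_kernel (a rough sketch; the exact signature may differ, so check the current docs):

import mlx.core as mx

def exp_elementwise(a):
    # The kernel body is written directly in Metal Shading Language.
    source = """
        uint elem = thread_position_in_grid.x;
        T tmp = inp[elem];
        out[elem] = metal::exp(tmp);
    """
    kernel = mx.fast.metal_kernel(
        name="my_exp",
        input_names=["inp"],
        output_names=["out"],
        source=source,
    )
    outputs = kernel(
        inputs=[a],
        template=[("T", mx.float32)],
        grid=(a.size, 1, 1),
        threadgroup=(256, 1, 1),
        output_shapes=[a.shape],
        output_dtypes=[a.dtype],
    )
    return outputs[0]

b = exp_elementwise(mx.random.normal((1024,)))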

More Exploration, More Diversity

Historically the success of different machine learning paradigms has been pretty tied to the available hardware. It's hard to argue with the fact that the success of deep learning depended a lot on its coevolution with GPUs. Apple silicon has some new and interesting characteristics that may benefit new algorithms. But we need frameworks, i.e. MLX, that let us explore that. In general we think that more diversity for ML researchers both on the hardware side and the supporting software is always a good thing.

Oh, and also @andresy really likes writing array frameworks 😉

holycrypto commented 11 months ago

This is a good start for Mac Studio users!

chumingqian commented 11 months ago

No offense meant, but most of the deep learning enthusiasts around me really love using PyTorch on macOS. So, here's the situation:

1. When we need to quickly verify our deep learning algorithms, we debug locally on macOS in a PyTorch environment and use SSH to connect to a remote server that also has PyTorch set up. Once things work, we use the server's GPUs for training.

2. Currently, most projects, including the top open-source ones, are based on the PyTorch framework. Personally, I really don't want to rewrite projects that have already been open-sourced using your framework.

I might understand your approach, but I sincerely hope your team can consider the perspective of developers who use macOS and are passionate about deep learning.

dougdew64 commented 11 months ago

I hope that this repo will gain traction among developers who are focused on device-based inference apps.

As such a developer, I'm curious to know how this repo is supposed to relate to the CoreML stack.

joluyckx commented 11 months ago

I think there's certainly space for something like this. Some people (including me) have realised that Apple Silicon is, today, the most affordable option for getting a proper amount of memory for machine learning. Compare a maxed-out Mac Studio M2 Ultra with 192GB of RAM at around 8K with an Nvidia A100 40GB at 20-25K; you would need 4 or 5 of those to match the amount of RAM. That's 8K for Apple compared to 100K with Nvidia to reach a similar amount of RAM. I know computation also matters and is somewhat slower on the Mac GPU, but considering the cost difference that's easily compensated for by, e.g., buying two and distributing the training. And in my use cases, memory is quite a bottleneck (vision transformers).

So far I've successfully used this approach of Apple-based machine learning flows with tensorflow-mac. However, an Apple-Silicon-native framework that optimises the use of unified memory is awesome. I also like the simplicity and compactness of the framework.

I do hope this won't reduce the effort Apple spends on tf-mac and PyTorch MPS compatibility, but I think there's no reason all of this cannot be combined. Great effort, looking forward to how this evolves!

chumingqian commented 11 months ago

Hello, friend:

I'm glad you've realized this. Apple indeed has the advantage of unified memory. This is a crucial foundation for diversifying the entire field of deep learning. When I first read the approach you presented, it excited me. However, the more I think about it, the more something seems amiss.

Let's put aside the field of deep learning and delve into another topic. Apple, as a company, has certainly diversified the entire IT world. They've created one of the three major operating systems: Windows, Linux, and macOS.

The Windows operating system supports both Intel and AMD chips. While profiting itself, Microsoft has shared some of those profits with other companies, allowing them to gain as well. At the same time, software like PyTorch and TensorFlow supports this kind of hardware.

Linux is open-source and fully supported by PyTorch and TensorFlow.

macOS: When using Intel chips, PyTorch and TensorFlow are supported. After discontinuing Intel chips and embarking on their own chip journey, they've laid the groundwork for diversifying deep learning, from software to hardware, utilizing their in-house products.

I'm eagerly anticipating how MLX will perform on the Mac Studio M3. Have you noticed that despite having 36GB of RAM, the highest supported ROM on the Mac M3 Pro is only 512GB? If you take a moment to review historical events, monopolies often result in outcomes that aren't too consumer-friendly.

If they truly aim to diversify deep learning, simply being compatible with PyTorch would suffice.

Sometimes, it's not just about what's said; it's equally important to observe what actions are taken.

joluyckx commented 11 months ago

That's an interesting perspective, but I respectfully disagree with it :).

In the machine learning space, Apple is David and Nvidia is Goliath. Their objective is not necessarily to "diversify the deep learning space". Their objective could be to become an important player in this space and deliver great value; only by doing that, by the way, will they be able to reach that objective. Nvidia has the near-monopoly in this space, not Apple.

Based on that, continue your thinking: how can Apple be a strong player in this space? Using their traditional model: full vertical integration! Hardware and software optimised hand in hand to create one awesome user experience. That's what MLX is for machine learning. My point on unified memory is specifically about the fact that PyTorch was conceived from the ground up with separate memory spaces and copies all around. Maybe unified memory can be "forced" into it, but it would probably be a suboptimal experience overall. In the MLX world, things around unified memory are trivially simple, as I noticed when trying it out.

I run a machine learning team and program in my organisation, and the decision to go with Apple Silicon has saved me about 200K over two years (2023 and 2024 projected); over the following two years I expect to save 500K. And this is for a relatively small team and machine learning setup. Models are being trained and run; I'm getting the value now.

Again, respectfully, I think if you look deeper than the surface at some of the things you mention, you'll see they turn out to be not as dire as you say. For example, the M3 Pro having 128GB of RAM (you said 512GB of ROM, but I suppose you mistyped?). In any case, that's normal because of two factors: the memory bandwidth and speed in Apple Silicon are (close to) GPU grade, which is the only way unified memory can work, and this memory needs a large bus, controller, and so on. Depending on the M3 model, this bus is smaller or bigger. Looking at the M2: the M2 Pro has a 96GB limit, the M2 Ultra 192GB (which is what we have). That's not a coincidence: the M2 Ultra is basically two M2s combined. So the limits stem from the nature of the chips' specifications; they are inherent rather than artificial.

In the end, as I mentioned, the cost is an order of magnitude lower these days for Apple Silicon compared to Nvidia; of course, for this application, taking the heaviest Apple Silicon available is a must. And for the M3, of course, the M3 Ultra still has to come out. By the way, in the Mac Studio example there's also a very fast, well-integrated 8TB local SSD option which, when I compared it with cloud computing and Azure SSD costs, is again an order of magnitude faster and an order of magnitude cheaper as well!

Additionally, as someone else remarked, having MLX as an open-source library will allow the PyTorch community to more easily connect it to Metal.

So, rather than hypothesising about how Apple could go from a footnote in the machine learning market to a monopoly (that would be quite a turnaround) and practise unfair behaviour in that market: for now they are an up-and-comer in the space, they need to prove a lot, and they will only grow if they provide value.

For me, on the hardware level, they've made an almost unbelievable advance in value for money. And, true to their strategy, if they can combine this with a nicely integrated software stack that offers simplicity and synergy, they could become a great player in that market. It will also depend on whether they decide to start offering servers, dedicated ML products, etc.; that remains to be seen.

From an overall market-influence perspective, I think you can bank on one impact before everything else: if Apple goes from "negligible" to being "a small player" in the deep learning space, Nvidia will be forced to adjust their pricing. This is probably the first tangible market impact there would be. Nvidia's prices are ridiculous: their consumer GPUs are artificially limited in memory, and their server GPUs have more memory (still not very impressive) but are priced beyond any reasonable amount (25K for a 40GB card?), not to mention they're hardly purchasable anywhere. This Nvidia near-monopoly exists because of their great CUDA and overall software stack.

So, in summary: this will only improve things. You can only judge by what you see, and what you see is Apple trying to deliver more value in this space; I don't see how that can be a bad thing :).

Now I'll go back to porting my latest model to MLX. The first results are awesome: it works smoothly, it's super easy to translate, and there's a nice data pipeline as well... this is just a pleasure to use!

chumingqian commented 11 months ago

I have to say, Apple is really lucky to have users like you. :)

OLH21 commented 11 months ago

Thanks all for this; you guys truly gave Apple a new opportunity. And thanks for all the good arguments. I'm happy to have made the good choice of buying an Apple silicon machine even before this was out. And god knows I had plenty of reasons not to trust Apple anymore!

wisefool769 commented 11 months ago

No one uses macOS or Apple Silicon on the server, so it seems the envisioned future of this library is developing apps that run natively on macOS. It's a bit of a shame, because it will be much harder to use this software on other architectures, which reduces the potential for libraries built on top of MLX, and any software I write with it won't be portable. It would be wasteful to require each macOS app to bundle its own LLM, so will Apple just bundle an OSS LLM like Mistral into future OS releases? Alternatively, will MLX start to support other architectures as well and directly compete with PyTorch?

In the current state, I imagine the main upside is

easp commented 11 months ago

@chumingqian, it's not luck. For all their many shortcomings, Apple has long offered users things that weren't available elsewhere. People who get caught up in specs tend to be blind to it and put it down to "marketing."

[...] so it seems like the envisioned future of this library is to develop apps that run natively on MacOS. It's a bit of a shame, because it will be much harder to use this software on other architectures.

@wisefool769 There are multiple options if you want software that runs on other architectures. What would be the point in another framework that runs on other architectures when that limits what's possible (in terms of some combination of performance, memory efficiency and development velocity) on Apple's unified memory architecture?

I hope Apple invests in getting MPS support in PyTorch to the point where GPU accelerated LLM training with existing PyTorch toolkits requires no extra effort on the part of end users. I also hope that between internal and external developers, MLX gets to the point where people think twice before grabbing something from the PyTorch toolbox.

@awni It's worth giving your MLX "manifesto" more prominence, somewhere.

rickypang0219 commented 11 months ago

I hold the same question in my mind: why not combine with Torch? I think one thing that will hold back MLX development is Apple silicon itself, which is also its target audience. As a Kaggler, I sometimes debug code locally and then run the result on a GPU provided by Kaggle. Using PyTorch, I can easily switch between MPS and CUDA with torch.device('mps' if torch.backends.mps.is_available() else 'cuda'), with no need to modify the code. Using MLX, however, I may need to rewrite the code, which is less efficient. Still, I love this package very much, since it is the one that unleashes the full power of Apple silicon and unified memory for deep learning.
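
For reference, a minimal sketch of that device-selection pattern (the fallback order is just illustrative):

import torch

# Prefer MPS on Apple silicon, fall back to CUDA on a remote/Kaggle GPU,
# and to the CPU otherwise; the rest of the code stays device-agnostic.
device = torch.device(
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)

model = torch.nn.Linear(16, 1).to(device)
x = torch.randn(4, 16, device=device)
y = model(x)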

SamuelMarks commented 11 months ago

Related: https://github.com/keras-team/keras/issues/18961

FYI: The new Keras allows you to mix-and-match PyTorch, JAX, and TensorFlow.
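
For example, a minimal sketch of picking the backend in Keras 3 (the environment variable must be set before importing keras):

import os

# Choose "tensorflow", "jax", or "torch" before the keras import
os.environ["KERAS_BACKEND"] = "torch"

import keras

model = keras.Sequential([keras.layers.Dense(1)])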

wisefool769 commented 11 months ago

What would be the point in another framework that runs on other architectures when that limits what's possible

I would love to better understand how PyTorch is limiting, beyond a greater level of bureaucracy from committing to an established project.

OLH21 commented 11 months ago

What would be the point in another framework that runs on other architectures when that limits what's possible

I would love to better understand how PyTorch is limiting, beyond a greater level of bureaucracy from committing to an established project.

Architecture-wise, PyTorch cannot deal with unified memory easily, so it's not the best option at all for the Apple silicon architecture. Moreover, you can still use PyTorch and MLC if you like. Nvidia did exactly the same thing with CUDA, and now nobody complains. Unified memory is the way to get cheaper AI computing, and in ten years everyone will use it the way everyone uses CUDA today. Maintaining parallel development is the standard as technologies evolve. Breaks are necessary to reach new levels, or we would still be on 32-bit x86.

0x1orz commented 11 months ago

As MLX is designed for lazy computation and unified memory, I guess it should have some relationship with the "LLM in a flash" paper.

edmondja commented 11 months ago

Maybe my question is stupid, but what if, instead of making MLX compatible with PyTorch, we tried to make CUDA compatible with a simplified version of MLX? That way we could use almost the same code everywhere.

awni commented 11 months ago

what if instead of making MLX compatible with PyTorch we tried to make cuda compatible with a simplified version of MLX

Interesting question. Most of the Metal backend is abstracted nicely, so you can currently compile MLX in CPU-only mode. That also means adding another backend is very possible. With CUDA, the main issue I see is how you deal with the unified memory programming model. We don't want to change the API, so what happens in the following:

import mlx.core as mx
a = mx.random.normal((4, 4))
b = mx.random.normal((4, 4))
c = mx.matmul(a, b, stream=mx.gpu)  # the CUDA GPU in this case
# c is in GPU RAM at this point
d = mx.sin(c, stream=mx.cpu)  # Does this silently copy c?

I'm not sure we have any other option except an implicit copy, as adding operations that move arrays between memory spaces to the core API is probably not a good idea.

Crear12 commented 10 months ago

It's necessary for Apple to have its own easy-to-use framework for Apple Silicon, because all third-party frameworks have to consider compatibility with other platforms, and most of them prioritize development and optimization for those other platforms.

0x1orz commented 10 months ago

An alternative, lightweight counterpart to openai/triton or CUDA is also needed on Apple silicon to support related computational fields such as bioinformatics and computational chemistry.

ramkumarkoppu commented 10 months ago

Is it possible to convert MLX-trained models to other frameworks for inference, such as TFLite, PyTorch Mobile, ONNX, or Core ML for iOS? How easy is it to convert models from MLX to PyTorch or TensorFlow?

SamuelMarks commented 10 months ago

@ramkumarkoppu It's not in https://github.com/onnx/onnxmltools, but there is an open issue here https://github.com/ml-explore/mlx/issues/290 requesting this feature

rupurt commented 10 months ago

Ivy (which is like Babel from the Node.js ecosystem) is also working on adding MLX as a supported framework: https://github.com/unifyai/ivy/issues/27458. Sometime in 2024 it should be pretty easy to transpile models across all major frameworks.

johnnynunez commented 10 months ago

AMD and Nvidia are also working on shared memory, which is different from unified memory: it means using your system RAM as GPU memory, since recent Linux kernels support HMM (https://www.kernel.org/doc/html/v5.0/vm/hmm.html). AMD also has some new APUs, and Nvidia, with Nvidia Grace, has the same idea as Apple: unified memory across the whole system.

It would be amazing if there were a decorator that converts PyTorch code to MLX.

lucascolley commented 10 months ago

adding operations to move arrays between memory spaces to the core API is probably not a good idea

FWIW, the array API standard has to_device as a method of the array object.
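
For illustration, a small sketch of the explicit-transfer style the standard defines (illustrative only; MLX does not currently expose such a method):

def move_to(x, device):
    # The array API standard spells this x.to_device(device): a copy when the
    # devices differ, and typically a no-op when they already match.
    return x.to_device(device)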

dougdew64 commented 10 months ago

"I thought that the unified memory system on the M-series chips would reduce copying overheads. Perhaps this is not yet the case from a software perspective (e.g. PyTorch and TensorFlow are not designed for Apple Silicon).

Maybe newer frameworks designed for Apple Silicon such as MLX will better utilise the unified memory system. This will require further investigation."

-- from the article

https://towardsdatascience.com/apple-m3-machine-learning-speed-test-f1346e23a1b2

arnavmehta7 commented 7 months ago

Is it possible to use MLX-compiled/trained models with other frameworks like Torch? I.e., can I run an MLX model on CUDA afterwards?

Ang-Wei-Liang commented 5 months ago

Speaking of which, it would be very convenient if there were a PyTorch-to-MLX code converter, similar to the converters that exist for language-to-language translation.

jselvam11 commented 2 months ago

Lowering Triton to a Metal backend?