pyodide / pyodide

Pyodide is a Python distribution for the browser and Node.js based on WebAssembly
https://pyodide.org/en/stable/
Mozilla Public License 2.0

Better integration with conda/conda-forge for building packages #795

Open rth opened 3 years ago

rth commented 3 years ago

The idea to rely on conda-forge for building Python packages to WebAssembly has been mentioned for a while now (https://github.com/iodide-project/pyodide/issues/38#issuecomment-410270067, https://github.com/conda/conda/issues/7619, https://github.com/regro/cf-scripts/issues/1052#issuecomment-651995625), and in this issue I wanted to start a discussion about the current situation and the existing challenges in moving in that direction from the perspective of pyodide, as I understand it (please correct me if needed).

First, the main motivation is that the present way of building all the packages in one repo is not sustainable as the number of packages, and with it the CI time, keeps growing. To resolve this we would need significant development resources, which we don't have. Even if we did, it would amount to rebuilding many things (including a community) that already exist and work great at conda-forge, which wouldn't make sense.

Now as to challenges (it's a long post),

1. Updating emscripten

With a single repo it's relatively fast to rebuild all packages with a different version of the emsdk toolchain (emscripten, binaryen, ..) or different options. We currently still have a couple of patches applied to emscripten, and ideally we also need to update emscripten frequently to benefit from improvements and fixes (we are currently 1.5 years behind the latest release, unfortunately). In conda-forge, rebuilding all the packages with a new compiler would take longer (though this got better recently https://github.com/regro/cf-scripts/issues/1052#issuecomment-652003811). Also, the use case where a) we update the emscripten version, b) some package fails to build, and c) we have to go back and change some global emscripten settings would be really impractical, I think.

This would hopefully become less of an issue with time as emscripten becomes more and more stable, but it's still an issue now (see e.g. https://github.com/iodide-project/pyodide/pull/480#issuecomment-635823062)

2. Build approach

The cross-compilation of scientific Python packages (based on distutils) is difficult (https://github.com/scipy/scipy/issues/8571, https://github.com/numpy/numpy/issues/17620) as far as I understand, even on Linux between different architectures.

I'm not sure if this was the reason, but pyodide doesn't do cross-compilation in the classical sense. Instead it compiles the package with the host compilers, stores a log of all executed compilation commands, and re-runs those commands with the emscripten compiler.
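To make the "log and replay" idea concrete, here is a minimal sketch. It is not pyodide's actual build code: the `TOOL_MAP` table, the `rewrite_command` helper, and the sample log lines are all made up for illustration, and a real build would also need to filter or translate individual flags.

```python
import shlex
from typing import List, Optional

# Hypothetical mapping from host tools to emscripten counterparts.
TOOL_MAP = {"gcc": "emcc", "cc": "emcc", "g++": "em++", "c++": "em++", "ar": "emar"}


def rewrite_command(logged: str) -> Optional[List[str]]:
    """Return the emscripten version of one logged command, or None to skip it."""
    args = shlex.split(logged)
    if not args:
        return None
    tool = args[0].rsplit("/", 1)[-1]  # drop any directory prefix
    if tool not in TOOL_MAP:
        return None  # not a compiler or archiver call, nothing to replay
    return [TOOL_MAP[tool]] + args[1:]


# Made-up log of commands captured during a native build.
log = [
    "gcc -O2 -fPIC -c src/module.c -o build/module.o",
    "mkdir -p build",
    "/usr/bin/g++ -shared build/module.o -o build/module.so",
]

replayed = [cmd for cmd in map(rewrite_command, log) if cmd is not None]
for cmd in replayed:
    # A real build would pass cmd to subprocess.run(); here we only print it.
    print(shlex.join(cmd))
```

The appeal of this approach is that the package's own build system still runs unmodified against the host toolchain, so nothing in it needs to understand cross-compilation.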

3. Shared package specifications

Package specifications were chosen to be as close as possible to the meta.yaml in conda, and hopefully soon the package index will also use the same format (https://github.com/iodide-project/pyodide/issues/791)

4. Artifacts format

Currently each package consists of 2 separate (.data, .js) files which we distribute via jsDelivr. Those would probably not fit as conda artifacts, which would mean that we likely need to handle some of this in any case.

5. Dependency resolution in the browser

There are two use cases for pyodide,

  1. interactive (notebooks, etc) where having a dependency resolver in the browser (e.g. mamba) would be great.
  2. python applications, where dependencies are known in advance, and we certainly don't want to do dependency resolution at each page load. There having a precomputed list of packages is likely the way to go.

Either way we also need to install pure Python wheels (from PyPI or another custom location), so we still have this duality between conda/pyodide packages and Python wheels, meaning we have to maintain a minimalistic pip (micropip) in pyodide.

I haven't followed WebAssembly-related developments at conda-forge closely, so maybe I am missing something.

cc @wolfv @jakirkham

isuruf commented 3 years ago

We have cross compilation support in conda-forge. We cross compiled numpy, scipy, matplotlib for the osx-arm64 platform. The first thing to do would be to add a platform to conda and conda-build, something like wasm-32.

jakirkham commented 3 years ago

Thanks for writing this up Roman! 😄

This sounds like a good summary to me. I think going from 3 to 4 has been the step I've always been pretty fuzzy on (though I'm guessing you know this pretty well 😉). How does one turn a Conda package tarball into something that pyodide can use? Where do these get hosted?

As for Isuru's point about adding a platform in Conda, I think this was the idea behind issue ( https://github.com/conda/conda/issues/7619 ), but I may be missing things here.

One other thing worth discussing is that over time LLVM's ability to build WASM binaries has grown. Does it make sense to start using that or is Emscripten still needed for some things? The one thing that I recall used to be a stumbling block was a libc implementation for WASM. Though it looks like this has since been solved? In which case maybe building this libc binary gives us enough to get started?

rth commented 3 years ago

Thanks for the feedback @isuruf @jakirkham ! Great to know that cross compilation works in conda-forge.

I think going from 3 to 4 has been the step I've always been pretty fuzzy on

As far as I understand those (.data, .js) are files to unpack in the virtual filesystem and associated unpacking instructions (cf emscripten docs). We could potentially repackage them into a single file during build then separate them on the client side.
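As a rough illustration of "repackage into a single file, then separate on the client side", the sketch below uses a made-up container layout (an 8-byte length prefix followed by the two payloads). Neither pyodide nor emscripten defines this format; the `pack` and `unpack` names are hypothetical.

```python
import struct
from typing import Tuple

HEADER = struct.Struct("<Q")  # little-endian 64-bit length of the .js part


def pack(js: bytes, data: bytes) -> bytes:
    """Concatenate the .js loader and the .data payload into one blob."""
    return HEADER.pack(len(js)) + js + data


def unpack(blob: bytes) -> Tuple[bytes, bytes]:
    """Split a blob produced by pack() back into (js, data)."""
    (js_len,) = HEADER.unpack_from(blob, 0)
    start = HEADER.size
    return blob[start:start + js_len], blob[start + js_len:]


blob = pack(b"// loader script", b"\x00\x01binary filesystem image")
js_part, data_part = unpack(blob)
print(len(blob), len(js_part), len(data_part))
```

A single blob means one HTTP request per package instead of two, at the cost of a small unpacking step before the loader script can run.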

The question for me is more whether conda artifacts are really the best format for distributing such files (they were not optimized for this use case). Maybe there are some other approaches in the JS/WASM ecosystem that would be better. I haven't studied the question in detail so far.

At present, we put the artifacts in an S3 bucket under versioned paths and proxy them through the jsDelivr CDN, which allows for a lot of flexibility.

One other thing worth discussing is over time LLVM's ability to build WASM binaries has grown. Does it make sense to start using that or is Emscripten still needed for some things?

Absolutely, it would be good to revisit if the situation evolved much since the discussion in https://github.com/conda/conda/issues/7619

The one thing that I recall use to be a stumbling block was a libc implementation for WASM. Though it looks like this has since been solved?

The link you provide is for WASI or outside of the browser, as far as I understand though? Within the browser, we still need an in-memory filesystem (currently provided by emscripten) as CPython wouldn't be very useful without it. For instance Rust can be easily compiled to WASM (without emscripten), but then you cannot do filesystem I/O at least as indicated in the reference materials. Maybe there are indeed lighter projects providing the necessary abstractions, it could be worth checking.

wolfv commented 3 years ago

I have just updated binaryen (https://github.com/conda-forge/binaryen-feedstock/pull/38) and added an Emscripten recipe to conda-forge (https://github.com/conda-forge/staged-recipes/pull/13178). With those two things we could be ready to explore these ideas.

I'd be happy to integrate cross-compilation to javascript "natively" into boa so that it becomes very straight-forward ... although I am not sure how simple that will be.

I think conda is more than the package format, and that should not be the thing holding us up -- we can easily create a new, more web-optimized package format (different or no compression etc.) if we have to. What I think is great is the community, the available packages, automation, and the build infrastructure by conda-forge etc.

Regarding libc, I was under the impression that emscripten did some magic to get me a libc :) in general, emscripten seemed to work quite smoothly.

Personally, I'd be quite interested in getting the packages in boa-forge to compile to see if we can bootstrap a wasm-micromamba.

rth commented 3 years ago

I have just updated binaryen (conda-forge/binaryen-feedstock#38) and added an Emscripten recipe to conda-forge

If you don't use emsdk you might need to package it manually which doesn't sound that simple (but they did it in Homebrew)

wolfv commented 3 years ago

It's not very difficult to package emscripten - I have a working recipe that I have been using to build wasm conda packages. It's surprisingly simple, and I already got working libraries for a bunch of compression algorithms that are of interest to us, as well as the first half of libsolv (I managed to run it to parse conda repodata.json into a solv file).

So it all sounds quite doable (including running a micromamba in the browser). For that we'd need to rewrite mamba a bit though, to use the emscripten Fetch and FileSystem APIs instead of curl. We could give it a shot though. Alternatively we could figure out how to make a mamba-derivative that one can feed repodata.json through javascript and that spits out the packages to download, and then unpacks that using a different entrypoint or something like that...

@rth you might know better how to do this!
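For readers unfamiliar with repodata.json, here is a toy sketch of the "feed it repodata, get back the packages to download" idea. The `REPODATA` content is invented, `files_to_download` is a hypothetical helper, and the logic deliberately ignores version constraints and build strings, which a real solver such as libsolv would handle.

```python
import json

# Tiny, made-up stand-in for a channel's repodata.json.
REPODATA = json.loads("""
{
  "packages": {
    "numpy-1.0-0.tar.bz2": {"name": "numpy", "version": "1.0", "depends": ["python"]},
    "python-3.9-0.tar.bz2": {"name": "python", "version": "3.9", "depends": []},
    "scipy-1.5-0.tar.bz2": {"name": "scipy", "version": "1.5",
                            "depends": ["numpy >=1.0", "python"]}
  }
}
""")


def files_to_download(repodata, requested):
    """Return artifact file names for the requested packages, dependencies first."""
    # Assumes one artifact per package name; a real index has many builds each.
    by_name = {rec["name"]: fn for fn, rec in repodata["packages"].items()}
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        filename = by_name[name]
        for spec in repodata["packages"][filename]["depends"]:
            visit(spec.split()[0])  # keep the name, drop the version constraint
        ordered.append(filename)

    for name in requested:
        visit(name)
    return ordered


print(files_to_download(REPODATA, ["scipy"]))
```

The output is a flat, ordered list of file names, which is also the shape a precomputed package list for an application would take.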

We could also make sure to only support one of the two (three, actually) compression algorithms that conda is using currently (e.g. allow only .tar.bz2 packages?!). I have a small issue compiling zstd, it tells me that wasm-ld: error: initial memory too small, 38821440 bytes needed (any ideas, @rth ?)

I can upload the recipes I have soon. Would be cool to have a collaboration for this!

rth commented 3 years ago

Sorry for slow response @wolfv

It's not very difficult to package emscripten - I have a working recipe that I have been using to build wasm conda packages.

That's great!

For that we'd need to rewrite mamba a bit though, to use the emscripten Fetch and FileSystem APIs instead of curl.

Indeed. In pyodide, we just mount the filesystem and then it can be interacted with directly from Python. However, for interacting with remote URLs one indeed needs to rewrite everything with Web APIs (e.g. pyodide.open_url)

I have a small issue compiling zstd, it tells me that wasm-ld: error: initial memory too small, 38821440 bytes needed

Likely https://github.com/emscripten-core/emscripten/blob/1216d230eac6a335f1397f4ab1d2bf297113633b/src/settings.js#L152 needs to be increased with an env variable.
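If the linked setting is the initial memory size (an assumption based on the error message, not something stated above), then the new value has to cover the 38821440 bytes reported by wasm-ld. WebAssembly memory is allocated in 64 KiB pages, so a small sketch of rounding the requirement up to a whole number of pages looks like this; `round_up_to_pages` is a hypothetical helper.

```python
WASM_PAGE_SIZE = 64 * 1024  # WebAssembly linear memory grows in 64 KiB pages


def round_up_to_pages(n_bytes: int) -> int:
    """Smallest multiple of the wasm page size that is >= n_bytes."""
    pages = -(-n_bytes // WASM_PAGE_SIZE)  # ceiling division
    return pages * WASM_PAGE_SIZE


needed = 38821440  # value reported in the wasm-ld error above
minimum = round_up_to_pages(needed)
print(minimum, minimum // WASM_PAGE_SIZE)
```

In practice one would probably pick a rounder value with some headroom rather than the exact minimum.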

Would be cool to have a collaboration for this!

Yes, that would be great. For pyodide that would mean incrementally getting closer to the conda-forge approach and possibly starting to use some of the tooling.

One thing where I would be interested in your feedback: assuming python packages are built with emscripten on conda-forge, is the current way of detecting it at build time in setup.py (currently via the PYODIDE_PACKAGE_ABI env variable) appropriate, or would you propose some other solution? It would be good to agree on that before that variable ends up in too many upstream projects. Or do you think that just an env variable for cross-compilation (e.g. as done by @isuruf in https://github.com/scikit-learn/scikit-learn/pull/18884) would be enough?
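For context, this is roughly what such a build-time check looks like inside a setup.py. Only the PYODIDE_PACKAGE_ABI variable name comes from the discussion above; the helper names and the OpenMP example are illustrative assumptions, not code from any particular project.

```python
import os


def building_for_pyodide(environ=None) -> bool:
    """True when the build is driven by pyodide's tooling (env-variable check)."""
    environ = os.environ if environ is None else environ
    return "PYODIDE_PACKAGE_ABI" in environ


def extra_compile_args(environ=None):
    """Illustrative only: drop a flag that a project may not want under emscripten."""
    if building_for_pyodide(environ):
        return []
    return ["-fopenmp"]


print(extra_compile_args({"PYODIDE_PACKAGE_ABI": "1"}))
print(extra_compile_args({}))
```

The concern raised above is visible here: every upstream project that adds such a check hard-codes the variable name, so renaming it later means touching all of them.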

wolfv commented 3 years ago

Are you interested in helping me to bootstrap some "emscripten" enabled recipes? I can create a repo, and we could run an Azure pipeline to build a couple of wasm compiled conda packages.

rth commented 3 years ago

@wolfv yes, we could try to experiment there.

wolfv commented 3 years ago

Ok, I'll have to get emscripten on conda-forge first, then I'll set it up.

rth commented 1 year ago

So overall emscripten-forge went down this road, while we are more focused on producing wheels with PyPA / cibuildwheel tooling. But we can certainly open more specific issues about ways to share some of the tools or approaches.

westurner commented 1 year ago

Here are the current docs. Should they mention MambaLite – which installs packages from emscripten-forge instead of conda-forge, which doesn't host WASM packages – as a third-party tool https://pyodide.org/en/stable/usage/loading-packages.html :

Loading packages

Only the Python standard library is available after importing Pyodide. To use other packages, you’ll need to load them using either:

  • pyodide.loadPackage for packages built with Pyodide, or

  • micropip.install for pure Python packages with wheels available on PyPI or from other URLs.

Note: micropip can also be used to load packages built in Pyodide (in which case it relies on pyodide.loadPackage).

If you use pyodide.loadPackagesFromImports Pyodide will automatically download all packages that the code snippet imports. This is particularly useful for making a repl since users might import unexpected packages. At present, loadPackagesFromImports will not download packages from PyPI, it will only download packages included in the Pyodide distribution. See Packages built in Pyodide to check the full list of packages included in Pyodide.
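The import-detection step described in that last paragraph can be approximated with the standard library. This is a sketch of the general technique, not Pyodide's own implementation; `find_top_level_imports` and the `AVAILABLE` set are made up for illustration.

```python
import ast


def find_top_level_imports(source: str):
    """Return the sorted top-level module names imported by a code snippet."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports, which cannot name a distribution package.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return sorted(names)


# Hypothetical set of packages shipped with the distribution.
AVAILABLE = {"numpy", "scipy", "pandas"}

snippet = "import numpy as np\nfrom scipy.sparse import csr_matrix\nimport os.path\n"
imports = find_top_level_imports(snippet)
to_load = [name for name in imports if name in AVAILABLE]
print(imports, to_load)
```

Filtering against a known package list mirrors the documented limitation: anything not in the distribution is simply not downloaded.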

"Mamba meets JupyterLite" (2022-07) https://blog.jupyter.org/mamba-meets-jupyterlite-88ef49ac4dc8

  • Binderlite: A JupyterLite / emscripten-forge powered version of Binder

From https://github.com/emscripten-forge/empack :

Tools to pack a conda / mamba environment into a JS & WASM bundle

empack pack env --env-prefix /path/to/env \
  --outname python_data  \
  --config /path/to/config.yaml

This will generate two files python_data.js and python_data.data that you can use in the browser. A sample config is located in tests/empack_test_config.yaml

@DerThorsten But where is the source and recipe/feedstock for MambaLite?

DerThorsten commented 1 year ago

picomamba /mamba-lite is here: https://github.com/mamba-org/picomamba But this is not yet working since the conda-noarch pkgs are missing some CORS headers to download them in the browser.

westurner commented 1 year ago

FROM https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/displaying-a-sponsor-button-in-your-repository#displaying-a-sponsor-button-in-your-repository :

Here's an example FUNDING.yml file:

github: [octocat, surftocat]
patreon: octocat
tidelift: npm/octo-package
custom: ["https://www.paypal.me/octocat", octocat.com]

westurner commented 1 year ago

From https://twitter.com/simonw/status/1559969074607599617 w/ @simonw re: WASM package security controls and the just in a browser tab software supply chain :

I added plugin support to Datasette Lite - my distribution of @datasetteproj that runs entirely in the browser using Python and SQLite compiled to WebAssembly

You can now install extra plugins by adding ?install=plugin-name to the Datasette Lite URL

From https://simonwillison.net/2022/Aug/17/datasette-lite-plugins/ :

Since the ?install= parameter is being passed directly to micropip.install() you don’t even need to provide names of packages hosted on PyPI—you could instead provide the URL to a wheel file that you’re hosting elsewhere.

This means you can use ?install= as a code injection attack—you can install any Python code you want into the environment. I think that’s fine—the only person who will be affected by this is the user who is viewing the page, and the lite.datasette.io domain deliberately doesn’t have any cookies set that could cause problems if someone were to steal them in some way.

FWIU packages are persisted w/ SQLite in WASM, per-request?

"Pypi.org is running a survey on the state of Python packaging" (2022) https://news.ycombinator.com/item?id=32751603

rth commented 1 year ago

Should they mention MambaLite

Happy to add emscripten-forge under related projects (please open a PR); however, I don't think this belongs, at least for now, in the main section on how to install packages (which is already rather confusing). They are alternative distributions. For instance, you won't see "well, you can also install this with conda" in the pip documentation, or conda-forge advertising homebrew in its official documentation.

Though in any case, it would be good to figure out binary compatibility first between Pyodide and emscripten-forge packages

Also, I'm very happy to chat @DerThorsten and see points on which we could work together. For instance, we are currently unvendoring micropip from the monorepo so it's easier to reuse if necessary https://github.com/pyodide/pyodide/issues/3093

My point in closing this issue was that we can open more specific discussion points, but we can also continue the discussion here if you prefer.

@westurner thank you for your comments with this information, but I'm not entirely sure what you are proposing :)

DerThorsten commented 1 year ago

Though in any case, it would be good to figure out binary compatibility first between Pyodide and emscripten-forge packages

agreed, that would be a great first step!

Also, I'm very happy to chat @DerThorsten and see points on which we could work together. For instance, we are currently unvendoring micropip from the monorepo so it's easier to reuse if necessary #3093

Would be happy to work together more!

@westurner I have no clue what you are proposing

westurner commented 1 year ago

This issue having been closed, it seemed out of the way from actual progress that it might be holding up.

There are many systems for package metadata and signed cryptographic manifests (some with per-file hash checksums). Where we have functional overlap and duplication of effort, there is potential for security vulnerability.
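As a concrete example of the per-file hash checksums mentioned here, the sketch below checks a downloaded artifact against a digest taken from a manifest. The manifest layout and the `verify_sha256` helper are assumptions for illustration; it shows checksum verification only, not signature verification of the manifest itself.

```python
import hashlib
import hmac


def verify_sha256(payload: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 of payload with the hex digest from a manifest."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual, expected_hex.lower())


# Made-up artifact and manifest entry.
wheel_bytes = b"pretend this is a wheel file"
manifest = {"example-1.0-py3-none-any.whl": hashlib.sha256(wheel_bytes).hexdigest()}

ok = verify_sha256(wheel_bytes, manifest["example-1.0-py3-none-any.whl"])
tampered = verify_sha256(wheel_bytes + b"!", manifest["example-1.0-py3-none-any.whl"])
print(ok, tampered)
```

A checksum only helps if the manifest is obtained over a trusted channel or is itself signed, which is the part the different packaging ecosystems handle differently.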

How should pyodide's build change to better integrate with conda-forge (and emscripten-forge)? Hopefully the aforementioned tools copy the manifest signatures over when re-packing and re-hosting.

When pyodide / micropip was written:

There are many opportunities to drop the ball in build systems and application dependency composition; DevSecOps for software supply chain security.

How can {pyodide, micropip, mambalite,} require package signatures (to a standard better than the tools they build atop) in order to prevent (widescale) exploitation of browsers with WASM and no quotas and someday, local File System Access?

rth commented 1 year ago

Yes, these are good questions. Would you mind though opening a separate issue about end-to-end package signing, as it's a very specific technical point that it would be better to discuss separately from this very general issue?

In our case, as compared to pip or conda, the attack surface is very much reduced due to the browser sandbox. But yes, the security story can certainly be improved and there is still the usage in Node which doesn't provide a sandbox.

How should pyodide's build change to better integrate with conda-forge (and emscripten-forge)?

Short answer is we don't plan to integrate with conda forge anymore. emscripten-forge is working on that. For Pyodide we are going for better integration with PyPI, PyPA tooling, cibuildwheel etc.

westurner commented 1 year ago

Are there projections as to what the load impact on PyPI and its CDN expenses will be from WASM apps pulling dependencies on every run? https://pypi.org/sponsors/

https://discuss.python.org/t/draft-pep-pypi-cost-solutions-ci-mirrors-containers-and-caching-to-scale/3681 https://groups.google.com/g/pypa-dev/c/Pdnoi8UeFZ8


westurner commented 1 year ago

(I must have confused "building pyodide" / "building pyodide packages like conda-forge / emscripten-forge" with "just have piplite solve and install the dependency graph from PyPI for every page load")

  • "[Discussions on Python.org] [Packaging] Draft PEP: PyPI cost solutions: CI, mirrors, containers, and caching to scale"

https://discuss.python.org/t/draft-pep-pypi-cost-solutions-ci-mirrors-containers-and-caching-to-scale/3681 https://groups.google.com/g/pypa-dev/c/Pdnoi8UeFZ8

  • Pip downloads wheels for every CI build and deployment
  • Pip does not download wheels for every process invocation
  • Micropip downloads wheels for every page load process invocation from PyPI
  • MambaLite downloads empkg rebuilds of conda packages unless there's a noarch conda package, but their HTTP headers aren't set up for CDN use in the "download every package on every process invocation / page load" case


hoodmane commented 1 year ago

@westurner I opened #3127 for CDN resource usage discussion.

yuvipanda commented 1 year ago

Just as an FYI, I opened https://github.com/conda-forge/staged-recipes/pull/20961 adding micropip as a conda-forge noarch package, which should (I hope? idk :D) allow emscripten-forge and micropip to be used together like how conda-forge and pip can be used together.

wolfv commented 1 year ago

awesome @yuvipanda – that's great news. It didn't even come to my mind to just add it to conda-forge :)

westurner commented 10 months ago

@yuvipanda