Open maximlt opened 1 week ago
Noting that the ability to download data while preparing the project/environment is useful for instance when an example is deployed. Otherwise, if an example requires a large dataset to be downloaded, the first user is going to have to wait a little too long :) No big deal but not great.
cc @jbednar as I am aware this is a topic you're thinking about these days.
Paper a bit outdated (5 years old) by the authors of pooch, stating that the only alternatives to their knowledge were fsspec and intake: https://github.com/fatiando/pooch/blob/main/paper/paper.md.
If we don't want to commit to Intake anymore and there's no tool replacing it that meets our needs, then I can imagine we could standardize something around a data.py file that each project has (when it needs external data), with a command line interface (argparse) for downloading (and unarchiving) the data with the tool we choose (e.g. pooch), and also for setting up the test data. We'd run this file with the right command line arguments before testing/building the project. Then, to read the data, we'd either do it in the notebook when it's simple (e.g. df = pd.read_csv('data/dataset.csv')) and/or have a utility function in data.py that hides/abstracts this for the more complex cases, as is done with Intake (e.g. from data import get_complex_data; ds = get_complex_data()).
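To make the idea concrete, here is a minimal sketch of what such a data.py could look like. Everything in it is an assumption for illustration: the registry names (DATASETS, TEST_DATASETS), the --data-dir and --test flags, and the fetch helper are hypothetical, and the download step uses only the standard library so the sketch stays self-contained. With pooch, fetch could presumably be swapped for a call to pooch.retrieve.

```python
"""data.py: hypothetical per-project data helper (stdlib-only sketch)."""
import argparse
import shutil
import urllib.request
from pathlib import Path

# Hypothetical registries: file name -> (URL, whether to unpack the archive).
# Left empty here; each project would fill in its own entries.
DATASETS = {
    # "dataset.csv": ("https://example.com/dataset.csv", False),
}
TEST_DATASETS = {
    # "dataset.csv": ("https://example.com/small_dataset.csv", False),
}
DATA_DIR = Path("data")


def fetch(url, dest, unpack=False):
    """Download url to dest unless it already exists; optionally unpack it in place."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    if unpack:
        # Extract the archive (zip, tar, ...) next to the downloaded file.
        shutil.unpack_archive(dest, dest.parent)
    return dest


def build_parser():
    parser = argparse.ArgumentParser(description="Prepare the data for this project.")
    parser.add_argument("--data-dir", type=Path, default=DATA_DIR,
                        help="directory the files are written to")
    parser.add_argument("--test", action="store_true",
                        help="fetch the small test data instead of the full data")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    registry = TEST_DATASETS if args.test else DATASETS
    for name, (url, unpack) in registry.items():
        fetch(url, args.data_dir / name, unpack=unpack)


if __name__ == "__main__":
    main()
```

Under these assumptions, the build would run python data.py before building the project and python data.py --test before testing it. A reader function such as get_complex_data() could live in the same file for the cases that are too complex to load directly in the notebook.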
Are we not going to be able to have this capability in conda-project? Would be good to discuss that with the conda-project developers and see what the right approach could be. Projects do generally need to have data or they won't be useful...
> Are we not going to be able to have this capability in conda-project? Would be good to discuss that with the conda-project developers and see what the right approach could be.
I don't know, other tools like uv, poetry, pixi don't have that built-in. I'm not sure I want to push this feature request, feel free to do so! What I'm also uncomfortable with is the feeling that we're re-creating anaconda-project, and also the idea of being locked into a tool (which so far has no users) because of a unique feature.
> Projects do generally need to have data or they won't be useful...
Data projects yes but that's not all application projects (e.g. a simple GUI), and library projects usually not.
Sure, but uv, poetry, and pixi aren't specifically made for data projects like those in this repo, and conda-project is, so here I'm talking about data projects. Plus the number of data projects greatly outweighs the number of application projects. E.g. there are currently 10 million Jupyter Notebooks on GitHub, versus maybe some hundreds of thousands of libraries that get packaged up. So I'm concerned about having a good solution for data projects, whether that solution is in conda-project or via some other tool.
> Sure, but uv, poetry, and pixi aren't specifically made for data projects like those in this repo, and conda-project is, so here I'm talking about data projects.
Really? There's not a single mention of the word "data" in conda-project's README (https://github.com/conda-incubator/conda-project).
> Plus the number of data projects greatly outweighs the number of application projects. E.g. there are currently 10 million Jupyter Notebooks on GitHub, versus maybe some hundreds of thousands of libraries that get packaged up.
Application projects don't include libraries in my mind, but things like API, CLI, GUI, scripts, etc. I can't tell if there are more of them than data application projects, but yes for sure there are many data projects out there.
> So I'm concerned about having a good solution for data projects, whether that solution is in conda-project or via some other tool.
I'd also love to have a good solution for data projects. But as someone who got to maintain Examples for a little while, I wouldn't commit to a tool that makes it more difficult to maintain Examples (not well maintained, low adoption, etc.). In which case, I'd rather rely on something custom that can easily be migrated if need be.
Yes, really. :-) The conda-project README says:
> Sharing your work is more than sharing your code in a script file or notebook. To make your work properly reproducible, it is necessary to include the list of required third-party dependencies, specifications for how to run your code, and any other files that it may need.
The "other files" includes data; what else would that be? Then it links to my "8 Levels of Reproducibility", which was written about data projects, or at least notebooks or dashboards rather than libraries or APIs or CLIs. Then it says:
> This package is intended as a successor to Anaconda Project.
Which in turn says:
> Tool for encapsulating, running, and reproducing data science projects.
> Take any directory full of stuff that you're working on; web apps, scripts, Jupyter notebooks, data files, whatever it may be. By adding an anaconda-project.yml to this project directory, a single anaconda-project run command will be able to set up all dependencies and then launch the project.
So sure, conda-project needs some better, clearer docs, but I consider it to be coming very clearly from a perspective of "package up some code with all the stuff needed to reproduce a result" rather than something like "I have written a library I want to share with other people who will then import it" or "I have written an end-user application that I want to publish on an app store".
> But as someone who got to maintain Examples for a little while, I wouldn't commit to a tool that makes it more difficult to maintain Examples (not well maintained, low adoption, etc.). In which case, I'd rather rely on something custom that can easily be migrated if need be.
Well, conda-project isn't something that came from heaven; it was written by some co-workers of yours, and so I think you can either (1) contribute to making it be something that meets your needs, (2) write something completely custom, or (3) find something that already meets your needs. I haven't seen (3) show up in this thread or elsewhere, and between 1 and 2 I'd vote for 1, since collaborating on a shared tool that we together make into something valuable seems much better than us developing some custom solution just for our narrow use case, which would mean something with even lower adoption and even worse maintenance.
I've opened an issue to ask about that feature on conda-project https://github.com/conda-incubator/conda-project/issues/176
> Well, conda-project isn't something that came from heaven; it was written by some co-workers of yours, and so I think you can either (1) contribute to making it be something that meets your needs, (2) write something completely custom, or (3) find something that already meets your needs. I haven't seen (3) show up in this thread or elsewhere, and between 1 and 2 I'd vote for 1, since collaborating on a shared tool that we together make into something valuable seems much better than us developing some custom solution just for our narrow use case, which would mean something with even lower adoption and even worse maintenance.
What I want more than anything else is that, when we decide to migrate away from anaconda-project (or are forced when it starts to break, e.g. with a new Python version), we pick a tool that is already widely used.
To be clear, "data project" does not necessarily imply that there is the ability to fetch data; it just means that we are expecting that most projects will somehow work with data. Fetching data is only crucial when datasets are much larger than the rest of the project such that it makes sense to treat them differently. So while I strongly consider conda-project to be about data projects primarily, whether it should have functionality about fetching data is a separate question best discussed at that issue.
anaconda-project has a handy feature that allows declaring a series of files to download (and optionally unzip) when preparing a project (see https://anaconda-project.readthedocs.io/en/latest/user-guide/reference.html#file-downloads). Some day we will need to replace anaconda-project with another tool (e.g. conda-project, pixi) that, at the moment, doesn't provide this feature. To prepare for this transition, we'll need to find an alternative way to download data.
Features we use:
filename: data) for archives to unzip
Potential alternatives: