Hi @skrawcz welcome to pyOpenSci! Thank you for your presubmission inquiry.
I'll help you figure out if the package is in scope, and if we should move to a full submission, at which time I would find an editor.
The categories you have indicated look right to me.
A couple of questions:
Could you please clarify how Hamilton relates to those other tools?
Can you talk about overlap with snakemake? If you have examples where researchers are already using Hamilton, that would help.
And please just say a little generally about what the Hamilton authors are hoping to achieve by a pyOpenSci review.
To be clear, I fully agree with you that Hamilton is a tool that scientists could potentially use for munging and reproducibility.
Thank you!
edit: removed comment that could give the wrong impression about our scope
I have not used data orchestration tools extensively but my understanding is that Hamilton is similar to tools like airflow, metaflow, dagster, luigi, etc., as one of your core contributors comments here. I'm a bit surprised at the statement that there are no other Python packages that accomplish similar things. Could you please clarify how Hamilton relates to those other tools?
Sure.
Hamilton does not replace airflow, metaflow, dagster, luigi, etc. They focus on the "macro" scheduling problem, and are systems that require state to be managed. Hamilton focuses on the "micro", i.e. what people do within the step of an airflow, metaflow, dagster, luigi, etc. task. Hamilton replaces lines of logic with functions and tries to make that part of a code base testable, documentation friendly, and maintainable, which is not the goal of those other systems.
For example, here's a blog post showing Hamilton + Metaflow - Hamilton helps with the feature engineering task, and metaflow does the macro orchestration.
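To illustrate the "micro" point above, here is a minimal plain-Python sketch of what "replacing lines of logic with functions" can look like. It does not use Hamilton itself; the column names (`spend`, `signups`) and all function names are hypothetical, chosen only for illustration.

```python
from typing import List


# Script style: the transforms live in consecutive lines of one routine,
# so the intermediate steps cannot be tested or documented on their own.
def script_style(spend: List[float], signups: List[int]):
    avg = sum(spend) / len(spend)
    per_signup = [s / n for s, n in zip(spend, signups)]
    zero_mean = [s - avg for s in spend]
    return per_signup, zero_mean


# Function style: one function per named output. The function name is the
# output, and the parameter names declare what it depends on.
def avg_spend(spend: List[float]) -> float:
    """Mean of the spend column."""
    return sum(spend) / len(spend)


def spend_per_signup(spend: List[float], signups: List[int]) -> List[float]:
    """Spend divided by signups, element by element."""
    return [s / n for s, n in zip(spend, signups)]


def spend_zero_mean(spend: List[float], avg_spend: float) -> List[float]:
    """Spend with its mean subtracted."""
    return [s - avg_spend for s in spend]
```

Each function in the second style can be unit tested with hand-made inputs, for example `spend_zero_mean([10.0, 20.0, 30.0], 20.0)`, without running the rest of the pipeline.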
Other differences:
The most similar package I'm aware of in the open science space is snakemake. If it helps, here are the results from clicking on the snakemake "topic" on GitHub: many examples where researchers have shared code that relies on the package. Again, I don't know data orchestration tools super well, but if there's significant overlap then I'm not sure this would be quite in scope for us. Can you talk about overlap with snakemake? If you have examples where researchers are already using Hamilton, that would help.
From snakemake:
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
It sounds like Snakemake is basically an orchestration system. Hamilton is much lighter-weight and focused only on pure Python. Given the feature set snakemake has, if that's the bar for reproducibility, then I don't think Hamilton meets it. I'll retract that category.
And please just say a little generally about what the Hamilton authors are hoping to achieve by a pyOpenSci review. To be clear, I fully agree with you that Hamilton is a tool that scientists could potentially use for munging and reproducibility. But our focus is on helping authors make sure their packages are more usable by scientists. I get the impression that development of Hamilton is already quite mature (and already supported by a tech company 🙂 ) so I'm not sure how beneficial it would be to you all to go through review.
(1) I saw pandera on the list, and I think it's a good tool for a scientist to know about (Hamilton supports integration with it), and thus thought to myself that Hamilton would be a fit here. (2) I genuinely think it's a good tool for anyone doing Python data transform work (e.g. pandas); back when I was studying, research code wasn't the best quality. I think tools like Hamilton & Pandera can help teams ensure their software projects live much longer than their PhD. (3) We have people at labs picking Hamilton up, e.g. at https://www.pnnl.gov/, so I see a fit with scientists. Part of applying is to learn how we can make it better and easier for scientists to pick up and use Hamilton (I don't know what I don't know) :).
Thanks for the questions -- let me know what I can clarify or what you want to dive deeper on.
Thank you @skrawcz, that's very helpful.
I think I understand that Hamilton is a pure Python way to do feature engineering (in brief).
I'm glad you mentioned pandera, it did also occur to me they might work well together.
If there are any public repositories from nat'l labs like PNNL that would provide examples of using Hamilton in the wild, that could also help us see its application to open science.
I hear you that one of your goals is to better understand how to reach this community of users. So your reasons for seeking review make sense to me and make me feel like this could be in scope.
I am discussing with the executive director and other editors. Please let me get back to you with any further questions or a decision by Monday at the latest.
If there are any public repositories from nat'l labs like PNNL that would provide examples of using Hamilton in the wild, that could also help us see its application to open science.
Sure -- here's what I've found in my notes:
example transforms: https://github.com/IMMM-SFA/naturf/blob/feature/nodes/naturf/nodes.py
how it's run: https://github.com/IMMM-SFA/naturf/blob/feature/nodes/naturf/driver.py
Hi again @skrawcz, thank you for providing that example. It's very helpful to see.
We have decided that yes, we will review the package, as it is in scope.
Some context on the decision, for you, for our own future reference, and for transparency: as I noted above, we see that Hamilton has already had support for its development, and there is a proceedings paper, although a publication review is not the same as our software review. One of our goals is to provide resources to packages that have not yet enjoyed this kind of support. But it is also within our scope to help build consistency across the whole scientific Python ecosystem. You have clearly shown that (1) you are interested in participating in this process as an author and (2) there are researchers using the code now who are part of a community we want to build connections with.
For those reasons we will proceed with a review.
I expect that we will find an editor by early next week.
@skrawcz could you please go ahead and make a full submission issue? I will close this one once you have done so.
Thanks! Looking forward to the review 🚀
Awesome, thanks @NickleDave. I'll get started on the full issue. Will finish it in the next 24-72 hours or so :)
Okay I did https://github.com/pyOpenSci/software-submission/issues/80; still working on JOSS section, otherwise I think I filled it out appropriately.
Closing this since full submission is in #80
Submitting Author: Stefan Krawczyk (@skrawcz)
Package Name: Hamilton (sf-hamilton on pypi)
One-Line Description of Package: A general purpose micro-framework for defining dataflows.
Repository Link (if existing): https://github.com/stitchfix/hamilton
Description
Hamilton is a general purpose micro-framework for creating dataflows from Python functions! Specifically, Hamilton defines a novel paradigm that allows you to specify a flow of (delayed) execution that forms a Directed Acyclic Graph (DAG). It was originally built to solve the challenges of wrangling and maintaining production code that creates wide (1000+ column) dataframes, but has been extended to model the generation of any Python object. Core to the design of Hamilton is a clear mapping of function name to dataflow output. That is, Hamilton forces a declarative paradigm expressed through writing Python functions, and aims for DAG clarity, low code upkeep costs, and ease of modification, with code that is always unit testable and naturally documentable.
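As a rough illustration of the paradigm described above, the following plain-Python sketch resolves a small DAG from function names and parameter names. It is not Hamilton's implementation or API: the `execute` helper and the example functions are hypothetical, and only the standard-library `inspect` module is used.

```python
import inspect
from typing import Any, Callable, Dict, List


# Each function name is an output; each parameter name is a dependency.
def mean_value(raw_values: List[float]) -> float:
    return sum(raw_values) / len(raw_values)


def centered(raw_values: List[float], mean_value: float) -> List[float]:
    return [v - mean_value for v in raw_values]


def max_abs(centered: List[float]) -> float:
    return max(abs(v) for v in centered)


def scaled(centered: List[float], max_abs: float) -> List[float]:
    return [v / max_abs for v in centered]


def execute(functions: List[Callable], outputs: List[str],
            inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: compute only what the requested outputs need."""
    nodes = {fn.__name__: fn for fn in functions}
    cache = dict(inputs)  # external inputs count as already-computed nodes

    def resolve(name: str, visiting: tuple = ()) -> Any:
        if name in cache:
            return cache[name]
        if name in visiting:
            raise ValueError(f"cycle detected at {name!r}")
        if name not in nodes:
            raise KeyError(f"no function or input named {name!r}")
        fn = nodes[name]
        # Recursively resolve every parameter before calling the function.
        kwargs = {p: resolve(p, visiting + (name,))
                  for p in inspect.signature(fn).parameters}
        cache[name] = fn(**kwargs)
        return cache[name]

    return {name: resolve(name) for name in outputs}


# Nothing runs until outputs are requested (the "delayed" execution).
result = execute([mean_value, centered, max_abs, scaled],
                 outputs=["scaled"],
                 inputs={"raw_values": [2.0, 4.0, 6.0]})
print(result)  # {'scaled': [-1.0, 0.0, 1.0]}
```

Because the graph is derived from names alone, adding a new output means adding one function, and each function stays testable in isolation.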
Scope
Please indicate which category or categories this package falls under:
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
data munging: Hamilton was built for a team to manage their time-series forecasting feature engineering. So its design goal was to help data science teams maintain data munging code well.
reproducibility: Core to reproducibility is sharing code. Most researchers only share data, not their code. We believe that with Hamilton, one could more easily share their implementation, in a standardized way that is approachable to a broad audience.
data extraction: Kind of unsure here. But Hamilton helps you structure and "orchestrate" the code that does extraction.
data retrieval: Kind of unsure here. But Hamilton helps you structure and "orchestrate" the code that does retrieval.
Anyone doing any data transformations in python.
Scientific applications: time-series forecasting, any machine learning, any work that involves executing a dataflow.
None that the author is aware of.