Hamilton - Githubissues

skrawcz commented 1 year ago

Submitting Author: Stefan Krawczyk (@skrawcz) All current maintainers: (@skrawcz, @elijahbenizzy) Package Name: Hamilton (sf-hamilton on pypi) One-Line Description of Package: A general purpose micro-framework for defining dataflows. Repository Link: https://github.com/stitchfix/hamilton Version submitted: 1.15.0 Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
Version accepted: TBD Date accepted (month/day/year): TBD

Description

Include a brief paragraph describing what your package does: Hamilton is a general purpose micro-framework for creating dataflows from python functions! Rather than write procedural python code in scripts, one declares python functions that together define a Directed Acyclic Graph (DAG). It was originally built to solve the (micro) challenges in wrangling and maintaining production code to create wide (1000+) column dataframes, but has been extended to enable modeling any python object generation. Hamilton aims for code clarity, low code upkeep costs, ease of modification, with always unit testable and naturally documentable code.

Scope

Please indicate which category or categories this package falls under:
- [x] Data retrieval
- [x] Data extraction
- [x] Data munging
- [ ] Data deposition
- [ ] Reproducibility
- [ ] Geospatial
- [ ] Education
- [ ] Data visualization*
For all submissions, explain how the and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Who is the target audience and what are scientific applications of this package?
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

Pre-submission enquiry - https://github.com/pyOpenSci/software-submission/issues/74

Who is the target audience and what are scientific applications of this package?

Broadly speaking, anyone doing any data transformations in python. This covers writing code to retrieve, extract, and munge data. Hamilton wants to help you structure the code that you write for these tasks.

Scientific applications: primarily feature engineering (retrieve, extract, munge), and then connecting it with modeling, e.g. time-series forecasting, machine learning; basically any work that involves executing a dataflow.

data munging

Hamilton was built for a team to manage their time-series forecasting feature engineering. So it's design goal was to help data science teams maintain data munging code well.

data extraction data retrieval

I see these two as the precursor steps to data munging. With Hamilton you can "model" these as steps in your dataflow and connect it with data munging. Hamilton does not provide any special features here, other than helping you structure you code in a way to ensure that these pieces are modular and can be swapped out easily.

Are there other Python packages that accomplish the same thing? If so, how does yours differ?

Hamilton is a (Directed Acyclic Graph) DAG framework, so has similarities with other such systems that think similarly, e.g. airflow, dagster, etc. Where Hamilton differs, is that Hamilton focuses on the "micro", whereas these frameworks focus on the "macro". E.g. with Hamilton you can model feature engineering as multiple nodes in a DAG and Hamilton can help construct the dataframe that you want. You can't model that process with airflow, instead airflow wants you to string together tasks that represent "macro" things you might want to do, e.g. extract features, train a model, predict. Because of Hamilton's focus on the micro, it's actually embeddable within these "macro" frameworks, and in general, anywhere that python can run, e.g. jupyter notebooks, pyspark, python scripts, python webservices. For example, here's a blog post showing Hamilton + Metaflow - Hamilton helps with the feature engineering task, and metaflow does the macro orchestration.

Hamilton is similar in spirit to Pandera. Hamilton has a focus on testable, documentation friendly, and decoupled code, and it's only a library not a whole system that needs to be setup and managed. In this way it's similar in spirit to Pandera -- which is trying to make it simpler to maintain data pipelines. Of course Hamilton provides an integration with Pandera!

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

[x] does not violate the Terms of Service of any service it interacts with.
[x] has an OSI approved license.
[x] contains a README with instructions for installing the development version. See here.
[x] includes documentation with examples for all functions.
[x] contains a vignette with examples of its essential functions and uses.
[x] has a test suite.
[x] has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.

Publication options

[x] Do you wish to automatically submit to the Journal of Open Source Software? If so:

JOSS Checks

- [x] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS. - [x] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria. - [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`. - [ ] The package is deposited in a long-term repository with the DOI: *Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

[x] Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Code of conduct

[x] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Please fill out our survey

[x] Last but not least please fill out our pre-review survey. This helps us track submission and improve our peer review process. We will also ask our reviewers and editors to fill this out.

P.S. *Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

skrawcz commented 1 year ago

I'm still working on the JOSS piece, but should have that done this week.

The package is deposited in a long-term repository with the DOI:

Though I'm not entirely sure how to fulfill this ^ requirement -- I haven't search for the answer yet, but if you have a link that would be appreciated.

NickleDave commented 1 year ago

🎉 excellent, thank you @skrawcz

I will reply in more detail tomorrow or Friday (have jury duty 😕) but we do already have an editor lined up.

Though I'm not entirely sure how to fulfill this ^ requirement

The easiest way would be to use Zenodo with the GitHub integration. https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

Note that we will ask you to release a new version after the review. So you can get this set up now but we do not want a DOI yet, that's for us and JOSS to be able to point to a release that's associated with the finished review.

This is not required for review but you probably also want to add a citation.cff which has GitHub integration as well (gives you a dropdown on the repo page that people can click to copy the citation): https://citation-file-format.github.io/

NickleDave commented 1 year ago

Hi @skrawcz and thank you for taking the time to make the full submission.

I need to ask about one aspect of Hamilton we did not discuss before.

I see the developers recently added opt out telemetry. Two questions:

Are users given any notification that telemetry is being used if they do not opt out? E.g., with a print statement, through logging. or warnings.warn? This was not clear to me from the README.
I am really sympathetic to the devs wanting simple usage stats, and the language in the README makes it sound like you want to limit what information is collected to just this anonymous data. But orgs similar to ours usually require that any telemetry be opt in (e.g. Linux Foundation). How open would you all be to making the telemetry opt in instead?

skrawcz commented 1 year ago

Are users given any notification that telemetry is being used if they do not opt out? E.g., with a print statement, through logging. or warnings.warn? This was not clear to me from the README.

No, but I just added it - https://github.com/stitchfix/hamilton/pull/302. Otherwise we did review this with some community members and no one had any objections to us adding it.

I am really sympathetic to the devs wanting simple usage stats, and the language in the README makes it sound like you want to limit what information is collected to just this anonymous data. But orgs similar to ours usually require that any telemetry be opt in (e.g. Linux Foundation). How open would you all be to making the telemetry opt in instead?

This is the struggle with a library. There is no easy way to have an explicit set up step. By having it opt-out (we've only had it on for a month) it's already been valuable, e.g. python version usage confirmed we could remove python 3.6 support, and otherwise is helping inform us what parts of the library are being used; we simply had no clue before. So at this stage, we're not looking to make it opt-in, and I feel comfortable with that, given that we only capture information as it relates to use of the library. Much like you can capture website behavior without requiring a cookie. At this stage we can't identify anyone by their usage, and if that was not true, or if we were to capture more identifiable information, then yes I think it should be opt-in.

NickleDave commented 1 year ago

No, but I just added it - stitchfix/hamilton#302.

Thank you 🙏 I think this goes a long way towards alleviating our concerns, and is similar to what Homebrew does (as you may already know).

By having it opt-out (we've only had it on for a month) it's already been valuable ... and ... is helping inform us what parts of the library are being used

I hear you, believe me.

As you can probably tell, this is a case where we did not yet have policy in place. We need to be sure we're doing what would be acceptable to the rest of the scientific Python community here, I hope you can understand. Please let me update you tomorrow afternoon at the latest.

NickleDave commented 1 year ago

Hi @skrawcz, following up on this:
I did have a chance to discuss with our exeuctive director @lwasser. She has decided we cannot proceed with this review until we have a policy in place regarding use of telemetry.

I understand if you are frustrated, especially after I told you we would review. It is my fault for not bringing up telemetry in the discussion on the presubmission issue.

As I said, we're sympathetic to why as a developer you would want telemetry. But we also need to consider researchers that rely on our packages and our partners in the scientific Python and open science communities.

Please understand we need to take our time here. Because we are growing, and in some ways we are a new organization, we cannot afford to upset people with a decision whose impact we have not fully considered. @lwasser has already done a lot of work getting input from our community members and sisters orgs on their telemetry policies. But she is out of office today and will be traveling early next week, so we will need to get back to you after that.

Please let me know if that makes sense.

skrawcz commented 1 year ago

@NickleDave I'm not in a hurry. Happy to wait.

NickleDave commented 1 year ago

Hi @skrawcz thank you again for your patience.

@lwasser has gotten input on telemetry from our own community, the scientific Python community, rOpenSci and other open source community leaders.

Based on that feedback, @lwasser has decided to adopt the policy that we will require that any packages using telemetry do so in an opt-in fashion only.

We are in the process of adding that to our guide here (as part of a larger PR): https://github.com/pyOpenSci/software-peer-review/pull/162/

Like I said before, we are really sympathetic to why developers would want telemetry.
But it's clear that most people strongly prefer that any library using telemetry do so only by first requesting user permission.

Please let me know how that sounds.

skrawcz commented 1 year ago

@NickleDave @lwasser I'd love to know more about how you consulted with people, just to really understand what the concerns are. My worry that a blanket "all telemetry has to be opt-in" is short-sighted. I instead, think that you should consider guidelines on what is acceptable to be tracked (and for what purpose) and what isn't, and then go further as to what requires opt-in and what requires opt-out, and finally, what has to be shared back with the community for accountability. I'd be very happy to have some guidance here -- to my knowledge there isn't a standard here -- so why not be pioneers and start one?

Otherwise a couple of thoughts:

(1) Unfortunately the mechanics of installing python packages means that an explicit prompt to the user is impossible before first use. If that was something we could do, I would, but I can't. Similarly, making something opt-in, is very unlikely to gain much adoption at all (it's not core to the experience), especially in the case of a library that is used on ephemeral machines and environments -- people would likely have to always turn it on via code, which is unlikely to happen as it adds boilerplate bloat.

(2) From our conversations with our community, we did not have a single detractor about it being a default -- therefore why make something that works for most users not the default and is only a concern for the minority?

(3) People are using free software that someone's work and time went into. I think it's a fair exchange to understand how people are using that free work to make it even better -- assuming (this is where a policy guideline would help) the data collected is reasonable for that purpose and not trying to capture too much.

(4) Telemetry really helps projects. It's what industry does to improve iteration cycles, as it improves decisions and confidence in investment within the project. E.g. for us it confirmed we could remove python 3.6 support, and gives us a good breakdown of where people are in the python version ecosystem.

More than happy to jump on a call if that's easier to discuss.

Otherwise thanks for the consideration and time you are putting into to vet and produce a curated list of libraries for the python community.

lwasser commented 1 year ago

good morning @skrawcz i'd love to chat with you more about this!

let's keep this issue open for the time being and focus attention on being said pioneers!

Here is our open PR where we're drafting our policy around telemetry. Stefan, please feel free to read and comment because every iteration has improved the language!

I'm also open to a call. finally feel free to email me leah at pyopensci.org and we can find time next week to chat!

this is something we weren't expecting to tackle so soon but i do think it's important to tackle it! I also know that other packages in the ecosystem would benefit from usage data. i am going to ping you over in a new issue about this and in the pr. we can have a conversation there. i'll also bring in a few other core scientific python devs to join in on the conversation.

I really appreciate your thoughts above and would like to engage the community in a conversation here to ensure our policies are in the best interests of BOTH users and maintainers. does that sound reasonable to you?

lwasser commented 1 year ago

open issue is here - https://github.com/pyOpenSci/software-peer-review/issues/183

skrawcz commented 1 year ago

Awesome, sounds good @lwasser. I might be a bit slow, but will try to pen some thoughts. Will send you an email too. Cheers.

lwasser commented 3 days ago

Hi everyone 👋🏻 I am going through our issues and closing issues that have been open with no activity for longer than a year. @skrawcz, if in the future you'd like to resubmit or reopen this issue, you are welcome to! For the time being, I will close this one. Please let me know if you have any questions!

pyOpenSci / software-submission

Hamilton #80

Description

Scope

Technical checks

Publication options

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

Code of conduct

Please fill out our survey

Editor and Review Templates