pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

Presubmission Inquiry for sciform (float -> string scientific formatting) #114

Status: Closed (jagerber48 closed this issue 10 months ago)

jagerber48 commented 1 year ago

Submitting Author: Justin Gerber (@jagerber48)
Package Name: sciform
One-Line Description of Package: Provides extended functionality for formatting floats into strings according to scientific standards
Repository Link (if existing): https://github.com/jagerber48/sciform


Code of Conduct & Commitment to Maintain Package

Description

Community Partnerships

We partner with communities to support peer review with an additional layer of checks that satisfy community requirements. If your package fits into an existing community please check below:

Scope

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo
- [ ] Unsure/Other (explain below)

There are no existing community partnerships for this project, though there may be opportunities for education around significant figures and uncertainty.

This package is very new and has 1 user so far. Me. But I've been kicking around code for this sort of formatting for quite some time now and think many others would find it useful. Having a small authoritative package for this sort of formatting could be useful for the scientific community. There is also some interest in getting some of these features into Python's built-in string formatting feature set, which would be very useful. Having a package like this could be a stepping stone towards that. See https://discuss.python.org/t/new-format-specifiers-for-string-formatting-of-floats-with-si-and-iec-prefixes/26914/46. Though I do note that sciform's format specification mini-language is intentionally not 100% backwards compatible with the built-in format specification mini-language, so it would not be a top candidate for that role.

I'm also not very experienced when it comes to contributing to open source software. This is one of my first forays into that world, so I am learning as I go.


NickleDave commented 1 year ago

Welcome @jagerber48 and thanks so much for your detailed pre-submission inquiry.

After a first read-through, sciform looks like it could be in scope for a pyOpenSci review, but we'd like to ask you for a little more information.

  1. I see above you marked the package as pangeo affiliated -- can you confirm whether that's the case, or was it a mistake? Maybe something about our form wasn't clear there?

  2. The goals for sciform seem clear, but can you please say more about who the target audience is, and what the use cases are? We very much welcome newer packages for review, so it's fine if you feel like you are currently the only user, but who do you intend to use it more broadly? Along the same lines: the docs have a lot of great detail about how to use sciform, but I'm not finding a lot about who would use it, where and when. Should I use it to format papers or reports? Should I use it to make numerical experiments more replicable? I have a feeling you already have some use cases in mind, but maybe you are so focused on development right now that you just haven't written a lot about those yet. Some vignettes in the documentation with walkthroughs of use cases would really help, something like "here's how you'd use sciform to do x". Please let us know more about who you're developing sciform for and when you see them using it.

jagerber48 commented 1 year ago

Hello @NickleDave, thanks for your response!

  1. Yes, the Pangeo affiliation was a misunderstanding on my part. I thought pyopensci had a requirement to be conformant with Pangeo in addition to pyopensci, so I was indicating willingness to conform there as well. I've unchecked that box.

  2. Good question. I will think about this but here are my off-the-cuff thoughts.

    • My main use case for this sort of formatting is when I am printing data analysis results to a terminal (in some console program) or in a Jupyter notebook. I'm trying to view data analysis results part way through a workflow or at the end of the workflow, and I would like to see them, e.g., in engineering notation rounded to 3 significant figures. For this case the main "consumers" of sciform outputs are Python print or logger functions. Sometimes I make a table using the tabulate package, fill it with formatted strings, and print that to the terminal.
    • The next important use case would be integrating these formatted strings into plots in a variety of ways. One very nice example would be to plot tick labels using prefix notation, e.g., instead of having "1e3" out at the edge of the axis, you would have tick labels like 1 k, 2 k, 10 k, 50 k, etc. Of course, including strings like this in annotations could be useful as well.
    • The uncertainties package provides the ability to translate formatted value +/- uncertainty strings into LaTeX code, but I need to understand what the use cases are there. I'm not advanced enough in LaTeX to understand how someone would integrate Python data analysis directly into LaTeX source code for a document. I would just copy the numbers into LaTeX manually and round/format them as needed, but maybe there's a more sophisticated workflow I'm missing. Edit: I think the LaTeX mode might be useful for putting strings with superscripts and pretty +/- symbols into matplotlib graphs.
    • There might be use cases for writing formatted data into e.g. csv or other types of files for storing data. But I would suggest that if you are storing numbers you should store them appropriately as ints or floats, not strings.

I can include the tabulate and matplotlib use cases in the documentation. I think those would be illustrative use cases people could look at.
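The first use case above (engineering notation rounded to 3 significant figures) can be sketched with the standard library alone. This is only an illustration of the idea, not sciform's API: the function name `eng_format` and its output style are assumptions made for this example.

```python
import math
from decimal import Decimal


def eng_format(value: float, sig_figs: int = 3) -> str:
    """Hypothetical helper: engineering notation with `sig_figs` significant figures."""
    if not math.isfinite(value):
        return str(value)
    # Let the built-in "e" presentation type do the rounding to sig_figs digits.
    mantissa_str, exp_str = f"{value:.{sig_figs - 1}e}".split("e")
    exponent = int(exp_str)
    # Engineering notation uses exponents that are multiples of 3.
    eng_exponent = 3 * (exponent // 3)  # floor division also handles negatives
    shift = exponent - eng_exponent     # 0, 1, or 2
    # Shift the decimal point exactly, using Decimal to avoid float noise.
    mantissa = Decimal(mantissa_str) * 10**shift
    decimals = max(sig_figs - 1 - shift, 0)
    return f"{mantissa:.{decimals}f}e{eng_exponent:+03d}"


for x in (123456, 0.00012345, 47000):
    print(eng_format(x))
```

Strings produced this way can be passed to `print`, a logger, or a table-building package in the same way as any other string.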

lwasser commented 1 year ago

hey @jagerber48 !! 👋 welcome! I just had a question not related to this specific review. What on that form would make it more clear that pangeo is an optional thing? We have an affiliated partner program and that check just allows someone to ALSO become pangeo affiliated. But it's not a requirement. How could we make that more clear, as you are not the first person to be confused by that!!

Also I'm wondering then if this tool would really be a support tool for reproducible reports (which is important to our open science goals)? If it's really about printing and output, does that type of application (reproducible reports, jupyter notebook output, etc.) resonate with your goals for the tool?

jagerber48 commented 1 year ago

@lwasser Thank you for the welcome and your questions/comments!

About the Pangeo option from my perspective: Something like "You may optionally choose to affiliate your package with additional communities by checking the boxes below. These affiliations may come with XYZ benefits/additional requirements". Even just an "(optional)" flag may have cleared it up for me. "If your package fits into an existing community please check below:" is a challenging sentence because I don't know what these communities are and I didn't want to learn it at the time. So yeah, replacing this with something like "If you would like to affiliate your package with an existing community, please check below" would have helped me, I think.

> Also i'm wondering then if this tool would really be a support tool for reproducible reports (which is important to our open science goals)? If it's really about printing and output. Does that type of application (reproducible reports/ jupyter notebook output, etc). resonate with your goals for the tool?

"would really be a support tool for reproducible reports". What are "reproducible reports"? The tool takes Python floats or float pairs and converts them to formatted (hopefully human-readable) strings. There are many ways these strings could be used, and it sounds like "reproducible reports" is definitely a use case that this tool could support. You mention Jupyter notebook output. That's definitely something I use it for, so I would say this does resonate with my goals for the tool.

jagerber48 commented 1 year ago

@NickleDave

I've updated the documentation to include my prototypical use case: https://sciform.readthedocs.io/en/stable/examples.html. Here I am doing two visualization tasks. I have x, y data which I am fitting and extracting best fit parameters for. The first visualization task is plotting the data. The second visualization task is displaying the best fit parameters (and their uncertainties from the fit routine) in a table.

sciform helps with the first plotting task by making it relatively straightforward (though admittedly with some helper functions that are not 100% straightforward) to convert the tick labels into SI prefix format.
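As a rough illustration of SI-prefix tick labels, here is a standard-library sketch. It does not use sciform; the name `si_tick_label` and the prefix range covered are assumptions made for this example.

```python
# Prefixes for exponents that are multiples of 3 (a deliberately limited range).
SI_PREFIXES = {
    -12: "p", -9: "n", -6: "µ", -3: "m", 0: "",
    3: "k", 6: "M", 9: "G", 12: "T",
}


def si_tick_label(value: float, pos=None) -> str:
    """Hypothetical helper: format a tick value as e.g. '50 k' instead of '5e4'.

    The unused `pos` argument mirrors the (value, position) signature that
    tick-formatter callbacks conventionally receive.
    """
    if value == 0:
        return "0"
    # Read the decimal exponent from the built-in "e" formatting.
    exponent = int(f"{value:e}".split("e")[1])
    eng_exponent = 3 * (exponent // 3)
    # Clamp to the prefixes defined above.
    eng_exponent = min(max(eng_exponent, -12), 12)
    scaled = value / 10.0**eng_exponent
    return f"{scaled:g} {SI_PREFIXES[eng_exponent]}".rstrip()


print([si_tick_label(v) for v in (1000, 2000, 10000, 50000)])
```

A function with this signature could be wrapped in matplotlib's `matplotlib.ticker.FuncFormatter` and attached to an axis; matplotlib also ships its own `EngFormatter` for a similar purpose.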

sciform helps with the second table task by making it easy to format value/uncertainty pairs together for easy reading and order of magnitude comparison.
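One common convention for value/uncertainty pairs is to round the uncertainty to a fixed number of significant figures, round the value to the same decimal place, and print both with a shared exponent. The sketch below shows that convention with the standard library only. It is not sciform's implementation, and `format_val_unc` is a hypothetical name.

```python
from decimal import Decimal


def format_val_unc(value: float, uncertainty: float, unc_sig_figs: int = 2) -> str:
    """Hypothetical helper: '(value ± uncertainty)e+XX' with a shared exponent."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Exponent of the uncertainty after rounding to unc_sig_figs digits.
    unc_exp = int(f"{uncertainty:.{unc_sig_figs - 1}e}".split("e")[1])
    # Decimal place of the last digit kept, e.g. Decimal('0.01').
    quantum = Decimal(1).scaleb(unc_exp - (unc_sig_figs - 1))
    # Round both numbers to that same decimal place.
    val = Decimal(repr(value)).quantize(quantum)
    unc = Decimal(repr(uncertainty)).quantize(quantum)
    # Share the larger exponent so magnitudes are easy to compare by eye.
    shared_exp = max(val.adjusted(), unc.adjusted())
    val_m = val.scaleb(-shared_exp)
    unc_m = unc.scaleb(-shared_exp)
    return f"({val_m:f} ± {unc_m:f})e{shared_exp:+03d}"


print(format_val_unc(12345.678, 0.1234))
print(format_val_unc(0.0012345, 0.000231, unc_sig_figs=1))
```

With both numbers on one exponent, the digits of the value that are affected by the uncertainty line up directly under it.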

I imagine sciform will typically be used in Python scripts or notebooks after some data analysis has been done, when the user wants to print analysis results to the terminal or notebook output. However, the result could also be saved into some sort of human-readable, text-based report which lives in memory or which is saved to disk.

Instead of using sciform immediately at the conclusion of analysis, users could also use sciform while traversing a non-human-readable data file to generate a rounded, human-readable version or summary of that data file, for example if the data file contains numeric or value/uncertainty type data.

I imagine adding an option to format strings into a "pretty" format using unicode characters and also a "latex" format similar to the uncertainties and other float formatting packages I linked above. Especially the "latex" format will open up more use cases for plotting (matplotlib requires latex for some formatting tasks) and report generation.

jagerber48 commented 1 year ago

@NickleDave I'm curious what next steps are for this. It seems like the package is likely in scope for pyopensci. Does that mean the next step is to actually submit the package and work towards meeting those requirements?

NickleDave commented 1 year ago

Hi @jagerber48 thank you for your patience--we wanted to get input from other community members about whether this package was in scope.

Thank you also for updating the documentation with a use case. That is exactly the kind of concrete example that really helps users understand what you are trying to do for them.

We have decided that, yes, we will proceed with a review.

Please go ahead and make a full submission. Be sure to mention this issue by number when you do so ("as discussed in #114") and please be sure to complete the pre-review survey when you do make the submission. Appreciate it!

Once you have opened that issue referencing this one, I will close this. We will then put out a call for an editor and reviewers.

jagerber48 commented 1 year ago

@NickleDave ok great, thank you for your response! I'll be going on a two-week vacation starting this weekend and I haven't yet had time to make the full submission. I will work on it, as per all your instructions, when I return.

NickleDave commented 1 year ago

Thanks for letting me know @jagerber48 -- no rush. Have a good vacation!

jagerber48 commented 11 months ago

@NickleDave I've made the full submission at https://github.com/pyOpenSci/software-submission/issues/121.

One question before next steps: I have a few high-level and lower-level design questions about the package. Some are about the overall architecture of the code and some are about "should I include this feature or this requirement". I'm curious if these types of questions are in scope for the code review, or if the code review should be thought of as reviewing the quality of the code and giving general advice based on the code at one snapshot in time (at one version number). I may as well mention some specific questions I have here and then you can better inform me about their appropriateness for discussion. These are the questions I have that I'm not sure are in scope for review. I also have some questions that I'm more sure are in scope for review (like should I add more unit tests, how can I improve continuous integration).

NickleDave commented 11 months ago

Hi @jagerber48, happy to help.

These are all good questions to ask yourself as a developer, and I have definitely found myself pondering similar questions before.

However, I can't give you a detailed answer here, because I would feel like I'm starting to review.

In fact, some of these questions start to be about scope, and ideally we should not run a review just for the purpose of figuring out scope. That's something that should be determined ahead of time.

We do want to help you though.
Let's do the following in this case:

A related practice that I find helpful is to keep a "dev diary". I write down questions like this each day I do dev work, and I also prioritize my to-dos. If the same questions or ideas keep popping up, then it helps me know that I really need to prioritize working on them. I also include links to other code, papers, etc., that give me concrete examples--if I can't find anyone else who is doing what I have in mind, then that tells me something.

Hope that's somewhat helpful--I'm only telling you because I wish I had gotten into this practice much sooner, along with using project management tools like GitHub Projects.

Please ask these questions on our forum and let's take it from there. Let's time box that process--say, two weeks max--and then we'll start the review.

jagerber48 commented 11 months ago

@NickleDave thank you very much for the response, that is the sort of stuff I was looking for and is very helpful! The dev diary would definitely be helpful for me and I will look into GitHub Projects. Thank you for these pieces of advice.

I asked my question about the formatting options proliferation here. That is one spot I hope to improve the code. Perhaps this specific question about code organization/repetition is actually in scope for the review process?

After typing out but not posting a new topic on the scope questions (especially the list and arithmetic features), I've decided to take the following approach. I'll start out with the most conservative approach. So the package will be strictly for formatting individual numbers or pairs of numbers with a lot of possible formatting options. No arithmetic, no sequence/array handling and no numpy dependency. The inclusion or exclusion of these features doesn't change the core functionality of the package, and I can structure (and have structured) the code so that these can be added as additional features at any time. So I'll go forward with a review without these features for now. Regarding the sfloat/sDecimal question: right now I just have one class SciNum that doesn't provide arithmetic; it just stores a single number and can format it. Only if I want to support arithmetic in the future will I need to re-address this question.
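To make the "stores a single number and can format it, no arithmetic" design concrete, here is a minimal sketch. The class below is a hypothetical stand-in written for this illustration, not sciform's actual SciNum, and it simply delegates to Python's built-in float format specification rather than a custom mini-language.

```python
class FormattableNumber:
    """Hypothetical stand-in: holds one number and formats it, nothing else."""

    def __init__(self, value: float) -> None:
        self.value = float(value)

    def __format__(self, spec: str) -> str:
        # Assumption: reuse the built-in float format spec; a real package
        # would parse its own mini-language here instead.
        return format(self.value, spec or "g")

    def __repr__(self) -> str:
        return f"FormattableNumber({self.value!r})"


num = FormattableNumber(12345.678)
print(f"{num:.3e}")  # formatting works through __format__

# No arithmetic dunder methods are defined, so this raises TypeError.
try:
    num + 1
except TypeError:
    print("arithmetic is intentionally unsupported")
```

Keeping the class free of arithmetic means features like `__add__` can be added later without changing how existing formatting calls behave.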

The "±" question still stands but is very minor and also doesn't block review. However, it may block releasing version 1.0.0, but I think I can discuss that separately independent of the review.

NickleDave commented 11 months ago

That's perfect, thank you @jagerber48.

The question on the forum is very well stated and I think you will get good feedback.

I think you are exactly right to take a more conservative approach for now. One thing I see happen is that developers get excited about adding new features and solving the related programming problems. There's nothing wrong with that, of course. (It's one of the reasons we like doing this stuff!) But it can take time away from "road-testing" the existing functionality out in the real world. My sense is that you'll get more out of focusing on that for now.

> Perhaps this specific question about code organization/repetition is actually in scope for the review process?

Yes. Let's do the following:

NickleDave commented 10 months ago

I'm going to close this presubmission issue since we have the submission open. Let's continue the discussion there.