Presubmission Inquiry for sciform (float -> string scientific formatting)

Submitting Author: Justin Gerber (@jagerber48)
Package Name: sciform
One-Line Description of Package: Provides extended functionality for formatting floats into strings according to scientific standards
Repository Link (if existing): https://github.com/jagerber48/sciform

Code of Conduct & Commitment to Maintain Package

[X] I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package after should it be accepted.
[X] I have read and will commit to package maintenance after the review as per the pyOpenSci Policies Guidelines.

Description

Include a brief paragraph describing what your package does: sciform is used to convert python float objects into strings according to a variety of user-selected scientific formatting options, including fixed-point notation as well as decimal and binary scientific and engineering notations. Where possible, formatting follows documented standards such as those published by BIPM or IEC. sciform provides certain options, such as engineering notation, well-controlled significant figure rounding, and separator customization, which are not provided by the python built-in format specification mini-language (FSML). In addition, sciform provides functionality for formatting pairs of floats as value +/- uncertainty pairs according to a variety of scientific standards.

Community Partnerships

We partner with communities to support peer review with an additional layer of checks that satisfy community requirements. If your package fits into an existing community please check below:

[ ] Pangeo
- [ ] My package adheres to the Pangeo standards listed in the pyOpenSci peer review guidebook

Scope

Please indicate which category or categories. Check out our package scope page to learn more about our scope. (If you are unsure of which category you fit, we suggest you make a pre-submission inquiry):
- [ ] Data retrieval
- [ ] Data extraction
- [ ] Data processing/munging
- [ ] Data deposition
- [ ] Data validation and testing
- [X] Data visualization
- [ ] Workflow automation
- [ ] Citation management and bibliometrics
- [ ] Scientific software wrappers
- [ ] Database interoperability

Domain Specific & Community Partnerships

- [ ] Geospatial
- [ ] Education
- [ ] Pangeo
- [ ] Unsure/Other (explain below)

Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of: Sciform allows for improved formatting of floats into strings according to scientific standards. These strings will be output to terminals, plots, data documents, text documents, and possibly more. Making the displayed strings more readable as per scientific standards improves the visualization of "printed number" data.

There are no existing community partnerships for this project, though there may be opportunities for education around significant figures and uncertainty.

Who is the target audience and what are the scientific applications of this package?
Any scientist who uses python is in the potential target audience for this package, but especially those who are concerned with displaying data values in a way that is commensurate with the corresponding uncertainties. Most scientists likely use the python built-in string formatting for this purpose, but there are some shortcomings to python built-in formatting. Scientists who seek more formatting features could consider sciform.
Are there other Python packages that accomplish similar things? If so, how does yours differ? Yes, there are similar packages.
1. Python built-in string formatting mini language (https://docs.python.org/3/library/string.html#format-specification-mini-language). sciform includes its own string formatting mini language closely based on the built in one, but with some differences. Notably sciform includes well-controlled significant figure formatting, engineering notation, binary formatting, SI/IEC prefix substitution, digit grouping and decimal symbol options (helpful for a diversity of locales), exponent value coercion, as well as value +/- uncertainty formatting functionality.
2. The uncertainties package (https://pythonhosted.org/uncertainties/). sciform was heavily motivated by this package. This package has sophisticated statistical handling of value +/- uncertainty pairs, handling error propagation and simulation under-the-hood. In addition, it has its own extension of the mini language for formatting value +/- uncertainty pairs. sciform has more formatting functionality than the uncertainties package including, especially, engineering notation, grouping separator controls, and prefix substitution. sciform is also a much lighter weight requirement than the uncertainties package. This may be desirable when a user wants to format strings, but they don't need the rest of the full statistical machinery of the uncertainties package.
3. The prefixed package (https://github.com/Rockhopper-Technologies/prefixed). sciform was also motivated by the prefixed package. This package provides a sort of engineering notation where exponents are rounded to multiples of 3, and then exponents are always replaced with their corresponding SI prefix. The prefixed package is a more conservative extension of the built-in formatting language. sciform includes more functionality, including engineering notation without prefix substitution and more grouping/decimal symbol control. sciform also includes global configuration options for handling optional SI prefixes such as c, d, da, and h.
4. The sigfig package (https://sigfig.readthedocs.io/en/latest/). The sigfig package has similar functionality to sciform, including sig fig rounding, separator control, and value +/- uncertainty formatting, including some features that are only forthcoming in sciform. sigfig does not currently support binary formatting. sigfig also does not provide a format specification mini language for formatting floats. Rather, floats are formatted using an overload of the built-in round function, which I find to be slightly awkward compared to a Formatter object or function.
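As a rough illustration of the gap described in item 1, the sketch below uses only the standard library. The built-in mini-language can round to significant figures, but it has no presentation type that forces the exponent to a multiple of 3. The helper `to_engineering` is a hypothetical name written for this example; it is not sciform's API or implementation.

```python
def to_engineering(value: float, sig_figs: int = 3) -> str:
    """Illustrative helper: engineering notation with sig-fig rounding.

    Hypothetical function for this example only, not part of sciform.
    """
    if value == 0 or value != value or value in (float("inf"), float("-inf")):
        return f"{value}"
    # Let the built-in 'e' format do the significant-figure rounding,
    # then read back the mantissa and the base-10 exponent it chose.
    mantissa_str, exp_str = f"{value:.{sig_figs - 1}e}".split("e")
    exp = int(exp_str)
    eng_exp = 3 * (exp // 3)      # floor the exponent to a multiple of 3
    shift = exp - eng_exp         # 0, 1 or 2 digits move left of the point
    mantissa = float(mantissa_str) * 10**shift
    decimals = max(sig_figs - 1 - shift, 0)
    return f"{mantissa:.{decimals}f}e{eng_exp:+03d}"


# Built-in formatting: 3 significant figures, but the exponent is 5.
builtin = f"{123456.789:.3g}"            # '1.23e+05'
# Sketch: same rounding, exponent forced to a multiple of 3.
engineering = to_engineering(123456.789)  # '123e+03'
```

Reading the exponent back from the `e` format, rather than computing it with a logarithm, means a value such as 999.9 that rounds up to 1.00e+03 lands in the correct exponent bucket.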
Any other questions or issues we should be aware of: Much of the code is still a work in progress. I'm still working on documenting the existing features, more unit tests are necessary for existing features, and the value +/- uncertainty features are still young and not thoroughly tested. I have important ideas in mind for more value +/- uncertainty formatting features. But I would say the core of the package is in place. One glaring gap for this package is support for Decimal numbers rather than float numbers. I would like to add that functionality after the functionality for formatting floats is stable.

This package is very new and has 1 user so far: me. But I've been kicking around code for this sort of formatting for quite some time now and think many others would find it useful. Having a small authoritative package for this sort of formatting could be useful for the scientific community. There is also some interest in getting some of these features into the python built-in string formatting feature set, which would be very useful. Having a package like this could be a stepping stone towards that. See https://discuss.python.org/t/new-format-specifiers-for-string-formatting-of-floats-with-si-and-iec-prefixes/26914/46. Though I do note that sciform's format specification mini language is intentionally not 100% backwards compatible with the built-in format specification mini language, so it would not be a top candidate for that role.

I'm also not very experienced when it comes to contributing to open source software. This is one of my first forays into that world, so I am learning as I go.

P.S. Have feedback/comments about our review process? Leave a comment here

Welcome @jagerber48 and thanks so much for your detailed pre-submission inquiry.

After a first read-through, sciform looks like it could be in scope for a pyOpenSci review, but we'd like to ask you for a little more information.

I see above you marked the package as pangeo affiliated -- can you confirm whether that's the case, or was it a mistake? Maybe something about our form wasn't clear there?
The goals for sciform seem clear, but can you please say more about who the target audience is, and what the use cases are? We very much welcome newer packages for review, so it's fine if you feel like you are currently the only user, but who do you intend to use it more broadly? Along the same lines: the docs have a lot of great detail about how to use sciform, but I'm not finding a lot about who would use it, where and when. Should I use it to format papers or reports? Should I use it to make numerical experiments more replicable? I have a feeling you already have some use cases in mind, but maybe you are so focused on development right now that you just haven't written a lot about those yet. Some vignettes in the documentation with walkthroughs of use cases would really help, something like "here's how you'd use sciform to do x". Please let us know more about who you're developing sciform for and when you see them using it.

Hello @NickleDave, thanks for your response!

Yes, the Pangeo affiliation was a misunderstanding on my part. I thought pyopensci had a requirement to be conformant with Pangeo in addition to pyopensci, so I was indicating willingness to conform there as well. I've unchecked that box.
Good question. I will think about this but here are my off-the-cuff thoughts.
- My main use case for this sort of formatting is when I am printing data analysis results to a terminal (in some console program) or in a jupyter notebook. I'm trying to view data analysis results part way through a workflow or at the end of the workflow and I would like to see them, e.g. in engineering notation rounded to 3 significant figures. For this case the main "consumers" of sciform outputs are python print or logger functions. Sometimes I make a table using the tabulate package, fill it with formatted strings, and print that to the terminal.
- The next important use case would be integrating these formatted strings into plots in a variety of ways. One very nice example would be to plot tick labels using prefix notation, e.g., instead of having "1e3" out at the edge of the axis, you would have tick labels like 1 k, 2 k, 10 k, 50 k, etc. Of course, including strings like this in annotations could be useful as well.
- The uncertainties package provides the ability to translate formatted value +/- uncertainty strings into latex code, but I need to understand what the use cases are there. I'm not advanced enough in latex to understand how someone would integrate python data analysis directly into latex source code for a document. I would just copy the numbers into latex manually and round/format them as needed, but maybe there's a more sophisticated workflow I'm missing. edit: I think the Latex mode might be useful for putting strings with superscripts and pretty +/- symbols into matplotlib graphs.
- There might be use cases for writing formatted data into e.g. csv or other types of files for storing data. But I would suggest that if you are storing numbers you should store them appropriately as ints or floats, not strings.

I can include the tabulate and matplotlib use cases in the documentation. I think those would be illustrative use cases people could look at.
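The tick-label idea from the list above can be sketched with the standard library alone. `si_label` and `SI_PREFIXES` are hypothetical names invented for this example, and the prefix table is deliberately truncated; this is not how sciform does prefix substitution. The `(x, pos)` signature is an assumption chosen so that the function could be wrapped in matplotlib's `FuncFormatter`, which calls its function with a tick value and a position.

```python
# Truncated, illustrative prefix table (exponent -> SI prefix symbol).
SI_PREFIXES = {-12: "p", -9: "n", -6: "µ", -3: "m", 0: "",
               3: "k", 6: "M", 9: "G", 12: "T"}


def si_label(x: float, pos=None) -> str:
    """Return a short label such as '50 k' for a tick value.

    Hypothetical helper for illustration; `pos` is unused.
    """
    if x == 0:
        return "0"
    # Read the base-10 exponent from the 'e' format to avoid log10 edge cases.
    exp10 = int(f"{abs(x):e}".split("e")[1])
    exp = 3 * (exp10 // 3)            # engineering exponent
    exp = min(max(exp, -12), 12)      # stay inside the prefix table
    mantissa = x / 10**exp
    return f"{mantissa:g} {SI_PREFIXES[exp]}".rstrip()


labels = [si_label(v) for v in (1000, 2000, 10000, 50000)]
```

With these inputs `labels` holds the strings `1 k`, `2 k`, `10 k` and `50 k`, matching the tick labels described above.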

hey @jagerber48 !! 👋 welcome! I just had a question not related to this specific review. What on that form would make it more clear that pangeo is an optional thing? We have an affiliated partner program and that check just allows someone to ALSO become pangeo affiliated. But it's not a requirement. How could we make that more clear, as you are not the first person to be confused by that!!

Also i'm wondering then if this tool would really be a support tool for reproducible reports (which is important to our open science goals)? If it's really about printing and output. Does that type of application (reproducible reports/ jupyter notebook output, etc). resonate with your goals for the tool?

@lwasser Thank you for the welcome and your questions/comments!

About the Pangeo option, from my perspective: something like "You may optionally choose to affiliate your package with additional communities by checking the boxes below. These affiliations may come with XYZ benefits/additional requirements" would help. Even just an "(optional)" flag may have cleared it up for me. "If your package fits into an existing community please check below:" is a challenging sentence because I don't know what these communities are and I didn't want to learn it at the time. So yeah, replacing this with something like "If you would like to affiliate your package with an existing community, please check below" would have helped me, I think.

Also i'm wondering then if this tool would really be a support tool for reproducible reports (which is important to our open science goals)? If it's really about printing and output. Does that type of application (reproducible reports/ jupyter notebook output, etc). resonate with your goals for the tool?

"would really be a support tool for reproducible reports". What are "reproducible reports"? The tool takes python floats or float pairs and converts them to formatted (hopefully human readable) strings. There are many ways these strings could be used, it sounds like "reproducible reports" is definitely a use case that this tool could support. You mention Jupyter notebook output, that's definitely something I use it for, so I would say this does resonate with my goals for the tool.

@NickleDave

I've updated the documentation to include my prototypical use case: https://sciform.readthedocs.io/en/stable/examples.html. Here I am doing two visualization tasks. I have x, y data which I am fitting and extracting best fit parameters for. The first visualization task is plotting the data. The second visualization task is displaying the best fit parameters (and their uncertainties from the fit routine) in a table.

sciform helps with the first plotting task by making it relatively straightforward (though with some admittedly not 100% straightforward helper functions) to convert the tick labels into SI prefix format.

sciform helps with the second table task by making it easy to format value/uncertainty pairs together for easy reading and order of magnitude comparison.
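One common convention for the value/uncertainty table task is to round the uncertainty to a fixed number of significant figures and then round the value to the same decimal place. The sketch below shows that convention with the standard library only; `format_val_unc` is a hypothetical name, and the rounding rule is an assumption made for this example rather than a statement of what sciform does by default.

```python
def format_val_unc(value: float, uncertainty: float, unc_sig_figs: int = 2) -> str:
    """Round uncertainty to `unc_sig_figs` and value to the same decimal place.

    Hypothetical helper for illustration; not sciform's API.
    """
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Exponent of the uncertainty after rounding to the requested sig figs.
    unc_exp = int(f"{uncertainty:.{unc_sig_figs - 1}e}".split("e")[1])
    # Number of digits after the decimal point shared by both numbers.
    decimals = unc_sig_figs - 1 - unc_exp
    if decimals >= 0:
        return f"{value:.{decimals}f} +/- {uncertainty:.{decimals}f}"
    # Negative `decimals` means rounding to tens, hundreds, and so on.
    return f"{round(value, decimals):.0f} +/- {round(uncertainty, decimals):.0f}"


row = format_val_unc(123.456, 0.789)  # '123.46 +/- 0.79'
```

Because both numbers end at the same decimal place, a reader can compare the order of magnitude of the value and its uncertainty at a glance.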

I imagine sciform will typically be used in python scripts or notebooks after some data analysis has been done, when the user wants to print analysis results to the terminal or notebook output. However, the result could also be saved into some sort of human-readable, text-based report which lives in memory or which is saved to the disk.

Instead of using sciform immediately at the conclusion of analysis, users could also use sciform while traversing a non-human-readable data file to generate a rounded, human-readable version or summary of that data file, for example if the data file contains numeric or value/uncertainty type data.

I imagine adding an option to format strings into a "pretty" format using unicode characters and also a "latex" format similar to the uncertainties and other float formatting packages I linked above. Especially the "latex" format will open up more use cases for plotting (matplotlib requires latex for some formatting tasks) and report generation.

@NickleDave I'm curious what next steps are for this. It seems like the package is likely in scope for pyopensci. Does that mean the next step is to actually submit the package and work towards meeting those requirements?

Hi @jagerber48 thank you for your patience--we wanted to get input from other community members about whether this package was in scope.

Thank you also for updating the documentation with a use case. That is exactly the kind of concrete example that really helps users understand what you are trying to do for them.

We have decided that, yes, we will proceed with a review.

Please go ahead and make a full submission. Be sure to mention this issue by number when you do so ("as discussed in #114") and please be sure to complete the pre-review survey when you do make the submission. Appreciate it!

Once you have opened that issue referencing this one, I will close this. We will then put out a call for an editor and reviewers.

@NickleDave ok great, thank you for your response! I'll be going on a two week vacation starting this weekend and I haven't had time to make the full submission yet. I will work on it, as per all your instructions, when I return.

Thanks for letting me know @jagerber48 -- no rush. Have a good vacation!

@NickleDave I've made the full submission at https://github.com/pyOpenSci/software-submission/issues/121.

One question before next steps: I have a few high-level and lower-level design questions about the package. Some are about the overall architecture of the code and some are about "should I include this feature or this requirement". I'm curious if these types of questions are in scope for the code review, or if the code review should be thought of as reviewing the quality of the code and giving general advice based on the code at one snapshot in time (at one version number). I may as well mention some specific questions I have here and then you can better inform me about their appropriateness for discussion. These are the questions I have that I'm not sure are in scope for review. I also have some questions that I'm more sure are in scope for review (like should I add more unit tests, how can I improve continuous integration).

All of the formatting options are captured as data fields on the FormatOptions object. However, these options need to be repeated in full many times throughout the code in function signatures and bodies for a few reasons. What can be done to mitigate this repetition? Specifically, this repetition means that a lot of (somewhat error-prone) work needs to be done if I ever want to add new options.
How broad should the scope of the package be? Should it JUST be responsible for formatting individual numbers or pairs of numbers? Or should it support some obvious possible utility functions such as (1) performing arithmetical operations on sfloat (sciform formattable float objects) or (2) formatting lists of numbers? Should sciform get into more involved formatting involving units?
- For example, one nice way to map the sciform functions over sequences or arrays is using np.vectorize. But is it worth making sciform depend on numpy for this?
If an sfloat class is supported should an sDecimal class be supported?
Right now the default plus/minus value/uncertainty separator is "+/-". But given unicode is prolific now, should the default be "±"?
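For readers unfamiliar with the repetition problem in the first question, one general pattern (offered here only as a sketch, not as an answer to the question or a description of sciform's code) is to pass a single immutable options object through every function instead of repeating each option in every signature. The `Options` class and the two functions below are hypothetical stand-ins invented for this example; they are not sciform's `FormatOptions`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Options:
    """Hypothetical options container; not sciform's FormatOptions."""
    sig_figs: int = 3
    decimal_separator: str = "."
    val_unc_separator: str = "+/-"


def format_number(value: float, options: Options = Options()) -> str:
    # Only the options object appears in the signature, so adding a new
    # field to Options does not require touching this signature.
    text = f"{value:.{options.sig_figs - 1}e}"
    return text.replace(".", options.decimal_separator)


def format_pair(value: float, uncertainty: float,
                options: Options = Options()) -> str:
    left = format_number(value, options)
    right = format_number(uncertainty, options)
    return f"{left} {options.val_unc_separator} {right}"


# Derive a variant by overriding selected fields only.
comma_options = replace(Options(), decimal_separator=",", val_unc_separator="±")
pair = format_pair(1234.5, 12.0, comma_options)  # '1,23e+03 ± 1,20e+01'
```

The same sketch touches the last question: with a single options object, switching the default separator from "+/-" to "±" is a one-field change.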

Hi @jagerber48, happy to help.

These are all good questions to ask yourself as a developer, and I have definitely found myself pondering similar questions before.

However, I can't give you a detailed answer here, because I would feel like I'm starting to review.

In fact, some of these questions start to be about scope, and ideally we should not run a review just for the purpose of figuring out scope. That's something that should be determined ahead of time.

We do want to help you though.
Let's do the following in this case:

We'll hold off on starting the review for now. We will still review, but let's make sure you have answers to at least some of these questions first.
Please ask your questions above in our discourse forum, using the coding help category. You can ask however you feel is appropriate, but my guess is that you'll get the most out of it if you identify the questions that are most crucial to figure out, and create a separate "topic" (post) for each one. Someone who is unfamiliar with the package will need a lot of context just to understand one question, so you are more likely to get a helpful answer if you can present each question clearly, with a concise title and as much context as possible.

A related practice that I find helpful is to keep a "dev diary". I write down questions like this each day I do dev work, and I also prioritize my to-dos. If the same questions or ideas keep popping up, then it helps me know that I really need to prioritize working on them. I also include links to other code, papers, etc., that give me concrete examples--if I can't find anyone else who is doing what I have in mind, then that tells me something.

Hope that's somewhat helpful--I'm only telling you because I wish I had gotten into this practice much sooner, along with using project management tools like GitHub Projects.

Please ask these questions on our forum and let's take it from there. Let's time box that process--say, two weeks max--and then we'll start the review.

@NickleDave thank you very much for the response, that is the sort of stuff I was looking for and is very helpful! The dev diary would definitely be helpful for me and I will look into GitHub projects. thank you for these pieces of advice.

I asked my question about the formatting options proliferation here. That is one spot I hope to improve the code. Perhaps this specific question about code organization/repetition is actually in scope for the review process?

After typing out but not posting a new topic on the scope questions (especially the list and arithmetic features), I've decided to take the following approach. I'll start out with the most conservative approach. So the package will be strictly for formatting individual numbers or pairs of numbers with a lot of possible formatting options. No arithmetic, no sequence/array handling and no numpy dependency. The inclusion or exclusion of these features doesn't change the core functionality of the package, and I can structure (and have structured) the code so that these can be added as additional features at any time. So I'll go forward with a review without these features for now. Regarding the sfloat/sDecimal question: right now I just have one class, SciNum, that doesn't provide arithmetic; it just stores a single number and can format it. Only if I want to support arithmetic in the future will I need to re-address this question.

The "±" question still stands but is very minor and also doesn't block review. However, it may block releasing version 1.0.0, but I think I can discuss that separately independent of the review.

That's perfect, thank you @jagerber48.

The question on the forum is very well stated and I think you will get good feedback.

I think you are exactly right to take a more conservative approach for now. One thing I see happen is that developers get excited about adding new features and solving the related programming problems. There's nothing wrong with that, of course. (It's one of the reasons we like doing this stuff!) But it can take time away from "road-testing" the existing functionality out in the real world. My sense is that you'll get more out of focusing on that for now.

Perhaps this specific question about code organization/repetition is actually in scope for the review process?

Yes. Let's do the following:

[ ] Please reply to my comment on the submission, saying you have come to a decision on most of the questions except the one you posted in the forum
[ ] In that same comment, link to the question in the forum, and say you would like reviewers to take that into consideration as they do the review
[ ] If you make changes based on feedback in the forum, you can publish a release candidate so that reviewers have access to these changes during the review without you releasing a new full version before we finish

I'm going to close this presubmission issue since we have the submission open. Let's continue discussion there
