toml-lang / toml

Tom's Obvious, Minimal Language
https://toml.io
MIT License

Add nicer syntax for file sizes #912

Open JakobDev opened 2 years ago

JakobDev commented 2 years ago

It would be nice to have file sizes in TOML.

test = 1MB # Results in 1_000 * 1_000 = 1_000_000
test2 = 1MiB # Results in 1_024 * 1_024 = 1_048_576

Things like max Filesize etc. are used in many configurations and it should be easy to implement for the different parsers.
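
For readers who want the arithmetic spelled out, here is a minimal Python sketch of what the two examples would evaluate to. The helper name `size_to_bytes` and its two-entry table are assumptions made for illustration; no TOML parser exposes such a function.

```python
# Hypothetical helper: the name and the table are illustrative only.
MULTIPLIERS = {
    "MB": 1000 * 1000,   # decimal megabyte, as in `test = 1MB`
    "MiB": 1024 * 1024,  # binary mebibyte, as in `test2 = 1MiB`
}


def size_to_bytes(count: int, suffix: str) -> int:
    """Return the number of bytes for `count` units of `suffix`."""
    return count * MULTIPLIERS[suffix]


print(size_to_bytes(1, "MB"))   # 1000000
print(size_to_bytes(1, "MiB"))  # 1048576
```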

eksortso commented 2 years ago

We would not introduce a special new type for file sizes, when integers suffice.

That said, we did talk about adding many different multiplier suffixes for integers and floats back in #427. We considered Mi instead of MiB, for instance. Ultimately, the proposal was closed four months ago. It was ambitious, but ambiguities appeared, and it was deemed too confusing in the context of, as it turns out, file sizes. Perhaps if it was limited in scope, it would pass muster, no offense intended to @JeppeKlitgaard.

Personally, I recoil at the thought of using a comment in place of a well-established suffix with a clear meaning. With all due respect to @pradyunsg, there are real problems with doing this, and comments can be deceptive.

size1a = 10_300_000  # 10.3 MB, idk
size1b = 10_800_333  # 10.3 MB, idk
size1c = 10_547_000  # 10.3 MB, idk
size1z = 20e+06      # 10.3 MB, idk+idc

I'd be open to hear a new proposal if it could garner a lot more support than the old proposal did. Make a suggestion or a PR, and rally your colleagues to show their support and spread the word. But maximize utility, expressiveness, and simplicity to make your case.

marzer commented 2 years ago

My main issue with this idea is the distinction between MB and MiB. Personally I find the MiB syntax detestable and always treat filesizes as the sensible-in-technical-domains pow2 form. I know I'm not alone here.

About the only time the (completely insane) base-10 form is acceptable when talking about file sizes is in the (unfortunately) already-established practice of hardware manufacturers trying to make their shit seem more powerful/bigger than it is (which I'm pretty sure is the root cause of the problem to begin with).

Plus, what if someone wanted to write mb, mB et cetera? I've seen all these in the wild, and they're all going to seem perfectly valid to lots of people.

Note that it's only file sizes specifically that suffer from these dumb problems. The more general suffixes proposed in #427 would have avoided all that baggage.

eksortso commented 2 years ago

@marzer wrote:

Plus, what if someone wanted to write mb, mB et cetera? I've seen all these in the wild, and they're all going to seem perfectly valid to lots of people.

ABNF uses case-insensitive matching strings by default. We'd pick up all those variants immediately just by specifying "MB".
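
A rough Python analogue of that ABNF behaviour, using a case-insensitive regular expression as a stand-in for a rule such as `"MB"`. The pattern is a toy assumption and is not taken from the TOML grammar.

```python
import re

# Assumption: a toy pattern standing in for an ABNF string literal "MB".
# ABNF string literals match case-insensitively; re.IGNORECASE mimics that.
SUFFIXED = re.compile(r"(\d+)(MB)", re.IGNORECASE)


def matches(text: str) -> bool:
    """True if `text` is digits followed by MB in any letter case."""
    return SUFFIXED.fullmatch(text) is not None


for spelling in ("10MB", "10mb", "10mB", "10Mb"):
    print(spelling, matches(spelling))
```

All four spellings match, while a different suffix such as `10MiB` does not.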

marzer commented 2 years ago

Ah, that's true. Guess that eliminates half of my qualms above.

eksortso commented 1 year ago

@JakobDev Could you give this issue a follow-up? It's been five months since your suggestion, and it prompted some discussion on that day. But without some additional interest, continued discussion, or a PR with concrete changes proposed, this issue may be closed like the one that came before it.

JakobDev commented 1 year ago

I'm still interested in this, but I can't contribute much to the discussion here.

tintin10q commented 1 year ago

I would like to add that if you would deserialize file sizes in programming languages then what type should it be converted to? Most programming languages do not have a type for file sizes. This brings in a lot of ambiguity because there are many valid interpretations. Do you parse 4kb to a number of bytes (4000 or 4096)? Do you parse to a string '4kb'? Maybe the programming language does have a native file size type so you parse to that? This leads to differences between parsers from languages which is bad and not simple or obvious.

JakobDev commented 1 year ago

It should be parsed to the number of bytes. The functions that get the file size in programming languages usually return the number of bytes.

E.g., if you have a website that allows uploading files, you can set the max allowed file size:

[Upload]
max-size = 10MB

Pseudo python code:

if os.path.getsize(path) > config["Upload"]["max-size"]:
    show_error("Your file is too large")

if len(file_bytes) > config["Upload"]["max-size"]:
    show_error("Your file is too large")

If you want to calculate if a file is bigger/smaller than the given size, which will be the most common use case, it needs to be a number.

It should be a long to support large sizes such as 1TB, if someone needs that.

Do you parse to a string '4kb'?

If this is wanted, you can just use

[Upload]
max-size = "4kb"

Maybe the programming language does have a native file size type so you parse to that?

I don't know if there is any language out there that has one, and I don't know why such a type would be needed, but in that case, use the native file size type of the language.

eksortso commented 1 year ago

File sizes are integers. There are no fractional bytes out there. Not touching qubits or anything!

The strangulation point is whether kilobytes are 1000 bytes or 1024 bytes. Like it or not, it's that ambiguity in storage quantities that ended #427. We haven't allowed 4k or 4Ki because nobody knows what a kibibyte is, outside of our rarified circle, or which one to use in which cases. And if we knew, we could just write 4_000 or 4_096 and be done with it.

The reason we don't allow something like 4kb is that it is not obvious. And as long as we can write 10_300_000_000 to represent 10.3 billion, then we can stay minimal and not introduce unnecessary features. I know they'd be easy for parser writers to implement. But frankly, a user writing something like max-size-kb = 4 when they know what they're dealing with makes more sense in the end.

It really pains me that we can't have SI suffixes for integers, but it's those stupid ambiguous file sizes that make our lives difficult!

rmunn commented 1 year ago

New idea for solving the file-size ambiguity: K means 1000, KB means 1024. M means 1000000 (one million), MB means 1048576 (1024*1024). And so on.

Rationale: if https://github.com/toml-lang/toml/issues/427 was ultimately rejected because, as @eksortso says, "nobody knows what a kibibyte is, outside of our rarified circle", then let's change the syntax. A suffix without a trailing B is a kilo/mega/etc as used in scientific notation (powers of 10), whereas a suffix with a trailing B means "kilobyte/megabyte" with the traditional powers-of-2 meaning thereof (1024, 1048576, and so on).

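| Suffix | Meaning | Suffix | Meaning |
|--------|---------|--------|---------|
| K | 1000 | KB/KiB | 1024 |
| M | 1000² | MB/MiB | 1024² |
| G | 1000³ | GB/GiB | 1024³ |
| T | 1000⁴ | TB/TiB | 1024⁴ |
| P | 1000⁵ | PB/PiB | 1024⁵ |
| E | 1000⁶ | EB/EiB | 1024⁶ |
| Z | 1000⁷ | ZB/ZiB | 1024⁷ |
| Y | 1000⁸ | YB/YiB | 1024⁸ |
| R | 1000⁹ | RB/RiB | 1024⁹ |
| Q | 1000¹⁰ | QB/QiB | 1024¹⁰ |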

My reasoning is that when people are working with bytes, powers of 1024 are what they expect. Even in technical circles, nobody actually says the word "kibibyte" out loud: we all talk about kilobytes even though we really do mean 1024 bytes, and nobody thinks this is ambiguous. In writing, we might write "kibibytes", but only in technical contexts where precision matters more than clear communication; most of the time if you see someone write "kilobytes" in an article, you expect that it means 1024 bytes unless the author specifically clarifies that it means 1000 bytes.

So why not allow KB, MB, etc. to default to the power-of-2 meaning that everyone expects? The K, M, etc. suffixes will remain powers of 10, to allow watts = 1.21G. But KB is 1024, MB is 1048576, and so on, because in a configuration file, that's what nearly everyone will expect.
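
The suffix table in this proposal can be sketched in a few lines of Python. The function name `multiplier` is hypothetical, and treating suffixes case-insensitively is an assumption borrowed from the ABNF remark earlier in the thread, not something this proposal states.

```python
# Hypothetical sketch of the proposed suffix table.
PREFIXES = "KMGTPEZYRQ"  # K is the 1st power, M the 2nd, ... Q the 10th


def multiplier(suffix: str) -> int:
    """Bare K, M, ... are powers of 1000; KB/KiB, MB/MiB, ... are powers of 1024."""
    letter, rest = suffix[0].upper(), suffix[1:].upper()
    power = PREFIXES.index(letter) + 1  # ValueError for an unknown prefix
    if rest == "":
        return 1000 ** power
    if rest in ("B", "IB"):
        return 1024 ** power
    raise ValueError(f"unknown suffix: {suffix!r}")


print(multiplier("K"), multiplier("KB"))  # 1000 1024
print(multiplier("M"), multiplier("MB"))  # 1000000 1048576
```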

Further parts of my proposal:

Floating-point values like 2.5MB could be treated in one of five ways:

I'm coming around to what @eksortso suggested here, which is to parse such values as floats, because it's impossible to tell which of ceiling or truncation rounding would be right for any given application. Also, one other way to handle values like 2.5MB would be to parse them as ints if unambiguous (2.5MB), but leave as floats if not a whole int (10.3MB). I believe this would be a very BAD idea, so I didn't include it in the list above, but it's worth at least mentioning if only to immediately reject the idea out of hand.

rmunn commented 1 year ago

One more consideration: the E ("exa", 1000⁶) potentially conflicts with the E used in scientific notation. Adding this would make parsers more difficult to write, as encountering an E at the end of a number now has two possibilities:

  1. It's followed by a digit or a + or -, so it's part of a floating-point value written in scientific notation
  2. It's the end of the number, or it's followed by B or iB, so it's part of a suffix

Both can be parsed unambiguously, but it makes parsers slightly harder to write. We would also need to add a rule that you cannot use scientific notation and numeric suffixes together. No writing 1.0e3K to represent a thousand thousand (1 million), as that's just unnecessary cruelty to parser writers.
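
To make the two cases concrete, here is a sketch of how a tokenizer might tell them apart. The grammar is deliberately simplified and is not TOML's real ABNF; the `classify` function and the uppercase-only suffix set are assumptions for illustration.

```python
import re

# Assumption: a simplified number grammar, not the real TOML ABNF.
NUMBER = re.compile(
    r"""
    (?P<mantissa>[+-]?\d+(?:\.\d+)?)
    (?:
        [eE](?P<exponent>[+-]?\d+)          # case 1: scientific notation
      | (?P<suffix>[KMGTPEZYRQ](?:i?B)?)    # case 2: multiplier suffix
    )?
    """,
    re.VERBOSE,
)


def classify(text: str) -> str:
    """Label a token; mixing an exponent with a suffix is rejected."""
    m = NUMBER.fullmatch(text)
    if m is None:
        return "invalid"
    if m.group("exponent") is not None:
        return "float with exponent"
    if m.group("suffix") is not None:
        return "number with suffix"
    return "plain number"


for token in ("1E3", "1E", "1EB", "1EiB", "1.0e3K"):
    print(token, "->", classify(token))
```

Because the exponent and the suffix are alternatives in the pattern, `1.0e3K` fails to match at all, which is the "no mixing" rule described above.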

And another consideration: once you get up into the exabyte range, you're approaching the limits of 64-bit integers. TOML says that the values -2^63...2^63-1 (the natural range of signed 64-bit ints) must be accepted. But 8EB is 2^63, so any value of 8EB or greater would be too large to fit into a 64-bit int. My suggestion for handling this issue is the following:

  1. Values strictly less than 8EB are treated as ints or floats as mentioned above (ints if no decimal point present, floats if decimal point is present).
  2. Integer values of 8EB or higher should be represented as platform-dependent "big integer" types (called BigInt / BigInteger in most languages) if such types are available.
  3. On platforms that don't have a BigInt / BigInteger data type, integer values of 8EB or higher should be represented as floats. Any loss of precision involved in representing a large int as a float is unlikely, because the most common scenario is going to be a small mantissa and a large exponent (e.g., 3ZB would be 3×1024⁷ = 3×2⁷⁰, which can be precisely represented as a float).
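
The arithmetic behind these three points can be checked directly. This is only a sanity check in Python, whose integers are arbitrary precision; `fits_in_int64` is a hypothetical helper name.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EB = 1024**6  # binary "EB" as defined in the proposal above
ZB = 1024**7


def fits_in_int64(value: int) -> bool:
    """True if `value` is inside the range TOML requires parsers to accept."""
    return INT64_MIN <= value <= INT64_MAX


print(8 * EB == 2**63)        # True: 8EB is exactly one past INT64_MAX
print(fits_in_int64(7 * EB))  # True
print(fits_in_int64(8 * EB))  # False

# A small mantissa times a power of two survives the round trip through a float...
print(int(float(3 * ZB)) == 3 * ZB)        # True
# ...but an arbitrary large integer does not.
print(int(float(2**63 + 1)) == 2**63 + 1)  # False
```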

tintin10q commented 1 year ago

I just don't think this should be added. It is not obvious.

As was said before just do something like max-size-kb = 4.

And another big problem with this is automatically writing a config file from something like a python dict. How would the writer know that it has to write a number back to something ending with kb? It wouldn't, so then you would get the number of bytes. So this would only add something for reading files, but then as soon as you write to the file it's gone.

Just don't make it more complicated.

eksortso commented 1 year ago

Personally, I would not want binary suffixes for file sizes (KB, MB, etc.) to yield anything but integers, unlike the old ki, Mi, etc. suffixes discussed in #427. But at this point, we really haven't moved at all past where we ended up in #427. Multiplier suffixes are still no more useful, expressive, or elegant. And even if we chose rules to make things more obvious, people may still wonder why we made the choices that we made.

So let me suggest we scale back. What if we imposed the following?

Outside the standard, we would need to encourage designers to use multipliers only in fields where precise values can range by several factors, from single digits to quadrillions. And to use floats, like gigawatts = 1.21 or watts = 1.21e+9, where integer precision isn't necessary.

Is this useful enough to warrant its inclusion? Does it help users to express certain values more elegantly? Or is it still confusing or not obvious, even after the rules are laid down?

eksortso commented 1 year ago

@tintin10q wrote:

And another big problem with this is automatically writing a config file from something like a python dict. How would the writer know that it has to write a number back to something ending with kb

That's a problem to be solved by the emitters, not by the standard. And it already exists; would we write a thousand like 1_000 or just 1000? Let the TOML writers figure that out.
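
One policy an emitter could adopt, sketched in Python: write the largest suffix that divides the value exactly, and fall back to a plain integer otherwise. The function name `emit_size` and the KB = 1024 meaning (taken from the proposal earlier in this thread) are assumptions, and this policy cannot know which spelling the user originally wrote.

```python
# Hypothetical emitter policy; suffix meanings follow the KB = 1024 proposal.
SUFFIXES = [("GB", 1024**3), ("MB", 1024**2), ("KB", 1024)]


def emit_size(n: int) -> str:
    """Render `n` bytes with the largest suffix that divides it exactly."""
    for suffix, factor in SUFFIXES:
        if n != 0 and n % factor == 0:
            return f"{n // factor}{suffix}"
    return str(n)  # no exact fit: emit the plain integer


print(emit_size(1048576))  # 1MB
print(emit_size(4096))     # 4KB
print(emit_size(1000))     # 1000
```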

pradyunsg commented 1 year ago

IIUC, the current proposal is:

file-size = 1M

is translated to:

{"file-size": 1000000}

and

file-size = 1MB

is translated to:

{"file-size": 1048576}  # 1024*1024

I don't see why it isn't OK for an application to instead allow "1MB" as a string value, and parse it. Parsing these values isn't particularly complex IMO and allows for richer error messages in that specific context to come from the application itself; and potentially richer syntax ("3.25 GB" can resolve to mean one thing, if the meaning is implementation-defined -- which an application can do).
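
A sketch of what such application-side parsing could look like in Python. Everything here is an assumption on the application's part: the name `parse_size`, the unit table with KB = 1024, and the choice to truncate fractional results to whole bytes.

```python
import re

# Assumption: the application, not TOML, decides what each unit means.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*")


def parse_size(text: str) -> int:
    """Parse a string such as "1MB" or "3.25 GB" into a number of bytes."""
    m = PATTERN.fullmatch(text)
    if m is None:
        raise ValueError(f"{text!r} is not a size; expected something like '1MB'")
    number, unit = m.group(1), m.group(2).upper()
    if unit not in UNITS:
        raise ValueError(f"unknown unit {unit!r} in {text!r}; use one of {sorted(UNITS)}")
    return int(float(number) * UNITS[unit])  # truncates any fractional byte


print(parse_size("1MB"))      # 1048576
print(parse_size("3.25 GB"))  # 3489660928
```

The error messages name the offending value and the accepted units, which is the kind of context-specific feedback a generic TOML parser could not give.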


Quoting from https://github.com/toml-lang/toml/issues/427#issuecomment-1059381423:

I also think that this can be confusing in certain other contexts and that outweighs the usefulness here IMO.

As a crafted example, the proposal would have the following be valid:

[observable-universe]
number-of-galaxies = { min-estimate = 100G, max-estimate = 2T }

Those are going to result in "correct" values being serialised, but... I don't like that the format would allow doing things like this -- it isn't clearer.


While I agree that it is somewhat common to have a need for file sizes, I don't think it's common enough to justify adding this sort of thing -- and it definitely is not worth adding a dedicated type for it.

(thanks @rmunn for flagging that I goofed up on the examples here)

pradyunsg commented 1 year ago

(retitled to better reflect the underlying request here)

rmunn commented 1 year ago

IIUC, the current proposal is:

file-size = 1MB

is translated to:

{"file-size": 1000000}

Not quite. The current proposal is that 1M would translate to 1_000_000, but 1MB would translate to 1_048_576 (i.e., 1024×1024).

pradyunsg commented 1 year ago

Whoops, indeed. Thanks for flagging that -- I've edited my comment to fix that, since it doesn't change the fundamental argument I'm making in it. 😅

tintin10q commented 1 year ago

@eksortso wrote:

That's a problem to be solved by the emitters, not by the standard. And it already exists; would we write a thousand like 1_000 or just 1000? Let the TOML writers figure that out.

The fact that the problem already exists doesn't make it ok to make it worse.

While I agree that it is somewhat common to have a need for file sizes, I don't think it's common enough to justify adding this sort of thing -- and it definitely is not worth adding a dedicated type for it.

I really agree with this. Do not make it more complicated. Normal numbers work just fine. Or just put the number of bytes in the tag name.

file-size-mb = 1

eksortso commented 1 year ago

Absolutely no one wants another type for file sizes. Among the proposals discussed, we are no doubt sticking with integers, even if we use floats to get to them.

tintin10q commented 1 year ago

Alright, let's close it then.

eksortso commented 1 year ago

Well, for all the times I've requested feedback on this feature, nobody's wanted to speak up for it. Maybe KB's, MB's and GB's are useful and worth supporting across the sphere of configuration space. But if nobody's enthusiastic enough to raise their voice for it, there's no point in dwelling on it any longer.

rmunn commented 1 year ago

I want to speak up for this feature. I think the following "solution" is absolutely awful:

file-size-mb = 1

From the point of view of the TOML consumer, the application using this as a config file, this is terrible DX. Instead of just looking for one key called file-size, I (the developer creating this theoretical application) have to look for keys named file-size-kb, file-size-mb, file-size-gb, and so on. Then I have to decide what to do if the user specifies two of them: which one wins?

All of which is absolutely avoidable by simply having a single key file-size and not putting the suffix in the key name. Instead, the only viable way to allow suffixes currently is to use strings:

file-size = "1 mb"

Which then delegates responsibility for parsing the string to the TOML consumer. But if I'm using a config-file parsing library, it's because I didn't want to have to write code to parse strings. So TOML is failing to make life easy for me (again, speaking as the hypothetical app's developer) here.

This is why the file-size suffix idea keeps coming back. Because nothing else in TOML currently can quite replace it.
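
To illustrate the bookkeeping that the suffix-in-the-key approach pushes onto the application, here is a sketch in Python. The key names follow the `file-size-mb` convention discussed above; `resolve_file_size` and the decision to reject conflicting keys are assumptions, since "which one wins?" has no standard answer.

```python
# Hypothetical key names following the `file-size-mb = 1` convention.
KEY_MULTIPLIERS = {
    "file-size-kb": 1024,
    "file-size-mb": 1024**2,
    "file-size-gb": 1024**3,
}


def resolve_file_size(config: dict) -> int:
    """Return the configured size in bytes, rejecting ambiguous configs."""
    present = [key for key in KEY_MULTIPLIERS if key in config]
    if not present:
        raise KeyError("no file-size-* key found")
    if len(present) > 1:
        # Assumption: refuse to guess instead of picking a winner.
        raise ValueError(f"conflicting keys: {present}")
    key = present[0]
    return config[key] * KEY_MULTIPLIERS[key]


print(resolve_file_size({"file-size-mb": 1}))  # 1048576
```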

tintin10q commented 1 year ago

Your terrible DX comes from allowing multiple file size inputs (kb, mb, gb, etc.) in the config file. Just allow only file-size-kb or only file-size-mb.

arp242 commented 1 year ago

Pretty much all of my remarks about durations also apply to sizes: https://github.com/toml-lang/toml/issues/514#issuecomment-1732430974

The proposal I mentioned in that comment also implemented size units (considering both are essentially the same, "a number with a suffix", I felt it was useful to consider both at the same time).

An additional issue for sizes is that there is rarely an obvious type to parse things into. "Just" parse to an int of bytes won't work, most stdlibs don't have anything for this, so you really do have to implement your own toml.SizeType type, class, struct, or whatever. The best option will probably differ per language, but the implementation considerations for all of this are a lot less obvious than it seems at first sight.

eksortso commented 1 year ago

Let me reiterate that we want to be as obvious to our users as possible. Implementers will have more work to do in order to make something that users can pick up quickly. In short, we should always favor ux over dx if there is a conflict of interest.

Your analysis on #514 was very helpful, though as far as bit and byte sizes are concerned, we're talking about a situation where users and developers are the same people, and you cite popular systems where kb/mb/gb suffixes are supported, even with fractional (float) values using the suffixes. And in such instances, the end type ultimately derived will be an integer and nothing else. That's my interpretation, anyway, and we've discussed implementation details at considerable length already.

Are these popular enough that we ought to fold this syntax into the TOML standard for our technical users? Is it worth it, to them and to all of us, for all TOML users to have access to this easy-to-read shorthand?

arp242 commented 1 year ago

And in such instances, the end type ultimately derived will be an integer and nothing else. That's my interpretation, anyway, and we've discussed implementation details at considerable length already.

If it's only an int with no additional information, then as a TOML application, how will I distinguish between "user really wants a small cache" vs. "they forgot to add a suffix"? In Python all I see is {"cache-size": 1024}, and that's not enough.

And "upgrading" something like cache-size = 1024 to cache-size = 1MB will be impossible, not without everyone updating their configs anyway. Certainly in the first few years the major use case will be "upgrading" existing keys and "just parse to int" makes this effectively impossible.

So you really do need to do something special as applications need to be able to tell if it's a regular number, or a number with a suffix.

It was discussed, sure, but I don't think anyone realized the full details. I didn't either until I actually implemented it (this is why people should really write/prototype code instead of discussing implementation details in the abstract).

That's not necessarily a show-stopper, but it does make things a tad harder, especially for application authors, and it's all a bit non-obvious at a glance. At the very least the changelog for this should make some notes about this, so implementers are aware of the potential issue before they start, and can communicate it to their users (application authors).

arp242 commented 1 year ago

And "do something special" seems simple, e.g. Python can use:

class Size(int):
    pass

Or something along these lines, applications can then use type(config.cache_size) is tomllib.Size. That seems okay at first sight, but my main fear is that stuff like this will happen in applications, especially in the upgrade-existing-key use case:

>>> class Size(int):
...  pass

>>> def set_cache(v):
...  print(v+1, type(v))

>>> set_cache(Size(42))
43 <class '__main__.Size'>

Because the Size subclass of int gets "duck typed" to a regular int, it will appear to "just work" unless the application author thinks of this possibility, which is easy enough to forget.

So, maybe the simple subclass isn't sufficient, and you want a Size.value attribute or function?
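
A sketch of that alternative: a wrapper that is deliberately not an int, so code that forgets about it fails loudly instead of silently "just working". The class layout and the `cache_bytes` helper are hypothetical, as is the assumption that legacy plain integers in this imaginary application meant kilobytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    """Hypothetical wrapper; deliberately NOT a subclass of int."""
    value: int  # number of bytes


def cache_bytes(v) -> int:
    """Handle the upgrade-existing-key case explicitly."""
    if isinstance(v, Size):
        return v.value   # new style: value written with a suffix
    if isinstance(v, int):
        return v * 1024  # assumption: legacy plain ints were kilobytes
    raise TypeError(f"unsupported cache size: {v!r}")


print(cache_bytes(Size(1048576)))  # 1048576
print(cache_bytes(1024))           # 1048576

try:
    Size(42) + 1  # no duck typing: arithmetic on the wrapper is an error
except TypeError as exc:
    print("TypeError:", exc)
```

The cost is that every consumer has to unwrap `.value`, which is the kind of non-obvious burden on application authors raised in the comment above.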

eksortso commented 1 year ago

My own take is that if file size syntax does not convert to an integer, then we are absolutely making things more complicated than they need to be. We want to be programming language agnostic, so special classes for file sizes are out of the question because they're not obvious.

Let's hold off until after the next release, and we can revisit this in a different light. Unless (as I keep asking) we get a lot more use cases in here to persuade us to move on it sooner than v1.1.0.

mav3ri3k commented 3 months ago

I have used a similar thing in two places:

In these applications I absolutely love the ability to use tagged integers. Both of these allow for interacting/validating/working with data which is then presented as static config format in the final step. TOML is the format to represent the final step. The validation step should not be part of a static config format. That should be upon the application layer to validate.

I think the best articulated view for my argument would be Cap'n Proto: FAQ where the author explains why "Required" is not available in Cap'n Proto, with a similar argument. It boils down to keeping the layers of problem separate.

I don't like the idea of adding the feature by principle, but it would definitely be a nice to have.

ccuser44 commented 3 weeks ago

TOML is supposed to be simple without bloat. A specific type for filesizes just doesn't make sense. Integers suffice for filesizes.