toml-lang / toml

Tom's Obvious, Minimal Language

https://toml.io

MIT License

range type #689

Closed alan-isaac closed 2 years ago

alan-isaac commented 4 years ago

I'm new to TOML and really liking it. The one thing I'd really find helpful is a range type, which implementations could interpret either as a range object (e.g., Python) or as an explicit array, depending on the language. I anticipate "just use a 3-array" or "just provide start, stop, and step attributes" as responses, but if you search you'll find that YAML and JSON users also request ranges from time to time. So I think there is a desirable feature here. I'm not going to suggest syntax but array syntax without commas [0 10 1] or doubled periods 0..10..1 or even Mathematica span style 0;;10;;1 pop into mind.

abelbraaksma commented 4 years ago

Perhaps because often in this discussion we refer to how things are done in programming languages, we lose sight of a key thing: a concise way of expressing arrays, that are a natural sequence of numbers. Just like a time-span is a range of time.

Having such an expression doesn't diminish the declarative nature of TOML, in my opinion, nor does it magically turn it into a programming language, far from it. It's basically just a different way of writing the same thing, but clearer than reading or writing, say, 100 numbers.

The confusion comes perhaps from the idea that ranges are often used in loops in programming languages, but the proposal here does not intend to apply the range result. It is not a branch or loop instruction.

To summarize:

foo = [1,2,3,4,5,6,7,8,9,10]

Is exactly equal to (assuming one variant of the syntax):

foo = [1..10]

The main difference being esthetics, number of keystrokes, being prone or not to errors, clarity of intent, and readability.

llacroix commented 4 years ago

I read this issue and I'd like to add my 2 cents. As a toml user, if we can say that, I really like toml because it does what it claims to do and does it very well.

Being so minimal allows it to be used as a drop-in replacement for json/ini/... without many issues. From my point of view, having a range/slice type is a bit similar to having a datetime type. The datetime type is a pain to implement because we live in a world with timezones, offsets and DST. There are so many ways to represent a date, and as many ways to do it wrong, that it can end up very badly.

That said, JSON's lack of such types doesn't prevent developers from using json. Special types can be handled as a substructure of a json object. The same thing can be done in toml for the range/slice type.

Saying that you want to have range being understood in any language is a nice thing but a file format contains data and how the data is read / interpreted by an application is a whole different thing.

So taking this example:

foo = [1..10]

There are at least 2 possibilities to interpret this:

It's generating a list from 1 to 10
It's generating a Range type that lets you iterate a value from 1 to 10

If you generate a list from 1 to 10, you open toml to side effects, like a file with a wrong range definition expanding to a few terabytes of RAM. That's not really nice... It would also make storage a bit difficult. So a range should be a dumb object that can be used to create a generator, like in python3. To create a list from 1 to 10 in python3 you'd have to write range(1, 11), but in python2 (which is supposed to be unsupported as of today, if I'm not mistaken) range is a function that returns a list, and xrange is the lazy variant. In javascript it doesn't exist, so it would have to be implemented. Other languages may have some support for ranges, but chances are a parser would have to define a custom type in many of them in order to support the extension. That alone means the file format wouldn't be just a file format anymore; it would start to get into "programming language" territory, where it has to define foreign types. That's one good way of ending up with flaky support for toml across languages, with multiple versions of toml for the same language because a design choice in one implementation didn't please someone else. It's not like the Date type, which is hardly a foreign type to any language out there.

In python3, I could handle the range issue with this format for [1,10]:

foo = [1, 11, 1]
ids = [x for x in range(*data['foo'])]

But yes, if you need to use it in another language, you'll have to know what each parameter means, e.g. whether 1 and 11 are inclusive or not. But since you define the format of your file, that should be part of the file format at the application level, not the file format at the transport level.

But nothing prevents you from having helper methods like:

array_to_range(array) -> range
range_to_array(range) -> array
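
The two helpers named above could look roughly like this in Python. This is only a minimal sketch, assuming the application's own convention is `[start, stop]` or `[start, stop, step]` with an exclusive stop, as in Python's `range`. The function bodies and the `data` stand-in for a parsed document are illustrative.

```python
def array_to_range(array):
    """Turn a [start, stop] or [start, stop, step] array into a range.

    Assumes Python's convention: stop is exclusive, step defaults to 1.
    """
    if len(array) not in (2, 3):
        raise ValueError("expected [start, stop] or [start, stop, step]")
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in array):
        raise TypeError("range bounds must be integers")
    return range(*array)


def range_to_array(r):
    """Turn a range back into the [start, stop, step] array form."""
    return [r.start, r.stop, r.step]


# Stand-in for the parsed TOML document `foo = [1, 11, 1]`.
data = {"foo": [1, 11, 1]}
ids = list(array_to_range(data["foo"]))
```

Because the array form keeps start, stop and step, converting back with `range_to_array` loses nothing, which is what makes writing the file out again straightforward.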

This way you can have a consistent way to parse substructures in toml which are specific to your application. Having it as part of the file format could be useful, but it's so trivial to implement that I wonder if it's worth the hassle to make it part of the language, as it also comes with incompatibilities and sacrifices.

For example, Rust doesn't seem to support a step like python3; in ruby you can include or exclude the last element; in C#, the range takes start and count parameters, so 2,10 would yield [2,3...10,11], which means we can't have negative values for the second parameter. Java support for ranges seems to exist in many forms, but I couldn't find one meant to be used to generate a list as an iterator, so it might be implementation specific, like javascript.
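
To make the mismatch concrete, here is a small Python sketch that reads the same pair `(2, 10)` under three of the conventions mentioned: half-open as in Python, closed as in Ruby's `start..stop`, and start plus count as in C#'s `Enumerable.Range`. The function names are made up for illustration.

```python
def python_style(start, stop):
    """Half-open: stop is excluded, as in Python's range(start, stop)."""
    return list(range(start, stop))


def inclusive_style(start, stop):
    """Closed interval: stop is included, as in Ruby's start..stop."""
    return list(range(start, stop + 1))


def start_count_style(start, count):
    """Start plus element count, as in C#'s Enumerable.Range(start, count)."""
    return list(range(start, start + count))


pair = (2, 10)
half_open = python_style(*pair)      # ends at 9
closed = inclusive_style(*pair)      # ends at 10
counted = start_count_style(*pair)   # ends at 11
```

The same two integers produce three different sequences, so a file that stores only the pair has to state somewhere which reading is intended.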

One other thing is that the design choice of how to implement the range could be influenced by its use: is it thread-safe, can it be used as an async iterator? I think it should be left to the user to implement it for their application.

Saying that you should be able to read a file between languages is barely an argument: if I open a file x.json in one python app, another ruby app won't comprehend what to do with the file even if json is correctly supported by ruby. So even if toml supported ranges, it wouldn't make all apps magically understand your file; you'd still have to implement your app to expect a range or to parse a range. Handling the conversion in your app puts the maintenance burden on yourself, and having it inside toml puts the burden on every implementer of the toml format. It would suck to be unable to open a file in a certain language because implementers couldn't decide how to implement ranges in a way that pleases everyone, when you can just put an array or an object and call it a day.

alan-isaac commented 4 years ago

@llacroix I suggest that the most basic question is simpler than you appear to believe. The question is simply whether TOML files should have a syntax to more simply describe a certain simple and common type of array, which would otherwise have to be typed out explicitly.

For this basic question, your most powerful argument is that the syntax is so flexible that a TOML file might cause a parser to generate a dismayingly large array. This will not affect parsers that choose to return a range object of some type, rather than explicitly constructing the array. Worrying about such parser decisions is like worrying about whether the current Python parser should return a list, a tuple, or an array. I think it is out of scope?

The need for this is not for sharing across instances of a single application but for sharing certain kinds of configurations across diverse applications. A good point of reference is thinking about how TOML files could be used to configure print jobs by providing an array of page numbers. The configuration should be parsed to the sequence of page numbers, not to an object that could be turned by an app here or an app there, but not by all printer apps, into this sequence.

But again, the question is just one of convenient syntax. Just as you do not insist that TOML users represent number literals with hex notation, why should they not more simply represent certain common and simple array literals?

llacroix commented 4 years ago

> But again, the question is just one of convenient syntax. Just as you do not insist that TOML users represent number literals with hex notation, why should they not more simply represent certain common and simple array literals?

Because ranges aren't literals for arrays; they're pretty much literals for control flow. They are used to avoid loading a complete data structure into memory. You can accumulate all the values or reduce them to a single one.

In other words, range is the functional version of

for(i=start; i<stop; i+=step)
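
The loop above can be written as a lazy Python generator, which is roughly what a range object does. A minimal sketch, assuming a positive step; `int_range` is a hypothetical name.

```python
def int_range(start, stop, step=1):
    """Yield start, start+step, ... while the value is below stop.

    Mirrors `for (i = start; i < stop; i += step)`; assumes step > 0.
    """
    if step <= 0:
        raise ValueError("this sketch only handles a positive step")
    i = start
    while i < stop:
        yield i
        i += step


# Values are produced one at a time; a list exists only if the caller builds one.
values = list(int_range(1, 11, 3))
```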

> This will not affect parsers that choose to return a range object of some type, rather than explicitly constructing the array. Worrying about such parser decisions is like worrying about whether the current Python parser should return a list, a tuple, or an array. I think it is out of scope?

It's not out of scope, because if the parser can return different types, it would make it difficult or even impossible to load certain files in some languages, as range != array.

Let's say you have a file that looks like this:

[job.a]
pages = [1:3]

[job.b]
pages = [1:2, 5:7]

[job.c]
pages = [1:2, 10]

In this case, job.c.pages would be a list of [range, int], which is incompatible if you load a List<Range>, for example. The others would be lists of ranges only. If you wanted to explode them into lists you'd have a List<List>, and in the last example you'd still have an int in conflict, so you'd have to write this instead:

[job.c]
pages = [1:2, [10]]

To be able to do something like this:

for page_ranges in data['pages']:
    for page in page_ranges:
        do_something(...)

But let's put aside the memory limit and OOM killer issue and let's say we want to explode ranges and join the lists together, so that [1:2, 5, 7:10] explodes into [1,2,5,7,8,9,10]. Then we could in theory loop over all the elements as if it were a list... But like I said earlier, if you do that, you're not capable of serializing it back into a range.
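
The explode-and-join step can be sketched in Python. Since `1:2` is not valid TOML, this assumes the ranges are written as inclusive `[first, last]` pairs in a mixed array (`pages = [[1, 2], 5, [7, 10]]`), which TOML 1.0 accepts. `explode_pages` is a hypothetical helper.

```python
def explode_pages(spec):
    """Flatten a mix of ints and inclusive [first, last] pairs into ints."""
    pages = []
    for item in spec:
        if isinstance(item, int):
            pages.append(item)
        else:
            first, last = item
            pages.extend(range(first, last + 1))
    return pages


# Stand-in for the parsed value of `pages = [[1, 2], 5, [7, 10]]`.
spec = [[1, 2], 5, [7, 10]]
pages = explode_pages(spec)
```

Once `pages` is a flat list of integers, nothing in it records that some entries came from pairs, which is the round-trip problem described above.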

With that file:

[stars]
indices = [1:100000000000000000]

Let's imagine you do that:

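import toml  # third-party "toml" package

# load the file and explode as list
x = toml.load("file.toml")
x['stars']['good'] = [1, 2, 3]
with open('file.toml', 'w') as f:
    toml.dump(x, f)  # toml.dump takes the data first, then the file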

If those things are loaded as array literals, they're going to be saved as arrays of int. Though the data didn't really change, you go from a file of a few bytes to a file of a couple thousand bytes, just because expansion loses the information: it can't know what range was previously stored in the file if it expands it.

And that would be difficult to handle correctly, because if you expand it, you're either breaking the RAM (you're certainly going to run out of memory) or you need to keep it as a range and handle it as a completely different type, since they're not array literals; but in that case you're getting hit by platform limitations in implementation-specific ways. I mean, even if we could set a limit on the parser to limit 1 expansion to 1000 elements, it doesn't prevent a malicious user from inputting 1000 ranges expanding to 1000 elements each. And all those checks add complexity to a parser, and you only need to forget one case to wish it had never been implemented.
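
One way such a check could look is a budget on the total number of elements across the whole document rather than per range. This is only a sketch: the limit value, the inclusive `(first, last)` pair representation and the name `expand_with_budget` are all assumptions for illustration.

```python
MAX_TOTAL_ITEMS = 1000  # hypothetical budget for the whole document


def expand_with_budget(ranges, budget=MAX_TOTAL_ITEMS):
    """Expand inclusive (first, last) pairs, refusing to exceed the budget.

    Sizes are computed arithmetically before any list is built, so an
    oversized range is rejected without allocating its elements.
    """
    total = sum(max(0, last - first + 1) for first, last in ranges)
    if total > budget:
        raise ValueError(f"ranges expand to {total} items, limit is {budget}")
    return [list(range(first, last + 1)) for first, last in ranges]


small = expand_with_budget([(1, 3), (5, 7)])
```

A per-range limit alone would let 1000 ranges of 1000 elements through, whereas summing first catches both that case and a single huge range. It is still one more rule every parser would have to get right.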

Also it's not very typical to manipulate ranges directly in code. For example, I don't see why in code I'd build something like this:

pages = [range(1, 10), 1, range(13, 45)]

The only way I see how it could make sense is if you received an input text as toml to be parsed to [range(1, 10), 1, range(13, 45)] but as you suggest, it would return a list of int anyway so if you had

pages = pages_from_toml()

It would always store a list of int, and your output file would always have the exploded version of an array literal. So in order for a software to output ranges you'd have to manually do something like this

pages = []
for rarr in ranges:
    pages.append(range(rarr.start, rarr.stop))

How is that easier to write than:

pages = []
for rarr in ranges:
    pages.append([rarr.start, rarr.stop])

Programming-wise, it's not particularly different; a range is really just a start, a stop and possibly an increment.
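
If an application only has the exploded list and still wants compact output, it has to rebuild the start/stop pairs itself. A minimal sketch, assuming sorted, distinct integers and inclusive `[first, last]` pairs; `collapse_to_ranges` is a hypothetical helper name.

```python
def collapse_to_ranges(numbers):
    """Group a sorted list of distinct ints into inclusive [first, last] pairs."""
    ranges = []
    for n in numbers:
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n          # extend the current run
        else:
            ranges.append([n, n])      # start a new run
    return ranges


compact = collapse_to_ranges([1, 2, 5, 7, 8, 9, 10])
```

This recovers runs of consecutive integers but cannot tell whether the original file spelled them as ranges or as individual numbers.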

alan-isaac commented 4 years ago

Your (@llacroix) first comment completely misses the point. A range syntax will be an array literal if TOML says so. It is that simple. This is a very simple point. I am not understanding why it is repeatedly ignored. @abelbraaksma has explained this multiple times, and this observation is no more than a standard CS use of the term "literal". It is just a matter of convenient syntax.
I will repeat the other simple but apparently misunderstood point. The goal is not to have a TOML file plus schema that together can be used to produce a configuration. (I am not just noting the lack of a TOML schema framework.) The goal is to have a TOML file that directly represents the configuration. That is, it is to facilitate the direct use of handwritten TOML files as configuration files, which is a common use for them. Each suggested workaround completely misses the point. Both @abelbraaksma and I have tried very hard to draw this distinction, but neither you nor @ChristianSi have offered any indication of understanding what we are trying to say. That may be the fault of our communication skills -- I for one do not have formal CS training -- but surely you can overcome our shortcomings in that area.
The worry about exploding array sizes is something of a red herring. Right now, a TOML file has no restriction on array sizes, so I can already send a file with enormous arrays. Of course the difference is that right now the TOML file would have to be correspondingly large. If this difference is seen as significant, then the size of arrays specified with a range notation could be limited (e.g., to 1000 items). But seriously, if array size is a concern, then parsers should protect against that no matter what the source, so that discussion should be entirely separate.

RedHatTurtle commented 3 years ago

Similar to #428 ?

JeppeKlitgaard commented 3 years ago

I think this is the case where a good feature might be missed due to some partisan entrenchment that has bubbled up over the course of the discussion.

As someone with no horse in this race, I think that a range should only be added if it is truly a datatype.

That is, the TOML parser should not evaluate the range and provide an array of numbers. It should provide a range object, which the implementation would then know how to deal with. If this was to be the case, the discussion reduces to whether TOML should provide a clearly defined range type, much like it provides a datetime type.

Assuming that range is now a type that is not evaluated by TOML, but merely parsed, this also requires the implementation to recognize that in almost all cases the field could be EITHER a range or an array of numbers. This likely wouldn't be much of an issue, but does add a bit of complexity to the language.

I think it then also becomes important to distinguish between bounded and unbounded ranges. These would be two separate types, ideally. It is important that an application can know whether a range is bounded or unbounded, since in many cases an unbounded range might not be appropriate and could lead to non-terminating execution.
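
As a rough illustration of why two separate types help, here is a Python sketch in which only the bounded type can be turned into a list. The class names, the inclusive upper bound and the positive-step assumption are all choices made for this example, not anything TOML defines.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator


@dataclass(frozen=True)
class BoundedRange:
    start: int
    stop: int  # inclusive upper bound in this sketch
    step: int = 1  # assumed positive

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.start, self.stop + 1, self.step))


@dataclass(frozen=True)
class UnboundedRange:
    start: int
    step: int = 1

    def __iter__(self) -> Iterator[int]:
        return count(self.start, self.step)  # never terminates on its own


def materialize(r) -> list:
    """Build a list only when the range is known to terminate."""
    if not isinstance(r, BoundedRange):
        raise TypeError("refusing to materialize an unbounded range")
    return list(r)


pages = materialize(BoundedRange(1, 5))
```

With distinct types the application can reject an unbounded range up front instead of discovering the problem as a loop that never ends.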

alan-isaac commented 3 years ago

@JeppeKlitgaard Bounded ranges would be great. Personally, I have no need for unbounded ranges or non-integer ranges.

pradyunsg commented 2 years ago

I've just read through the whole discussion for a third time.

I don't think a range type is particularly useful to add on its own -- it's just not as common -- and I don't think that adding a mechanism to have a shorthand for [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100] is particularly useful.

I still think this is best solved on a per-application basis, since this is a fairly niche concern and the inline table syntax is totally up to the task at hand here.
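
For reference, a per-application approach with an inline table such as `pages = { start = 1, stop = 10 }` might be handled like this in Python. The key names, the inclusive `stop` and the helper `pages_from_table` are assumptions for the sketch; each application would pick its own convention.

```python
def pages_from_table(table):
    """Expand an inline table such as {start = 1, stop = 10} into page numbers.

    Assumes `stop` is inclusive and `step` is optional (default 1, positive).
    """
    start, stop = table["start"], table["stop"]
    step = table.get("step", 1)
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, stop + 1, step))


# Stand-in for the parsed value of `pages = { start = 1, stop = 10 }`.
config = {"pages": {"start": 1, "stop": 10}}
pages = pages_from_table(config["pages"])
```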

pradyunsg commented 2 years ago

Thanks for the surprisingly passionate discussion on this folks, as well as for your patience on this! :)

alan-isaac commented 2 years ago

This is a perfunctory assessment which seems to be no more than the following: "I don't need this feature, so without further evidence I'm going to call it 'niche' and close this discussion without seriously addressing any of the issues that have been raised."

Disappointing, to say the least.

I try to imagine someone offering a similar case against ranges in any of the many languages where they are a core language feature.

Please reopen this issue for a more serious and objective consideration.

arp242 commented 2 years ago

> This is a perfunctory assessment which seems to be no more than the following: "I don't need this feature, so without further evidence I'm going to call it 'niche' and close this discussion without seriously addressing any of the issues that have been raised."

It's been discussed extensively, and the overall consensus seems fairly clear to me. This is not necessarily a democratic vote, but it wasn't "just closed" on the whims of one person.

As far as I'm concerned it's up to you to demonstrate that there is a demand for this feature, not for the TOML maintainers to demonstrate there is no demand for it. And with "demonstrate that there is a demand" I don't mean "it might be useful in this hypothetical scenario" but "here are a bunch of popular projects with TOML configurations that would be made better by this", and similar more concrete stuff.

Remember: almost every single feature that has ever been added to any configuration file, programming language, or other piece of software was useful to someone, at some point. I don't find "it's useful in scenario X" on its own to be a very good argument in these types of discussions, as it can be used for everything.

Personally it seems to me the need for such a feature is too rare; I thought about this for a few minutes and the only use-case I can think of is Vim's iskeyword setting. Although I'm sure there are other use-cases, they don't seem common. Plus it can be worked around quite easily with rng = "1..10 20..30", rng = [[1, 10], [20, 30]], or the inline table syntax mentioned in the 2019 discussion. This is perhaps a bit suboptimal, but seems workable enough.
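
The string workaround needs a few lines of parsing on the application side. A minimal Python sketch, assuming whitespace-separated `a..b` tokens with inclusive integer bounds; `parse_ranges` is a hypothetical helper.

```python
def parse_ranges(text):
    """Parse a string like "1..10 20..30" into inclusive (first, last) pairs."""
    pairs = []
    for token in text.split():
        first, sep, last = token.partition("..")
        if not sep:
            raise ValueError(f"expected 'a..b', got {token!r}")
        pairs.append((int(first), int(last)))
    return pairs


# Stand-in for the parsed value of `rng = "1..10 20..30"`.
pairs = parse_ranges("1..10 20..30")
```

The nested-array form `rng = [[1, 10], [20, 30]]` skips the string parsing entirely, at the cost of being slightly noisier to write by hand.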

alan-isaac commented 2 years ago

@arp242 Your comments suggest that you have not understood the discussion. They also weirdly suggest that the only use of TOML you can imagine is for project configuration files.

I'm not going to rehash the whole discussion for you, but note that when you say "it can be worked around quite easily" you completely overlook that TOML does not include any facilities for communicating semantics (e.g., a grammar language, like JSON schema). This issue is already raised above, in detail. Just imagine if someone proposed that TOML tables could "quite easily" be replaced with strings or nested lists of strings.

These comments signal only a desire to close this issue for lack of progress, not a desire to actually come to grips with it.

arp242 commented 2 years ago

I said what's needed to move things forward as far as I'm concerned: show popular projects with TOML configurations that would be made better by this. So if you really want this badly I suggest working on that.

> They also weirdly suggest that the only use of TOML you can imagine is for project configuration files.

That is its goal.

alan-isaac commented 2 years ago

@arp242 Here is the actual goal statement: "TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics."

This statement encompasses the configuration of simulations.

pradyunsg commented 2 years ago

@alan-isaac I appreciate that you feel strongly about this, and also understand that you're disappointed about the answer/responses here (from the broader group of individuals who've contributed to the discussion here as well as myself). I appreciate and empathize that this would be beneficial for your use case, if this were added.

However, I don't think the arguments made here are compelling and I certainly disagree that this is as broadly applicable as has been claimed in this thread at various points.

Sure, programming languages do have range syntaxes or mechanisms to generate ranges. However, TOML is not a programming language. I'm not aware of any major configuration language that has ranges. Even YAML doesn't have them and YAML somewhat famously has too-many-things.

> I try to imagine someone offering a similar case against ranges in any of the many languages where they are a core language feature.

A broadly-applicable programming language? Oh, I'd be very opposed to not having ranges in a programming language. However, TOML is not a programming language.

alan-isaac commented 2 years ago

@pradyunsg Thanks for your reply. You are correct that I am disappointed, but the reason is different. I am disappointed when those who reply have made little effort to understand the use case, which resulted in the justification for closing this issue being completely off base. This lack of understanding was a problem encountered early in this thread and eventually somewhat resolved. Misunderstanding of the use case and mischaracterization of the request leads to proposals of supposedly easy workarounds that entirely miss the point.

Your "TOML is not a programming language" critique, also found above, again misses the point. Nobody has proposed adding, say, control flow. The only proposal is the addition of a range literal.

So far, the primary critique that has merit is the claim that "there is no obvious syntax". However, the closed integer interval notation from mathematics (e.g., [1..10]) is in fact quite obvious. And in any case, if obviousness were the real objection, then it should set off a search for an acceptable syntax.