metaeducation / rebol-issues


Define a WHITESPACE charset #2189

Open rebolbot opened 9 years ago

rebolbot commented 9 years ago

Submitted by: Ladislav

I think that it is useful to have it defined. It seems to be used frequently enough to justify the need.

    whitespace: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]
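To illustrate how such a predefined set might be used, here is a minimal sketch assuming Rebol 3 (R3-alpha) semantics. The words `non-space`, `key`, and `val` are hypothetical names chosen for the example and are not part of the proposal.

```rebol
; The charset as proposed above.
whitespace: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]

; Hypothetical helper: every character that is NOT in the whitespace set.
non-space: complement whitespace

; FIND accepts a bitset and locates the first character that belongs to it.
find "abc def" whitespace

; In PARSE a bitset matches a single character, so SOME and ANY apply as usual.
; "^-" is a tab, so the separator here is space, tab, space.
parse "key ^- value" [
    copy key some non-space
    some whitespace
    copy val some non-space
]
; key and val now hold the two fields on either side of the whitespace run.
```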

CC - Data [ Version: r3 master Type: Wish Platform: All Category: Parse Reproduce: Always Fixed-in:none ]

rebolbot commented 9 years ago

Submitted by: fork

(Hi Ladislav nice to hear from you, do check in on chat sometime if you have a moment...)

Predefining character sets is a crucial idea, especially when advocating for the ease of use of PARSE. There has been significant discussion on how to do it. The Unicode standard actually has character classes, and it would be desirable to be able to offer sets for them:

http://www.fileformat.info/info/unicode/category/index.htm

The concept of defining it as a function is a nice one; it would for instance allow whitespace to be meaningful as well as whitespace/ascii. It also allows the sets to be generated and cached on demand. You could use it in FIND or PARSE or whatever...
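A rough sketch of what "generated and cached on demand" could look like, assuming R3-alpha semantics. The `charset-cache` object is a hypothetical name, and treating `/ascii` as "the same set without the non-breaking space" is an assumption made for illustration, not something settled in this thread.

```rebol
; Hypothetical cache: slots start empty and are filled on first request.
charset-cache: context [full: none ascii: none]

whitespace: func [
    "Return a whitespace bitset, building it only on first use."
    /ascii "Assumed meaning: restrict the set to 7-bit characters"
][
    either ascii [
        ; ANY returns the cached bitset if present, otherwise builds and stores it.
        any [
            charset-cache/ascii
            charset-cache/ascii: charset [#"^A" - #" " #"^(7F)"]
        ]
    ][
        any [
            charset-cache/full
            charset-cache/full: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]
        ]
    ]
]

; FIND evaluates its arguments, so the function call works directly here.
find "abc def" whitespace
find "abc def" whitespace/ascii
```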

...however, it will not work with PARSE unless PARSE allows function evaluation. I added that in a PR, so it's certainly possible. At one point I thought arbitrary evaluation with function parameters would be okay if the parameters wound up inline with the parse dialect code. I now agree with Carl (and others) that only zero-arity functions should be allowed inline in parse code. Under that premise this would be legal:

    some-rule: function [/b] [
        either b [[some "b"]] [[some "a"]]
    ]
    parse "aaaabbbb" [some-rule some-rule/b]

While this would be rejected, and hit an error on the first attempt to use a non-zero-arity call:

    some-rule: function [value [char!]] [
        compose [some (value)]
    ]
    parse "aaaabbbb" [some-rule #"a" some-rule #"b"]

I've written up a deeper rationale behind why this is not a loss of meaningful generality--with the benefit of not making PARSE rules any more nuts than they can get already. :-)
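One way the parameterized case can still be expressed without inline arguments is to build the rule block before PARSE sees it. This is only a sketch under R3-alpha semantics and is not necessarily the rationale referred to above. It uses `func` rather than `function`, and the word `rule` is a hypothetical name.

```rebol
; Same generator as in the rejected example: takes a char!, returns a rule block.
some-rule: func [value [char!]] [
    compose [some (value)]
]

; Call the generator outside the parse dialect. COMPOSE splices each returned
; block into place, producing: [some #"a" some #"b"]
rule: compose [(some-rule #"a") (some-rule #"b")]

; PARSE now receives plain dialect code with no function calls in it.
parse "aaaabbbb" rule
```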

Surveys of our proposals for these classes can be found in chat search, so if you stop by we can dig up what those were. Offhand I believe we were going with digit, letter, whitespace, symbol... with refinements on each to do narrowing. So letter/latin8/uppercase would be more specific, while letter would be very general and match anything in the Unicode spec that was a letter.