ozra commented 8 years ago

Note this issue only covers the surface of constructs, that is, the lexical aspects. For type definitions, etc., there are separate "Doc / RFC" issues.

There are some [RFC] markers in this text. They mark lexical elements that are very much up for debate. You can question any element, but the marked ones are most in need of discussion.

Identifiers

Variable and Function Identifiers

my–identifier = "47"  -- here using ENDASH in the identifier

if my_identifier == my–identifier => say "Yep - snake case is interchangeable!"
if my-identifier == my–identifier => say "Yep - hyphens (lisp case) also!"
if myIdentifier == my–identifier => say "Yep - even camelCase works!"

my-fun-with-qmark?(foo) ->
   foo == "Say what?"

bar = my-fun-with-qmark?("Say what?")  -- => true

Internally the separators are all represented the same way and therefore comparable.

I decided to transparently support (–|-|_) (ENDASH (\u2013), hyphen, underscore) interchangeably as "word-delimiters".
- For the choice of ENDASH - of all available Unicode characters - this came about through a long heated debate when I was hacking on the Nim compiler. If you want some motivations you can probably find that discussion with a search.
Since there is a big crowd favouring humpNotation, I've now added experimental support for that too. I'm not a fan, but it's there for completeness in this sandbox, try-everything-out phase of Onyx.
The rule for camelCase is that a capital Latin letter is translated to '_' + lcase(char) - keep that in mind!
This behaviour was inspired by Nim. There is a major difference though: Nim throws away the delimiter information (it is case-insensitive past the initial character), whereas in Onyx a delimitation is a delimitation; it can simply be expressed differently.
To maintain Crystal-compatibility: Callable identifiers may end with ? and !.
The (?|!) ending characters are definitely up to debate in entirety - especially since nil-handling sugar using identifier?method-to-call-if-not-nil() and identifier!method-to-call-if-not-nil-otherwise-throw() will be implemented (if no better idea turns up?). The favoured idea in my head atm is that the nil-handling notation favours callables ending with ?, and if not defined, looks for one without it.
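As a rough illustration of the delimiter rules above, here is a minimal Python sketch. It is not the Onyx implementation: `normalize_identifier` and `nim_style` are hypothetical names, the choice of `_` as the internal delimiter is an assumption, and the sketch only covers identifiers that start with a lowercase letter. The Nim-like key is a simplified reading of Nim's identifier comparison, included only for contrast.

```python
import re

# Map ENDASH (U+2013) and hyphen onto underscore, the assumed internal delimiter.
_DELIMITERS = str.maketrans({"\u2013": "_", "-": "_"})


def normalize_identifier(name: str) -> str:
    """Return one canonical spelling for a variable/function identifier."""
    name = name.translate(_DELIMITERS)
    # camelCase rule from the text: capital Latin letter -> "_" + lowercase letter.
    return re.sub(r"[A-Z]", lambda m: "_" + m.group(0).lower(), name)


def nim_style(name: str) -> str:
    """Nim-like comparison key: underscores dropped, case folded after char one."""
    return name[0] + name[1:].replace("_", "").lower()


spellings = ["my\u2013identifier", "my-identifier", "my_identifier", "myIdentifier"]
# All four spellings collapse to a single canonical form.
print({normalize_identifier(s) for s in spellings})

# The delimitation is kept, so these two stay distinct in this sketch...
print(normalize_identifier("my_identifier") == normalize_identifier("myidentifier"))
# ...whereas the Nim-like key treats them as the same identifier.
print(nim_style("my_identifier") == nim_style("myidentifier"))
```

The point of the contrast is that only the way a word boundary is written is interchangeable, not whether the boundary exists.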
Quick (Biased) Comments
There are some studies that taken together point to dash-notation being better for comprehension, snake_notation second and humpNotation a far down third.
- There are some studies pointing the other way, but examining them, they were rather flawed in their design.
- Current studies must be seen as rather inconclusive.
- As with all things programming there is a big gap in the scientific community - any more studies that might apply would be welcome!
I personally warmly recommend the ENDASH or lisp-case notation. After 28 years of coding, with an "always looking for the better way" mindset and trying all kinds of styles, that's the one I keep favouring more and more. Much more fluent. The ENDASH "grabs" the neighbouring chars more, and might look more fluent than hyphens in many (monotype) fonts. This is however entirely up to your project's style specification. Personally, in practice, I end up using hyphen.
Types

type MyType
   some–member Int32 = 47
end

my-type-instance = MyType()

Type names are always initial capital

Pros

Visual disambiguation - good for human
Fewer ambiguities already at parsing stage - good for compiler speed
Cons
For a big team / company whose mother tongue is written in non-Latin letters, this means that if you'd like to write the code in your own language, types must begin with, say, Latin T or a similar ad hoc scheme (we can't practically consider capitals in every possible script, and not all scripts have the notion of a capital distinction).
- It could be argued that English is the lingua franca, especially in programming contexts, and as such it should be favoured as the basis for symbols in source code. (I'm not a native English speaker myself, although I do have a Latin alphabet superset - which makes me biased)
- If it would be of great benefit and regarding a major language (Japanese, Chinese, Russian, etc.) this could be reconsidered in the future, if not too intrusive - also: when to stop adding? Pandora's box...
- I hold the position for now that pseudo-English should be the lingua franca of programming, and thus the scheme holds water.
Prior Art, Preference and Motivation

This idea has followed my language design ideas for about a year now. When I stumbled upon Crystal, I saw it used the scheme also. Click. As Crystal now is in the family from AST-level down to LLVM - this is set in stone.

This, along with constants, also forms the notion "capital initial letter = compile time fixed symbol".

Constants

MyConstant = 47

type Foo
   MyFooConstant = 47
end

Constants, just like types, are a compile time construct, constant at run time. Therefore they share the lexical notion of capitalization.
This lets us get rid of having to specify const constantly [hehe], keeping source cleaner and more focused on logic terms than formalia word noise.

Constants aren't "dangerous", they're the "safest part" in code, so why should they have a "shout out look"? Well, they have a more "formal" importance. Say you see code compare x to thingie: what is thingie now? But if x is compared to Thingie, you know that Thingie is a formalized, important concept, so it does hold higher system-wide importance. I believe this justifies capitalization apart from its status as a compile time constant, which in that regard is a less important factor. In addition it helps speed up compilation.

If you're hell bent on having some constant lower case, you could wrap it in a function - compiled in release mode this will be the exact same machine code and exact same speed:

CRUDE_PI = 3.14
crude-pi() -> CRUDE_PI

say "Hey, my lowercase 'constant': {crude-pi * 2}"

Global Variables

-- currently:
$my-global = 47
$my-thread-bound-global = 42 'thread-bound

Globals are prefixed with $ - because it looks ugly (and makes them pop out clearly)
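To make the "first character tells you the category" idea concrete, here is a small Python sketch. It only models the three rules stated so far (capital initial, `$` prefix, everything else); `classify_identifier` and the category labels are hypothetical and not taken from the Onyx lexer.

```python
def classify_identifier(name: str) -> str:
    """Guess the lexical category of an identifier from its first character."""
    if not name:
        raise ValueError("empty identifier")
    if name[0] == "$":
        return "global variable"
    if name[0].isupper():
        # Types and constants share the capital initial. From the name alone
        # we only learn that the symbol is fixed at compile time.
        return "compile-time symbol (type or constant)"
    return "variable or function"


for name in ["MyType", "CRUDE_PI", "$my-global", "crude-pi"]:
    print(name, "->", classify_identifier(name))
```

No symbol table is consulted here, which is the property the capitalization rule is meant to buy at the parsing stage.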
Namespacing Paths

You simply use dots: SomeModule.SomeType.a-func()

Literal Values

Numbers

my-int-number = 47
my-real-number = 3.1415
my-hex-int = 0x2f

my-literal-typed-int = 47u64

-- likely future idiomatic way (no longer typed at literal)
-- my-literal-typed-int U64 = 47

my-big-number = 1_000_000_000  -- underscores can be used to clarify

[RFC] The literal typings will be removed. Currently a literal int is typed StdInt by default, and then if assigned to a var that is typed as, say UInt8, it fails because of type mismatch - which is ridiculous from a human being's perspective. The type inference will be improved for this - just have to figure out the "right way" to implement it conceptually.

The data type is StdInt* for integer literals by default. The data type is Real for real literals by default.

(*) Note StdInt will be changed to be called simply Int, provided coordination with Crystal team holds.

The data type used for the literals can be changed, either explicitly as above, or through parse-pragma: 'int-literal=BigInt - this would cause any literal integers to produce BigInts instead. 'real-literal=FixedPoint[4] - you get the picture.

The variables in the above examples are inferred to the type of the literal - they're not dynamic.
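The integer literal forms shown above (plain, hex, underscore-separated, and suffixed) can be sketched as a tiny Python parser. This is an illustration under assumptions, not Onyx's lexer: `parse_int_literal` is a hypothetical name, the set of accepted suffixes beyond `u64` is a guess, and the `default_type` parameter merely stands in for what the 'int-literal= parse-pragma would do.

```python
import re

# Digits (decimal or 0x-prefixed hex, underscores allowed) plus an optional
# type suffix. Only "u64" appears in the text; the other suffixes are assumed.
_INT_LITERAL = re.compile(
    r"^(?P<digits>0x[0-9a-fA-F_]+|[0-9][0-9_]*)(?P<suffix>[iu](?:8|16|32|64))?$"
)


def parse_int_literal(text: str, default_type: str = "StdInt"):
    """Return (value, type name) for an integer literal."""
    match = _INT_LITERAL.match(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    digits = match.group("digits")
    if digits.startswith("0x"):
        value = int(digits[2:], 16)
    else:
        value = int(digits, 10)  # Python's int() accepts single underscores
    # An explicit suffix wins; otherwise the (pragma-controlled) default applies.
    return value, match.group("suffix") or default_type


print(parse_int_literal("47"))
print(parse_int_literal("0x2f"))
print(parse_int_literal("47u64"))
print(parse_int_literal("1_000_000_000"))
print(parse_int_literal("47", default_type="BigInt"))
```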

Tags (aka Symbols)

my-tag = #some-tag

my-fun(foo Tag) ->  -- note, you don't have to specify the type - inferred!
   case foo
   when #some-tag    => say "It was some tag"
   when #other-tag   => say "It was other tag"
   else              => say "It was {foo} - which I don't recognise"

my-fun #funky-tag  --> "It was funky_tag - which I don't recognise"

Tags (think "hash-tags"...) are unique program-wide. Each gets a unique Int32 number internally, and so they are very efficient. Preferably you should use enums, but in some cases, just having ad hoc tags is very convenient. They are as easy to use as strings as identifying tokens, but with the performance of an Int32.
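The "unique number per tag" idea is essentially interning. Below is a minimal Python sketch of such a table, assuming tag names go through the same delimiter normalization as identifiers (suggested by `#funky-tag` printing as funky_tag above). `TagTable` and `intern` are hypothetical names, and how Onyx actually assigns the numbers is not specified in this issue.

```python
class TagTable:
    """Program-wide table handing out one small integer per distinct tag name."""

    def __init__(self):
        self._ids = {}

    def intern(self, name: str) -> int:
        # Assumed: ENDASH and hyphen are equivalent to underscore in tag names.
        key = name.replace("\u2013", "_").replace("-", "_")
        # A new tag gets the next free number; a known tag keeps its number.
        return self._ids.setdefault(key, len(self._ids))


tags = TagTable()
some_tag = tags.intern("some-tag")
other_tag = tags.intern("other-tag")

# Comparing tags is an integer comparison, not a string comparison.
print(some_tag == tags.intern("some_tag"))
print(some_tag == other_tag)
```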

Strings

my-string = "A simple string"
my-interpolated-string = "Interpolation: {my-string} with sugar on top!"
-- any kind of expressions can go in the interpolation brackets of course!

the–str = "111kjhgkjh" \
   "222dfghdfhgd"

--> "111kjhgkjh222dfghdfhgd"

yet-a–str = "111kjhgkjh
   222dfghdfhgd
   333asdfdf
"
-- above preserves the white space and newlines

my-straight-str = %s<no {interpolation\t here}\n\tOk!>
-- for the %* string notations, you can pick your delimiter chars yourself,
-- which ever makes the particular string clearer: `<...>`, `(...)`, `{...}`
-- or `[...]`:

The data type is Str / String
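The %-notations can be pictured with a short Python sketch: an optional `s` flag right after `%` marks a straight (non-interpolating) string, and the opening delimiter picks the closing one. `parse_percent_string` is a hypothetical helper, and the sketch assumes the whole literal is given and does not handle nested delimiters or escapes.

```python
# Opening delimiter -> closing delimiter, as listed in the comment above.
_CLOSERS = {"<": ">", "(": ")", "{": "}", "[": "]"}


def parse_percent_string(literal: str):
    """Return (body, interpolating) for a %-style string literal."""
    if not literal.startswith("%"):
        raise ValueError("not a %-string literal")
    rest = literal[1:]
    straight = rest.startswith("s")  # "%s<...>" means no interpolation
    if straight:
        rest = rest[1:]
    if len(rest) < 2 or rest[0] not in _CLOSERS:
        raise ValueError("unknown or missing delimiter")
    if not rest.endswith(_CLOSERS[rest[0]]):
        raise ValueError("unterminated literal")
    return rest[1:-1], not straight


print(parse_percent_string("%s<no {interpolation} here>"))
print(parse_percent_string("%(hello {name})"))
```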

Chars

my-char = _"X"

[RFC] Chars are nowhere near common enough to warrant wasting a unique symbol on (like the single quote, for instance, which has several other, more important, functions in Onyx).

It was first c"X", then changed to %c"X", which follows the pattern of the other "special string literal notations". I then decided to at least give chars some special treatment, going with %"X", but after some use it looks noisy, so the underscore variant is being tested now.

Regular Expressions

my-regex = /^match-stuff.*$/i
match = my-regex =~ "does this match?"

The =~ above is of course a generic operator that can be implemented for other purposes for other types.

A consideration could be to change the syntax to prefixed-string, like Char:

my-regex = r"^match-stuff.*$"i
match = my-regex =~ "does this match?"

However, regexes serve a steady role in much network programming, which is quite common, so explicit sugar syntax for them seems warranted.

The resulting type is Regex.

List - a dynamically growing sequence (Vector, Array, Seq, Sequence, etc. in other languages)

my-list = [items, go, here]
other-list = [
    "a string"      -- commas not necessary if newlined
    47,             -- but are allowed
    1.41
    ["nested list", "ok, duh!"]
]

-- type of above is List< Str | StdInt | Real | List[Str] >

an-empty-list = [] of Int   -- empty list has to be typed (since there are no
                            -- values to infer type from)
another-empty-list = List[Int]()  -- same result as previous line

For details on List vs Array see issue on basic data types: #***XXX.

You can make Listish literals with arbitrary type also, see Set for notation.

As is obvious by now: the resulting type is List[T], where T can be a sum type.
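How a list literal ends up with a sum type as its element type can be sketched in Python. This is only a model of the inference described above: `infer_type` is a hypothetical name, the Python-to-Onyx type mapping (str to Str, int to StdInt, float to Real) is an assumption made for the illustration, and the output uses the bracket notation from the `List[Int]()` example.

```python
def infer_type(value) -> str:
    """Infer a type name for a literal value, building sum types for lists."""
    if isinstance(value, str):
        return "Str"
    if isinstance(value, int):
        return "StdInt"
    if isinstance(value, float):
        return "Real"
    if isinstance(value, list):
        if not value:
            # Mirrors the rule above: no values, nothing to infer from.
            raise ValueError("an empty list literal needs an explicit element type")
        members = []
        for item in value:
            item_type = infer_type(item)
            if item_type not in members:  # each member type appears once
                members.append(item_type)
        return "List[" + " | ".join(members) + "]"
    raise TypeError(f"unsupported literal: {value!r}")


other_list = ["a string", 47, 1.41, ["nested list", "ok, duh!"]]
print(infer_type(other_list))
```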

Tuple

my–tuple = {"foo", 1, #bar}

[RFC] It is desirable to use (items, here) notation for tuples, because braces are never used for tuples in mathematical notation. It does however make syntax a lot messier, since both expression grouping and lambda-parameters use parentheses. The current tuple notation would be much better if used for set-notation!

Set

my–set = Set{"foo", 1, #bar}

Any type can be used as prefix as long as it implements the [](ix) method; this is therefore a generic "listish" syntax. [RFC] Set unfortunately doesn't have its own literal for now (compare tuple above).

Map - Hash Map

string-keyed–hash = {"foo": 47, "bar": "bar value", "qwo": ["a", "list", "here"]}
tag-keyed-hash = {
    #foo: 47    -- commas not necessary when newlined
    #bar: "bar value"
    #qwo: [
        "a", "list"
        "here"
    ]
}

string-keyed-hash-js-style = {
  var_name: "a value"
}

some-var = #a-key
other-var = "another key"

variable-keys = {
  some_var => "some value"
  other_var => "other value"
}

-- type of above would be {Str|Tag => Str}

~~[RFC] Note, I will change the syntax for:~~ [ed: this is changed now / 2016-03-25] {key_here: value_here} - it previously parsed the same as key => val notation. It now follows the JavaScript / JSON variation: key_here is considered a literal string. This might facilitate network coding that works with JSON a lot, since you've then essentially got JSON syntax in Onyx (but strongly typed!).

I've probably forgotten something, just tell me.

stugol commented 8 years ago

You mention "camelCase" and "humpNotation". Are they different?

I'm in favour of nil-handling sugar; but I'm not sure how well it'll interact with the ? method suffix. Maybe require ?? if the method ends with ?. fn?.fn is preferable to fn?fn, in my opinion.

I'm in favour of dashed identifiers, and optional commas in arrays. I suggest both regex syntaxes.

Implicit string literals in hashes is an interesting idea.

ozra commented 8 years ago

@stugol

Ah, no, I'm used to calling it humpNotation, but saw that camelCase is more common, so I've tried sticking to that term instead, but - slip of habit.
Regarding nil-handling sugar, it's good to discuss in #21 :-)
Both regexp-notations might be worth considering, don't know what value it would add, but it would be easy to implement.

stugol commented 8 years ago

I notice %s{ ... } is non-interpolating. Is %{ ... } interpolating?

ozra commented 8 years ago

Yes that's right: %(...) etc., is for using other delimiters for the string, as in Crystal. %s(...) etc., is flagged "straight string" - no interpolation.

stugol commented 8 years ago

Good. Asterite refused to implement non-interpolated strings. Sigh.

ozra commented 8 years ago

As of today: Char-syntax changed from %"X" to _"X".

Sod-Almighty commented 8 years ago

...why?

ozra commented 7 years ago

Looking through code, it simply looked veeery noisy with the percentages. I must admit, it wasn't very thought through.

I think I'll follow the motto used up till now of expanding choices first, and then reducing options to what becomes preferred only after some time of side-by-side usage. So I'll re-introduce the old syntax for evaluation, and to continue the de facto development methodology of Onyx.

ozra commented 7 years ago

As of now: Char-syntax %"X" re-introduced. Both are now available in order to evaluate and compare.

ozra commented 7 years ago

Hmm, I still want to allow 0 - n prime symbols at the end of identifiers (I have wanted that since the beginning, but put it off again and again for fear of confusion, though I think that concern is very moot!). Hmm, better put this in its own issue first.

ozra commented 7 years ago

For previous comment: #95. Link here in regard to Lexical and Literal aspects of lang.

ozra / onyx-lang

Basic Lexical Elements and Value Literals of Onyx #9
