qwertie / ecsharp

Home of LoycCore, the LES language of Loyc trees, the Enhanced C# parser, the LeMP macro preprocessor, and the LLLPG parser generator.
http://ecsharp.net
Other
177 stars 25 forks source link

Proposal: BigInteger literals #40

Closed jonathanvdc closed 8 years ago

jonathanvdc commented 8 years ago

I'm proposing to add BigInteger literals to LES. First, I'll describe my use case, and then I'll propose a syntax for these literals.

All right, so I've recently added (theoretical) support for arbitrarily-sized integer types to Flame. There's just a small catch: the new integer types don't play nice with on-disk Flame IR, because those IR files are LES-based. And LES only supports the C# primitive types. For example, I can encode UInt64.MaxValue as a ulong literal right now, but there's just no appropriate literal type for UInt128.MaxValue at the moment. A hypothetical BigInteger literal, on the other hand, should be capable of handling any integer size.

I could solve this problem with an ugly workaround, and encode BigInteger values as byte arrays. But I'd really rather not, because that would only make reading and writing IR files harder.

So instead I propose to add built-in support for BigInteger literals to LES. These are then to be represented as normal integers, with some suffix that marks them as BigInteger values. I was thinking of using 'B' for this purpose, but any suffix will do, really.

The UInt128.MaxValue constant would, under this scheme, be encoded as 340282366920938463463374607431768211456B.

Do you think that this proposal has merit? If so, would you mind if I implemented this feature and sent you a pull request?

qwertie commented 8 years ago

Yeah, I'd like to support BigInteger (as an optional item, not mandated in the LES standard.) BTW, what's UInt128, is it your code? Loyc.Essentials currently has 128-bit math ops but no corresponding structs.

I don't think B should be used as a suffix, because it's not compatible with hex notation. Perhaps g (as in "big") or lg (as in "large") or even z (as in the set of all integers ℤ)? LES doesn't support byte arrays anyway, so yeah, BigInts are best. By the way, LESv3 is getting support for arbitrary numeric suffixes, currently using the new SpecialLiteral structure, so that unknown suffixes can round-trip. It would be smart to reprogram the lexer to use BigInteger for any integer that doesn't fit in long/ulong, regardless of suffix. It might hurt lexer performance in general, but I can accept that. (Let's see... BigInteger is missing from .NET 3.5 but Theraot.Core has it, so that's fine. Note, sooner or later I want to move to PCLs and drop .NET 3.5 at the same time.)

qwertie commented 8 years ago

You can implement it if you want - I'll keep busy with LESv3 in the meantime. One approach would be to rewrite this method in ParseHelpers and then fix all the resulting compiler errors:

static bool TryParseUInt(ref UString s, ref ulong result, int radix, ParseNumberFlag flags, out int numDigits)
// new version:
static bool TryParseUInt(ref UString s, ref BigInteger result, int radix, ParseNumberFlag flags, out int numDigits)

When it comes to truly huge numbers, this algorithm isn't efficient, though... but I don't know what the efficient algorithm is. Maybe BigInteger already has that algorithm, but it won't support embedded underscores or take a UString as input.

qwertie commented 8 years ago

P.S. I just pushed some stuff to master including the LESv3 parser.

jonathanvdc commented 8 years ago

Actually, Flame doesn't use UInt128 (or any non-primitive fixed-size integer type, for that matter). But it can compile and optimize programs that use UInt(n) or Int(n) types, where n is any positive, non-zero integer. SByte, Int16, Int32 and Int64 may be special types to the CLR, but to Flame they're just specific instances of a more general pattern. This, in turn, means that a Flame IR tree can contain integer constants that are not CLR primitive types. But I also need to encode that IR tree as LES to store it. Hence the bigints.

Note that any integer type also requires back-end support at the moment - the CLR back-end certainly doesn't accept 128-bit integers - but perhaps a lowering pass could dispense with that restriction in the future.

I like the z suffix. by the way. Seems easy enough to remember. And arbitrary numeric suffixes are music to my ears. That's actually a really cool feature.

Using BigInteger when integer literals don't fit in a long/ulong value makes sense to me, but I don't like hurting lexer performance for all integer literals to add support for a ridiculously large numbers, which are far less common.

So I have a slightly different implementation strategy in mind: have the lexer use the ulong-version of TryParseUInt first, and whenever TryParseUInt overflows, the lexer can try to parse the literal as a BigInteger, with a tweaked version of TryParseUInt. Unless, of course, the integer literal has a z suffix, in which case we'll have the lexer parse it as a bigint from the get-go. That way, we can avoid penalizing normal-sized (<= 64 bits) integer constants, and still have support for huge integer constants.

The LES printershould append a z suffix to all BigInteger constants anyway, which means that the parser won't have to parse BigInteger literals twice (which would be the case if they weren't suffixed). So this implementation shouldn't shift the performance burden to bigint parsing, at least for automatically generated code. And I suspect that few people will write integer literals by hand that are don't fit in a 64-bit number.

Would that be acceptable?

qwertie commented 8 years ago

That all sounds good to me (z is easy to remember? well, to a math wonk... or a zebra) Of course, if EC# were complete there could be one generic TryParseUInt that can produce a uint, ulong or BigInteger, but in the meantime you could copy & paste & modify.

qwertie commented 8 years ago

P.S. if you could only have one syntax, which do you like better, 123'456'789 or 123_456_789? Also, I'm thinking of dealing with numeric types as a special case of strings. Specifically, for LESv3 I find it tempting to define 123.456e7foo as equivalent to foo"123.456e7".

jonathanvdc commented 8 years ago

I like to think that 123_456_789 is more readable than 123'456'789, but I'm fine with either syntax.

About treating numeric types as strings, isn't that what the SpecialLiteral structure, which you to referred earlier, does? I like that. It gives LES consumers and produces a fine-grained level of control over how they encode literals, without requiring intermediate steps to understand what the custom literals mean.

The only downside to this that I can imagine, is that encoding custom literals as strings may be wasteful for binary formats. I don't think that's much of a show-stopper, though, and I don't see any real way around that.

jonathanvdc commented 8 years ago

BigInteger literals have been added to LES, so I'll close this issue now. Thanks!