roc-lang / roc

A fast, friendly, functional language.
https://roc-lang.org
Universal Permissive License v1.0

Implement builtin number parsing #7010

Open lukewilliamboswell opened 3 weeks ago

lukewilliamboswell commented 3 weeks ago

See the Zulip discussion for background.

Builtins

Str.dropFirstBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the first character of the new string
Str.dropLastBytes : Str, U64 -> Result Str [BadUtf8] # O(1) builtin which just has to UTF-8 validate the last character of the new string

NumParseError : [OutOfRange, NotANumber]

Num.parseUtf8 : List U8 -> Result { output : Num *, rest : List U8 } NumParseError # Should be a lowlevel so we can get different code for each number type
Num.parse : Str -> Result { output : Num *, rest : Str } NumParseError # Pure Roc wrapper around Num.parseUtf8 using Str.dropFirstBytes
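
To make the "parse a prefix, return the rest" shape of these signatures concrete, here is a minimal Rust sketch. It is an illustration only, not the proposed Roc implementation: it is fixed to unsigned 32-bit decimal integers, and the names `parse_u32_prefix` and `parse_u32_str` are hypothetical. Rust is used because the thread already borrows from the Rust standard library.

```rust
/// Mirrors the proposed NumParseError tag union.
#[derive(Debug, PartialEq)]
enum NumParseError {
    OutOfRange,
    NotANumber,
}

/// Hypothetical analogue of Num.parseUtf8, fixed to u32 for illustration.
/// Consumes a leading run of ASCII digits and returns the parsed value
/// together with the unconsumed bytes.
fn parse_u32_prefix(input: &[u8]) -> Result<(u32, &[u8]), NumParseError> {
    let digits = input.iter().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        return Err(NumParseError::NotANumber);
    }
    let mut value: u32 = 0;
    for &b in &input[..digits] {
        // Checked arithmetic turns overflow into OutOfRange instead of wrapping.
        value = value
            .checked_mul(10)
            .and_then(|v| v.checked_add(u32::from(b - b'0')))
            .ok_or(NumParseError::OutOfRange)?;
    }
    Ok((value, &input[digits..]))
}

/// Hypothetical analogue of Num.parse: a thin wrapper that drops the
/// consumed bytes from the original string. Only ASCII digits were
/// consumed, so the cut always lands on a character boundary.
fn parse_u32_str(input: &str) -> Result<(u32, &str), NumParseError> {
    let (value, rest) = parse_u32_prefix(input.as_bytes())?;
    let consumed = input.len() - rest.len();
    Ok((value, &input[consumed..]))
}
```

Under these assumptions, `parse_u32_str("42 apples")` yields the value 42 and the remainder `" apples"`, while input with no leading digit yields `NotANumber`.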

timotree3 commented 3 weeks ago

(Maybe we don't need Num.fromStr. I didn't realize we already had Str.toU32, Str.toF32, etc.)

timotree3 commented 3 weeks ago

To elaborate on how Str.dropFirstBytes and Str.dropLastBytes work: it's safe to slice a valid UTF-8 string as long as you don't cut through the middle of a character. You can tell when a cut would land mid-character because bytes in the middle of a character look different from bytes at the beginning. Bytes of the form 0xxxxxxx or 11xxxxxx always mark the beginning of a new character, and bytes of the form 10xxxxxx always continue one.
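
As a small illustration of the byte patterns just described, the following Rust sketch classifies a single byte in two ways: with the explicit range test and with the signed-comparison trick. The function names are hypothetical; the sketch only shows that the two tests agree.

```rust
/// True when the byte has the form 0xxxxxxx or 11xxxxxx, i.e. it is not a
/// continuation byte (10xxxxxx).
fn is_char_start_plain(b: u8) -> bool {
    b < 128 || b >= 192
}

/// The same test written as a single signed comparison: reinterpreted as
/// i8, continuation bytes 0x80..=0xBF become -128..=-65, so every other
/// byte is >= -64.
fn is_char_start(b: u8) -> bool {
    (b as i8) >= -0x40
}
```

For example, "é" is encoded as the two bytes 0xC3 0xA9: the first is a character start and the second is a continuation byte, so a cut between them would break the character.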

Here is some Roc pseudocode for Str.dropFirstBytes and Str.dropLastBytes:

isSafeSplitPoint : Str, U64 -> Bool
isSafeSplitPoint = \s, index ->
    when Str.toUtf8 s |> List.get index is
        Ok b -> Num.toI8 b >= -64 # This is bit magic equivalent to: b < 128 || b >= 192 (Copied from Rust stdlib `is_utf8_char_boundary`)
        Err OutOfBounds -> Bool.true # Splitting a string at a point past the end can't break apart two characters

dropFirstBytes : Str, U64 -> Result Str [BadUtf8]
dropFirstBytes = \s, n ->
    if isSafeSplitPoint s n then
        s
            |> Str.toUtf8
            |> List.dropFirst n
            |> Str.fromUtf8Unchecked # (We don't actually have Str.fromUtf8Unchecked in Roc, but this is Roc pseudocode...)
            |> Ok
    else
        Err BadUtf8

dropLastBytes : Str, U64 -> Result Str [BadUtf8]
dropLastBytes = \s, n ->
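    # Saturating subtraction: a plain `-` on U64 would overflow when n exceeds the byte length
    if isSafeSplitPoint s (Num.subSaturated (Str.countUtf8Bytes s) n) then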
        s
            |> Str.toUtf8
            |> List.dropLast n
            |> Str.fromUtf8Unchecked # (We don't actually have Str.fromUtf8Unchecked in Roc, but this is Roc pseudocode...)
            |> Ok
    else
        Err BadUtf8
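
For comparison, here is a runnable Rust sketch of the same idea. It is an analogue under stated assumptions, not the proposed builtin: it leans on the standard library's `str::is_char_boundary`, the `BadUtf8` struct and both function names are stand-ins for illustration, and counts past the end of the string are clamped so that dropping too many bytes yields an empty string, matching the pseudocode's treatment of out-of-bounds indices.

```rust
/// Stand-in for the [BadUtf8] error tag.
#[derive(Debug, PartialEq)]
struct BadUtf8;

/// Drops the first `n` bytes, failing if that would cut a character in half.
/// No bytes are copied: the result borrows from the input.
fn drop_first_bytes(s: &str, n: usize) -> Result<&str, BadUtf8> {
    // Clamp so that n larger than the byte length yields "".
    let start = n.min(s.len());
    if s.is_char_boundary(start) {
        Ok(&s[start..])
    } else {
        Err(BadUtf8)
    }
}

/// Drops the last `n` bytes, failing if that would cut a character in half.
fn drop_last_bytes(s: &str, n: usize) -> Result<&str, BadUtf8> {
    // Saturating subtraction avoids underflow when n exceeds the byte length.
    let end = s.len().saturating_sub(n);
    if s.is_char_boundary(end) {
        Ok(&s[..end])
    } else {
        Err(BadUtf8)
    }
}
```

With the 6-byte string "héllo", dropping 1 or 3 leading bytes succeeds, while dropping 2 fails because the cut would land between the two bytes of "é".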