roc-lang / roc

A fast, friendly, functional language.
https://roc-lang.org
Universal Permissive License v1.0
4.46k stars 313 forks

Add `Str.fromUtf8WithFallback` to the builtins #7117

Open lukewilliamboswell opened 1 month ago

lukewilliamboswell commented 1 month ago

Refer to this Zulip discussion for more context.

Str.fromUtf8WithFallback : List U8 -> Str

This will replace any byte sequence that is not valid UTF-8, and so cannot be represented in a string, with the Unicode replacement character (U+FFFD, �).
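For comparison, Rust's standard library offers the same kind of conversion as `String::from_utf8_lossy`. The sketch below only illustrates the replacement behaviour on one sample input; the wrapper name `decode_with_fallback` is hypothetical, and whether the Roc builtin would follow exactly the same replacement rules is an assumption, not something this issue specifies.

```rust
/// Hypothetical wrapper (not part of Roc or this issue) around Rust's
/// `String::from_utf8_lossy`, used here only to show the intended semantics.
fn decode_with_fallback(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // Valid UTF-8 passes through unchanged.
    assert_eq!(decode_with_fallback(b"Hello"), "Hello");

    // `F0 90 80` is a truncated 4-byte sequence, so it is replaced by U+FFFD.
    let decoded = decode_with_fallback(b"Hello \xF0\x90\x80World");
    assert_eq!(decoded, "Hello \u{FFFD}World");

    println!("{}", decoded);
}
```

Unlike a `Result`-returning conversion, this shape never fails, which matches the proposed `List U8 -> Str` signature.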

HazemAbdo commented 1 month ago

@lukewilliamboswell I am interested in this task, even though I have only basic theoretical knowledge of compilers and language design. I hope this issue will be my first contribution to this field.

HazemAbdo commented 1 month ago

I would like to do it under the Hacktoberfest label and will start when Hacktoberfest begins.

Anton-4 commented 1 month ago

Thanks for helping out @HazemAbdo!

sajjaduc commented 1 month ago

I would like to help here as well, @HazemAbdo, if possible.

I was reading up on Unicode's supplementary planes and also checking whether Zig's standard library (`utf8Decode`) can help us here, instead of running a loop that tracks bytes and states by hand. This is my first time diving into the internals of Roc as well. If possible, can we start a Zulip thread for this? (So far, while playing around with the internals, the tests on the main branch are failing for me.) If nothing else, I would like to shadow you on this and learn.

test result: FAILED. 133 passed; 10 failed; 2 ignored; 0 measured; 0 filtered out; finished in 50.15s

sajjaduc commented 1 month ago

I was playing around with Zig to implement a simple function and test it against the examples provided for the Rust function.

Here is a simple way to achieve what we want (I think). I want to compare its performance against decoding the bytes manually, rather than relying on code points. Before any optimisations, though, I would ideally like to implement an end-to-end solution for Roc.

const std = @import("std");

pub fn fromUtf8Lossy(allocator: std.mem.Allocator, bytes: []const u8) ![]u8 {
    var result = std.ArrayList(u8).init(allocator);
    errdefer result.deinit();

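    // Note: Utf8View.initUnchecked assumes valid UTF-8, so its iterator never
    // reports invalid input; each sequence is validated explicitly instead.
    var i: usize = 0;
    while (i < bytes.len) {
        // Sequence length implied by the lead byte; 0 marks an invalid lead byte.
        const len: usize = std.unicode.utf8ByteSequenceLength(bytes[i]) catch 0;
        const valid = len != 0 and i + len <= bytes.len and
            std.unicode.utf8ValidateSlice(bytes[i .. i + len]);
        if (valid) {
            try result.appendSlice(bytes[i .. i + len]);
            i += len;
        } else {
            // If we encounter an invalid sequence, append the replacement character
            try result.appendSlice("�");
            // Skip the invalid byte
            i += 1;
        }
    }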

    return result.toOwnedSlice();
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const bytes = "Hello \xF0\x90\x80World".*;
    const result = try fromUtf8Lossy(allocator, &bytes);
    defer allocator.free(result);

    std.debug.print("Converted string: {s}\n", .{result});
}
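As a point of reference for the "decode the bytes manually" variant, here is a minimal Rust sketch that drives the loop from `std::str::from_utf8` and the `valid_up_to` / `error_len` fields of `Utf8Error`. The function name `lossy_decode_manual` is hypothetical, and this is not the implementation proposed for Roc. It emits one U+FFFD per invalid sequence reported by the validator, which is the convention `String::from_utf8_lossy` follows.

```rust
/// Hypothetical helper: lossy UTF-8 decoding built on `std::str::from_utf8`.
fn lossy_decode_manual(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match std::str::from_utf8(input) {
            Ok(valid) => {
                // The remaining input is entirely valid.
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to` is valid UTF-8.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(std::str::from_utf8(valid).expect("prefix is valid UTF-8"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep going.
                    Some(len) => input = &rest[len..],
                    // Input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    let inputs: [&[u8]; 3] = [b"Hello", b"Hello \xF0\x90\x80World", b"abc\xF0\x90"];
    for input in inputs {
        let manual = lossy_decode_manual(input);
        // The manual loop should agree with the standard library.
        assert_eq!(manual, String::from_utf8_lossy(input));
        println!("{}", manual);
    }
}
```

Under this convention the truncated sequence `F0 90 80` yields a single replacement character rather than one per byte; which of the two the Roc builtin should pick is a design choice to settle separately.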

sajjaduc commented 1 month ago

I was able to resolve the issues with the tests, thanks to @Anton-4.

@HazemAbdo, I hope this helps if you are on a Mac:

Zulip thread

I will wait to hear from you, @HazemAbdo, before proceeding any further. This is the kind of issue that has really called out to me for some reason, but I am happy for you to take the lead. I just want to complete the whole cycle and learn how everything gets stitched together end to end, from Zig to Roc.

HazemAbdo commented 1 month ago

@sajjaduc I will start investigating this issue and your work next Saturday. It would be great if you could wait for me, as I want to learn from your solution and discuss it with you.