Open lukewilliamboswell opened 1 month ago
@lukewilliamboswell I am interested in this task, even though I have only basic theoretical knowledge of compilers and language design. I hope this issue will be my first contribution to this field.
I want to do it under the Hacktoberfest label and will start when Hacktoberfest begins.
Thanks for helping out @HazemAbdo!
I want to help here as well @HazemAbdo if possible.
I was reading up on Unicode's supplementary planes and looking at whether Zig's standard library (utf8Decode) can help us here, instead of running a loop that tracks bytes and states by hand. This is my first time diving into the internals of Roc as well. If possible, can we start a Zulip thread for this? (So far, while playing around with the internals, the tests on the main branch are failing for me.) If nothing else, I would like to shadow you on this and learn.
test result: FAILED. 133 passed; 10 failed; 2 ignored; 0 measured; 0 filtered out; finished in 50.15s
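For reference, here is a minimal sketch of what leaning on `std.unicode` could look like for a supplementary-plane character. It assumes a Zig version where `utf8ByteSequenceLength(u8)` and `utf8Decode([]const u8)` have these signatures; `decodeFirst` is a hypothetical helper name, not something from Roc or the Zig standard library.

```zig
const std = @import("std");

/// Hypothetical helper: decodes the first code point of `bytes`, or returns
/// null if the leading sequence is invalid or truncated.
fn decodeFirst(bytes: []const u8) ?u21 {
    if (bytes.len == 0) return null;
    // The lead byte determines how many bytes the sequence should have.
    const len = std.unicode.utf8ByteSequenceLength(bytes[0]) catch return null;
    // A truncated sequence must be rejected before slicing.
    if (len > bytes.len) return null;
    const cp = std.unicode.utf8Decode(bytes[0..len]) catch return null;
    return cp;
}

pub fn main() void {
    // U+10000 is the first code point in the supplementary planes (4 bytes).
    if (decodeFirst("\xF0\x90\x80\x80")) |cp| {
        std.debug.print("decoded U+{X}\n", .{cp});
    }
    // The same sequence with its last byte missing is not valid UTF-8.
    if (decodeFirst("\xF0\x90\x80") == null) {
        std.debug.print("truncated sequence rejected\n", .{});
    }
}
```

The length check matters because `utf8Decode` expects a slice whose length matches the lead byte, so the caller has to rule out truncated input first.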
I was playing around with Zig to implement a simple function to test the examples provided in the Rust function.
And here's a simple way to achieve what we want (I think). I want to test its performance against decoding the bytes manually, rather than relying on code points. Before that, though, I would ideally like to implement a solution end to end for Roc, ahead of any optimisations.
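const std = @import("std");

/// Copies `bytes`, replacing every invalid byte with U+FFFD. Caller owns the result.
pub fn fromUtf8Lossy(allocator: std.mem.Allocator, bytes: []const u8) ![]u8 {
    var result = std.ArrayList(u8).init(allocator);
    errdefer result.deinit();

    var i: usize = 0;
    while (i < bytes.len) {
        // Utf8View.initUnchecked assumes valid input, so its iterator cannot
        // detect bad sequences; check each sequence explicitly instead.
        if (validSequenceLen(bytes[i..])) |len| {
            try result.appendSlice(bytes[i .. i + len]);
            i += len;
        } else {
            // If we encounter an invalid sequence, append the replacement character
            try result.appendSlice("\u{FFFD}");
            // Skip the invalid byte
            i += 1;
        }
    }
    return result.toOwnedSlice();
}

/// Returns the length of the valid UTF-8 sequence at the start of `bytes`
/// (which must be non-empty), or null if it is invalid or truncated.
fn validSequenceLen(bytes: []const u8) ?usize {
    const len = std.unicode.utf8ByteSequenceLength(bytes[0]) catch return null;
    if (len > bytes.len) return null;
    _ = std.unicode.utf8Decode(bytes[0..len]) catch return null;
    return len;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // "\xF0\x90\x80" is a truncated 4-byte sequence, so it is invalid UTF-8.
    const bytes = "Hello \xF0\x90\x80World".*;
    const result = try fromUtf8Lossy(allocator, &bytes);
    defer allocator.free(result);
    std.debug.print("Converted string: {s}\n", .{result});
}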
I was able to resolve the issues with the tests, thanks to @Anton-4.
@HazemAbdo I hope this helps, if you are on Mac
I will wait to hear from you @HazemAbdo before proceeding any further. This is the kind of issue that has really called out to me for some reason, but I am happy for you to take the lead. I just want to complete the whole cycle and learn how everything gets stitched together end to end, from Zig to Roc.
@sajjaduc I will start investigating this issue and your work next Saturday. It would be great if you could wait for me, as I want to learn from your solution and discuss it with you.
Refer to this Zulip discussion for more context.
This will replace anything that can't be represented in a string with the Unicode Replacement Character (U+FFFD) �
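As a small illustration of what that substitution costs in bytes, the sketch below encodes U+FFFD with `std.unicode.utf8Encode`. It assumes a Zig version where that function takes a `u21` and an output slice; `encodeReplacement` is a hypothetical helper name used only for this example.

```zig
const std = @import("std");

/// Hypothetical helper: writes the UTF-8 encoding of U+FFFD into `buf`
/// and returns the slice that was written.
fn encodeReplacement(buf: *[4]u8) ![]const u8 {
    const len = try std.unicode.utf8Encode(0xFFFD, buf);
    return buf[0..len];
}

pub fn main() !void {
    var buf: [4]u8 = undefined;
    const encoded = try encodeReplacement(&buf);
    std.debug.print("U+FFFD -> {d} bytes:", .{encoded.len});
    for (encoded) |b| std.debug.print(" {X:0>2}", .{b});
    std.debug.print("\n", .{});
    // prints "U+FFFD -> 3 bytes: EF BF BD"
}
```

Because the replacement character takes three bytes, a lossy conversion that substitutes it for single invalid bytes can produce output up to three times the length of the input, which is worth keeping in mind when sizing buffers.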