leandromoh / RecordParser

Zero Allocation Writer/Reader Parser for .NET Core
MIT License

[Question] Read 2D Byte Array CSV #103

Open smj2350 opened 2 weeks ago

smj2350 commented 2 weeks ago

Hello,

I’m currently working with grayscale data stored in a CSV file, represented as a 2D byte array with dimensions 256 x 23156. I want to parse this data effectively using RecordParser, but I couldn’t find a suitable approach with the existing features.

Is there an efficient way to convert each row into a byte[] or parse the entire CSV file into a 2D array (such as byte[,] or byte[][])?

I’d like to inquire if there are better approaches for handling data of this size or if there are any additional features in RecordParser that could support this use case. I would also like to request support for directly handling 2D arrays if possible.

Thank you.

leandromoh commented 1 week ago

Hello @smj2350

Is there an efficient way to convert each row into a byte[] or parse the entire CSV file into a 2D array (such as byte[,] or byte[][])?

Yes, of course. If the number of columns is known, things are simple: we just need a factory that builds the array for each mapped line. The most awkward part is mapping the columns, because for some reason .NET does not support writing the mapping expression directly, so we need to build it manually.

Does the example below help you? demonstration here

using System;
using System.IO;
using System.Linq.Expressions;
using RecordParser.Builders.Reader;
using RecordParser.Parsers;
using RecordParser.Extensions;

var columnCount = 5;
var linesCount = 3;
var array = new byte[linesCount][];

var content = """
    1,0,1,0,1
    0,1,0,1,0
    1,1,1,1,1
    """;

// StringReader is used because in this example the content is a string,
// but it could be a StreamReader or any other TextReader
using TextReader source = new StringReader(content);
var parser = BuildReader(columnCount);

var readOptions = new VariableLengthReaderOptions
{
    HasHeader = false,
    ParallelismOptions = new() { Enabled = false }, // set to true to read the lines in parallel
    ContainsQuotedFields = false,
};

var lines = source.ReadRecords(parser, readOptions);

var i = 0;
foreach (var line in lines)
    array[i++] = line;

Display(array);

static void Display(byte[][] array)
{
    foreach (var line in array)
    {
        foreach (var column in line)
            Console.Write(column + " ");

        Console.WriteLine();
    }
}

static IVariableLengthReader<byte[]> BuildReader(int columnCount)
{
    var builder = new VariableLengthReaderSequentialBuilder<byte[]>();

    for (var i = 0; i < columnCount; i++)
        // map each array position to be assigned (just like a regular field; in fact the expression here
        // does not need to be a field/property, just any assignable expression)
        builder.Map(BuildExpression(i));

    var reader = builder.Build(",", factory: () => new byte[columnCount]);

    return reader;
}

// builds the lambda: array => array[i]
static Expression<Func<byte[], byte>> BuildExpression(int i)
{
    var arrayExpr = Expression.Parameter(typeof(byte[]));
    var indexExpr = Expression.Constant(i);
    var arrayAccessExpr = Expression.ArrayAccess(arrayExpr, indexExpr);

    return Expression.Lambda<Func<byte[], byte>>(arrayAccessExpr, arrayExpr);
}
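
If a rectangular byte[,] is specifically needed (as asked in the question), one option is to copy the jagged rows into one after reading. A minimal sketch, assuming every row has exactly columnCount values and reusing the variables from the example above:

// copy the jagged result into a rectangular array, if that shape is required
var matrix = new byte[linesCount, columnCount];

for (var row = 0; row < linesCount; row++)
    for (var col = 0; col < columnCount; col++)
        matrix[row, col] = array[row][col];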

I’d like to inquire if there are better approaches for handling data of this size or if there are any additional features in RecordParser that could support this use case.

What I can think of is that you could use an ArrayPool to save some memory. You could use it in the factory to create the array for each line, and after you are done with the byte[][] you could do something like foreach (var line in array) ArrayPool<byte>.Shared.Return(line). For large, short-lived arrays, the array pool is pretty nice.
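
A minimal sketch of the two places in the example above that would change, assuming System.Buffers is available (note that ArrayPool<byte>.Shared.Rent may return an array longer than requested, so consumers should still read only the first columnCount bytes of each row):

using System.Buffers;

// rent each row from the shared pool instead of allocating a fresh array
var reader = builder.Build(",", factory: () => ArrayPool<byte>.Shared.Rent(columnCount));

// ... read and use the rows as in the example above ...

// return every rented row once the data is no longer needed
foreach (var line in array)
    ArrayPool<byte>.Shared.Return(line);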

I would also like to request support for directly handling 2D arrays if possible.

I do not know if it would be worth it. Mapping arrays is unusual, and 2D arrays even more so; each scenario is very particular, especially regarding the factory. If there is demand in the future I can add it. For now, what I suggest is wrapping the code in an extension method or something similar.
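
As a rough illustration of that suggestion, here is a hypothetical extension method; the name ReadByteRows and its shape are not part of RecordParser, it just wraps the builder logic from the example above:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq.Expressions;
using RecordParser.Builders.Reader;
using RecordParser.Extensions;

public static class ByteCsvExtensions
{
    // hypothetical helper, not part of the library: reads each CSV line as a byte[] row
    public static IEnumerable<byte[]> ReadByteRows(this TextReader source, int columnCount)
    {
        var builder = new VariableLengthReaderSequentialBuilder<byte[]>();

        for (var i = 0; i < columnCount; i++)
            builder.Map(BuildExpression(i)); // array => array[i], as in the example above

        var parser = builder.Build(",", factory: () => new byte[columnCount]);

        var readOptions = new VariableLengthReaderOptions
        {
            HasHeader = false,
            ContainsQuotedFields = false,
        };

        return source.ReadRecords(parser, readOptions);
    }

    static Expression<Func<byte[], byte>> BuildExpression(int i)
    {
        var arrayExpr = Expression.Parameter(typeof(byte[]));
        var arrayAccessExpr = Expression.ArrayAccess(arrayExpr, Expression.Constant(i));
        return Expression.Lambda<Func<byte[], byte>>(arrayAccessExpr, arrayExpr);
    }
}

Usage would then look like: using var file = new StreamReader(path); then foreach (var row in file.ReadByteRows(256)) { ... }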