morelinq / MoreLINQ

Extensions to LINQ to Objects
https://morelinq.github.io/
Apache License 2.0
3.67k stars 411 forks source link

ToArray2D (two-dimensional array from a sequence) #506

Open atifaziz opened 6 years ago

atifaziz commented 6 years ago

I propose to add an extension that allows creation of a two-dimensional array given a sequence and a count of columns.

Prototype

public static T[,] ToArray2D<T>(this IEnumerable<T> source, int columns)
{
    if (columns == 0)
        return new T[0, 0];

    var array = source.ToArray();

    var rows = array.Length / columns + (array.Length % columns == 0 ? 0 : 1);
    var result = new T[rows, columns];
    for (var i = 0; i < array.Length; i++)
        result[i / columns, i % columns] = array[i];
    return result;
}

Examples

Example 1 where columns = 0

var xs = Enumerable.Range(1, 10).ToArray2D(0);

will be equivalent to new int[0, 0].

Example 2 where columns = 2

var xs = Enumerable.Range(1, 10).ToArray2D(2);

will be equivalent to:

var xs = new int[,]
{
    { 1, 2  },
    { 3, 4  },
    { 5, 6  },
    { 7, 8  },
    { 9, 10 },
};

Example 3 where columns = 3

var xs = Enumerable.Range(1, 10).ToArray2D(3);

will be equivalent to:

var xs = new int[,]
{
    {  1, 2, 3 },
    {  4, 5, 6 },
    {  7, 8, 9 },
    { 10, 0, 0 },
};
fsateler commented 6 years ago

I have some questions:

  1. What would be some use cases for this?
  2. Why is this operator permissive instead of strict? I mean that it accepts source sequences where Count is not a multiple of columns. Most MoreLINQ operators are stricter.
atifaziz commented 6 years ago

What would be some use cases for this?

Some environments, like Excel, use 2D arrays for data transfer. While multi-dimensional arrays in .NET have a built-in implementation for iteration as a sequence, there is no method to go the other way.

Why is this operator permissive instead of strict? Most MoreLINQ operators are stricter.

We can debate the pros and cons but as you said, it's most, not all. I've also had second thoughts about operators where we decided to be strict but where the argument validation is done later, during iteration. In hindsight, the behaviour should have been left undefined. We have two options. Make it strict or add an overload where a function that, given a row and a column, provides a filler. If someone wants strict semantics, they can throw from that function. Thoughts?

fsateler commented 6 years ago

What would be some use cases for this?

Some environments, like Excel, use 2D arrays for data transfer. While multi-dimensional arrays in .NET have a built-in implementation for iteration as a sequence, there is no method to go the other way.

OK, but why would I be operating on a sequence that is really a matrix? Or put it another way, how do I end up with a matrix represented as a single sequence?

We have two options. Make it strict or add an overload where a function that, given a row and column, provides a filler. If someone wants strict semantics, they can throw from that function. Thoughts?

I think the overload with the filler is a good compromise. I would still leave the default to be a throwing version, though.

I've also had second thoughts about operators where we decided to be strict but where the argument exception is thrown later, during iteration.

This operator is not delayed so this problem does not appear here.


Further thought makes me think this is mixing two separate problems: break a sequence into parts, and then transforming that into a 2D array. We already have operators for the latterfirst, so I guess we could have only the second part:

public static T[,] To2DArray(IEnumerable<IEnumerable<T>> rows, int columns);

This can be paired with Batch to achieve the same as the originally proposed method.

Thoughts?

atifaziz commented 6 years ago

OK, but why would I be operating on a sequence that is really a matrix? Or put it another way, how do I end up with a matrix represented as a single sequence?

Cheating… 😊

var data =
    from matrix in new[]
    {
        new object[,]
        {
            { new DateTime(2018, 1, 5), 87.66, 88.41, 87.43, 88.19, 87.40, 23_407_100 },
            { new DateTime(2018, 1, 4), 86.59, 87.66, 86.57, 87.11, 86.33, 21_912_000 },
            { new DateTime(2018, 1, 3), 86.06, 86.51, 85.97, 86.35, 85.58, 26_061_400 },
            { new DateTime(2018, 1, 2), 86.13, 86.31, 85.50, 85.95, 85.18, 22_483_800 },
        }
    }
    from row in ((IEnumerable) matrix).Cast<object>().Batch(7)
    select row.Fold((date, open, hi, lo, close, _, vol) => new
    {
        Date   = (DateTime) date,
        Open   = (double)   open,
        High   = (double)   hi,
        Low    = (double)   lo,
        Close  = (double)   close,
        Volume = (int)      vol,
    });

This operator is not delayed so this problem does not appear here.

True except I am not sure that eager or lazy evaluation really makes a difference. Argument checking should take place before the main operation of the method. Anything during the operation really qualifies as undefined behaviour or should surface as a custom exception if it's important to distinctly communicate the nature of the error.

Further thought makes me think this is mixing two separate problems: break a sequence into parts, and then transforming that into a 2D array…Thoughts?

That's an interesting way to look at it. So you mean, like this?

public static T[,] ToArray2D<T>(this IEnumerable<IEnumerable<T>> source, int columns)
{
    if (columns == 0)
        return new T[0, 0];

    var rows = source.ToArray();
    var result = new T[rows.Length, columns];
    var y = 0;
    foreach (var row in rows)
    {
        var x = 0;
        foreach (var cell in row)
            result[y, x++] = cell;
        y++;
    }
    return result;
}

It does introduce a couple of corner cases one needs to think about when the batch size doesn't match the column count. When batch size is smaller, as in Enumerable.Range(1, 10).Batch(2).ToArray2D(3), you get back:

new int[,]
{
    {  1,  2, 0 },
    {  3,  4, 0 },
    {  5,  6, 0 },
    {  7,  8, 0 },
    {  9, 10, 0 },
}

If it's bigger, the implementation fails with an IndexOutOfRangeException. I guess this is correct if you want to be strict but then the case of short rows should also throw.