morelinq / MoreLINQ

Extensions to LINQ to Objects
https://morelinq.github.io/
Apache License 2.0
3.69k stars 415 forks source link

Spill head projection to remaining sequence #752

Open atifaziz opened 4 years ago

atifaziz commented 4 years ago

I propose to add an extension that takes one or more elements at the head of a sequence and spills a projection to remaining elements of the sequence.

Sometimes you have a sequence where the initial element(s) contains information about the processing of the rest of the sequence. A typical example is a table (think CSV) where the table is composed of rows; where the first row is a header and the remaining the data rows. Processing such a sequence should only generate a projection of the data rows.

The signature would be as follows:

public static IEnumerable<R>
    SpillSpan<T, M, A, H, R>(
        this IEnumerable<T> source,
        Func<T, (bool, M)> chooser,
        A empty,
        Func<M, A> seeder,
        Func<A, M, A> accumulator,
        Func<A, H> headerSelector,
        Func<H, T, R> resultSelector)

The chooser identifies header elements. Data rows commence as soon as it returns (false, _) for a T. The header elements are accumulated via accumulator and the seeder is used to seed the accumulation with the initial header element. The empty value is used for the headless case. The headerSelector function is used to create a single projection out of the accumulated header elements and which is subsequently paired with or spilled to remaining elements.

I propose to add overloads for simpler cases.

SpillSpan should never throw an exception. If the user wants to ban the headless case, he/she can throw in headerSelector upon receiving the empty value.

Example

const string csv = @"
    a,c,b
    1,3,2
    4,6,5
    7,9,8";

data result =
    from rows in new[]
    {
        from line in Regex.Split(csv.Trim(), @"\r?\n")
        select line.Split(',').Select(f => f.Trim()).ToArray()
    }
    from row in
        rows.Index()
            .SpillSpan(
                r => r.Key == 0,
                null,
                r => r.Value,
                (r, _) => r,
                h => MoreEnumerable.Return(h.Index().ToDictionary(e => e.Value, e => e.Key)) // name => index
                                   .SelectMany(_ => new[] { "a", "b", "c" }, (m, n) => m[n]) // lookup index of name
                                   .ToArray(),
                (bs, r) => bs.Select(i => int.Parse(r.Value[i], CultureInfo.InvariantCulture))
                             .Fold((a, b, c) => new { A = a, B = b, C = c }))
    select row;

foreach (var row in data)
    Console.WriteLine(row);

Output:

{ A = 1, B = 2, C = 3 }
{ A = 4, B = 5, C = 6 }
{ A = 7, B = 8, C = 9 }
atifaziz commented 4 years ago

Based on work so far in PR #753, I feel it's best to rename the extension to SpillHead. The simplest overload will spill the first element of the sequence to the second and beyond so in the case of one element representing the head, span seems confusing. However, head works well whether you have a single element or several making up the “head” of the sequence.

declard commented 1 year ago

Is it essentially this, but working in constant space?

public static IEnumerable<R>
    SpillSpan<T, P, R>(
        this IEnumerable<T> source,
        Func<T, bool> predicate,
        Func<T, P> prefixMapping,
        Func<P, T, R> resultSelector)
{
    var (prefix, rest) = source.Span(predicate);
    var mappedPrefix = prefixMapping(prefix);
    return rest.Select(e => resultSelector(mappedPrefix, e));
}

// applied to a predicate `predicate` and an enumerable `source`, returns a tuple where first element is longest prefix (possibly empty) of `source` of elements that satisfy `predicate` and second element is the remainder of the enumerable
// >>> span (< 3) [1,2,3,4,1,2,3,4]
// ([1,2],[3,4,1,2,3,4])
public static (IReadOnlyCollection<T> Prefix, IReadOnlyCollection<T> Rest) Span<T>(this IEnumerable<T> source, Func<T, bool> predicate);

The A empty+Func<M, A> seeder+Func<A, M, A> accumulator+Func<A, H> headerSelector logic can be made external and encapsulated into Func<T, P> prefixMapping. chooser handles both predication and mapping, so the mapping part can be moved into prefixMapping too.

atifaziz commented 1 year ago

Not quite like span/Parition, because here the proposal is to consume in a streaming fashion. So while the head/header is collected and projected to help process the remainder, the remainder of the sequence (assuming they are the data rows) is streamed. If you have billions of rows, this operator will lazily only consume what's needed. It can be combined with other streaming operators to avoid committing the entire source to memory.