morelinq / MoreLINQ

Extensions to LINQ to Objects
https://morelinq.github.io/
Apache License 2.0
3.67k stars 412 forks source link

Proposition | New form of Zip: MultiZip #1077

Open the-black-wolf opened 1 week ago

the-black-wolf commented 1 week ago

@atifaziz, as instructed

I used this in financial analytics. Its a form of a single Reduce step of Map/Reduce pattern. it implements Zip but on a variable number of sources, with the caveat that they are all of the same type.

Current values of all source streams are packed in an array and send to the caller suppled resultSelector reducer which will process them into TResult.

It takes several parameters:

public static IEnumerable<TResult> MultiZip<TSource, TResult>(
    IEnumerable<IEnumerable<TSource>> sourceList,
    Func<TSource[], bool[], TResult> resultSelector,
    MultiZipMissingSource missingSourceAction = MultiZipMissingSource.Remove,
    MultiZipMode operatingMode = MultiZipMode.Shortest,
    bool immutableResultSource = false)

sourceList is the list of streams. resultSelector is the reducer, missingSourceAction and operatingMode define how the cycling will be done (see below). Ultimately I've taken the naming scheme you used in the three separate Zip* methods with variable-type-fixed-streams you already have. Defaults to removing missing sources and doing the Shortest (exit on first exhaust) mode. immutableResultSource controls allocation by reusing the results arrays, which can make a difference in a scenario with large number of sources. For most uses it can be reused, if caller wants to store arrays somewhere, they can flip this to true. resultSelector is also supplied with a bool-map of which elements were sourced from streams and which are default(TSource) for exhausted streams in Longest mode of operation.

Explanation of enums:

        /// <summary>
        /// Determines the behavior of MultiZip extension when the list of sources contains a missing (null) source.
        /// </summary>
        public enum MultiZipMissingSource
        {
            /// <summary>
            /// Removes all missing (null) sources from the sourceList. This will affect the ordering of values sent to resultSelector. This is the default MultiZip behavior.
            /// </summary>
            Remove,
            /// <summary>
            /// Ignore missing (null) sources, treat them as empty, serving default values and indicating an exhausted source. Keep in mind that missing sources will prevent processing if exhausted action is break or error. 
            /// </summary>
            Ignore,
            /// <summary>
            /// Throws InvalidOperationException if any of the sources is missing (null). Will not throw exception on valid but empty sources.
            /// </summary>
            Error
        }

        /// <summary>
        /// Determines the behavior of MultiZip when one or more sources are exhausted. Effectively implements Shortest, Longest and Equi variants of the MultiZip implementation.
        /// </summary>
        public enum MultiZipMode
        {
            /// <summary>
            /// Stops processing as soon as any of the sources is exhausted. The boolean map sent to resultSelector will always contain `true` values and can be ignored. This is the default MultiZip behavior and is the same as MultiZip-Shortest.
            /// </summary>
            Shortest,
            /// <summary>
            /// Continues processing until all sources are exhausted. Exhausted sources will continue serving the default value for T, please use the boolean map sent to resultSelector to determine which values are sourced and which are default. This behavior is the same as MultiZip-Longest.
            /// </summary>
            Longest,
            /// <summary>
            /// Will throw InvalidOperationException if any (but not all) sources are exhausted. If all sources are exhausted at the same time, it simply ends processing. The boolean map sent to resultSelector will always contain `true` values and can be ignored. This behavior is the same as MultiZip-Equi.
            /// </summary>
            Equi
        }

The version I have does not support async, but it can be made. Also, several overrides can be made to facilitate different forms of result selector.

Ok, let me know. I can post this relatively quickly, I already have the code, I just need to write some tests.

the-black-wolf commented 1 week ago

Btw, hwo come you chose to use "Equi" word in Zip? Why not "Equal" or some such?

viceroypenguin commented 1 week ago

This is already available as the operator Transpose.

the-black-wolf commented 1 week ago

Not really the same thing, though in theory you could use MultiZip to transpose (even fully transpose with default placements) by yielding source array, its not its primary goal. MultiZip zips variable number of streams into one stream

atifaziz commented 1 week ago

Btw, hwo come you chose to use "Equi" word in Zip? Why not "Equal" or some such?

That was a very, very long time ago. See the following issues for historical context & discussions on Zip methods:

IIRC, I took inspiration from a similar name being used/proposed in Python. While Python eventually took another route and added it as an option to zip, there was some consideration given to adding zip_strict as a distinct function. PEP 618 is from 2020 whereas we had discussions going back almost another 10 years before that.

This is already available as the operator Transpose.

Not really the same thing, though in theory you could use MultiZip to transpose (even fully transpose with default placements) by yielding source array, its not its primary goal.

For background and discussions, see:

MultiZip zips variable number of streams into one stream

And Transpose does that too. It acquires all the iterators for all the streams at the beginning:

https://github.com/morelinq/MoreLINQ/blob/088df950da8f5539343147d6da27a122bf8e0ccb/MoreLinq/Transpose.cs#L64

and then yields an array of elements gathered from each before proceeding with the next:

https://github.com/morelinq/MoreLINQ/blob/088df950da8f5539343147d6da27a122bf8e0ccb/MoreLinq/Transpose.cs#L94

It does this until all iterators are exhausted:

https://github.com/morelinq/MoreLINQ/blob/088df950da8f5539343147d6da27a122bf8e0ccb/MoreLinq/Transpose.cs#L90-L91

The one difference is that a zip-like method will generally stop at the shortest sequence.

I find that it helps to discuss with code examples. Do you have a case of MultiZip where the result would be different than Transpose (assuming the simplest case for now)?

immutableResultSource controls allocation by reusing the results arrays, which can make a difference in a scenario with large number of sources.

This doesn't compose well and can be a source of surprises and bugs so it's best to design this differently. Have a look at PR #856, which added a distinct version of Batch that permits reusing arrays.

the-black-wolf commented 1 week ago

Well, the transpose does not have a resultSelector, or am I missing something? you would still need to feed this stream into another enumerator to reduce the transposed rows based on custom logic. Transpose returns IEnumerable<IEnumerable<TSource>>, MultiZip returns IEnumerable<TResult>

Transpose would also not let me know which values came through and which streams exhausted. So, yea, transpose would return [[10, 20, 30], [11, 31], [32]] from the documented code mentioned above, but MultiZip in Longest mode would return [[10, 20, 30], [11, 0, 31], [0, 0, 32]] and would also return [[true, true, true], [true, false, true], [false, false, true]] to tell you which values were actually sourced and from which stream position.

Well, one of the examples where we used this was loan transaction processing where it was fed account payment streams grouped by month/year and produced a stream of monthly analitics. It had about 30000 input streams and produced one (as I sad its a from of a reduce pattern). Hence the reusable result array (but we can make it default true if that is a concern, although unless you are projecting the result array into TResult itself it does not really matter if its reused, the mutliZip itself does not project those rows like Transpose does). This is a heavy duty cruncher by design where variable number of sources would be generated dynamically. We mostly used it in Longest mode. This is not really for a small number of fixed manually generated streams.

Btw, Strict does sound better, imho.

viceroypenguin commented 1 week ago

Well, the transpose does not have a resultSelector, or am I missing something? you would still need to feed this stream into another enumerator to reduce the transposed rows based on custom logic. Transpose returns IEnumerable<IEnumerable<TSource>>, MultiZip returns IEnumerable<TResult>

Technically, resultSelector is just a shortcut for Transpose().Select()...

atifaziz commented 1 week ago

Suppose the following:

var matrix = new[]
{
    new[] { 10, 11 },
    new[] { 20 },
    new[] { 30, 31, 32 },
    new[] { 40, 41, 42, 43, 44, 45, 46 },
};

Transpose would also not let me know which values came through and which streams exhausted. So, yea, transpose would return [[10, 20, 30], [11, 31], [32]] from the documented code mentioned above, but MultiZip in Longest mode would return [[10, 20, 30], [11, 0, 31], [0, 0, 32]]

matrix.Transpose().TakeWhile(col => col.Count() == 4) will give you the same behaviour as Zip where it stops yielding as soon as one of the source streams is exhausted.

matrix.Select(row => row.Pad(7)).Transpose() will give you the same behaviour as ZipLongest where all other streams will be padded as long as one of the source streams is still yielding.

Now, admittedly, both of these require you to know one of the two dimensions, either in terms of expected columns or rows.

Well, one of the examples where we used this was loan transaction processing where it was fed account payment streams grouped by month/year and produced a stream of monthly analitics. It had about 30000 input streams and produced one (as I sad its a from of a reduce pattern).

Good to hear about real use cases and thanks for sharing. I'm not surprised to be honest. I've dealt with gigabytes of data processing in constant memory with (More)LINQ so can appreciate the scale and dynamic nature of sources you might be dealing with.

Hence the reusable result array (but we can make it default true if that is a concern, although unless you are projecting the result array into TResult itself it does not really matter if its reused, the mutliZip itself does not project those rows like Transpose does). This is a heavy duty cruncher by design where variable number of sources would be generated dynamically. This is not really for a small number of fixed manually generated streams.

If you're having to work with large arrays along the likes of 30,000 elements, then they'll easily end up on the LOH. They won't be collected with younger collections and the LOH isn't compacted by default (although this can be controlled in a limited way now), which can lead to fragmentation. Reusing the array certainly makes sense if TResult is going to be small.

If we decide to add this, it seems to me that this will have a design similar to Batch that allows reuse of arrays from a pool. Did you have a look at its PR and implementation?

The MultiZipMissingSource option seems odd to me. I don't think anything else in LINQ does that.

Btw, Strict does sound better, imho.

Yeah, naming is hard. People used to get confused by Zip alone, but now have just gotten used to it.