Closed atifaziz closed 5 years ago
This sounds like a good idea!
@fsateler Glad to hear you like the idea. The API still needs a bit of work I feel because writing out an aggregation in-place, even something as simple as counting, means re-writing what's already there and could be tedious and error-prone. What's more, there's no opportunity for writing reusable accumulators. Here's what I am aiming for the next (possibly final?) shape of the API:
Console.WriteLine(
xs.Aggregate(
a => a.Sum(),
a => a.Count(),
a => a.Min(),
a => a.Max(),
a => a.Collect(new List<int>()),
(sum, count, min, max, list) => new
{
Sum = sum,
Count = count,
Average = (double) sum / count,
Min = min,
Max = max,
Content = "[" + string.Join(", ", list) + "]",
}));
To achieve this, we need a type to represent an aggregation algorithm:
interface IAggregator<T, TResult>
{
void Next(T item);
TResult GetResult();
}
Notice that the seed is gone as that is now the responsibility of the IAggregator<,>
initialization. Next come two seemingly bizarre definitions but bear with me on those for now; take them as a mere hack for type inference and more:
struct AggregatorSource<T> { }
partial class Aggregator
{
public static AggregatorSource<T> AsAggregatorSource<T>(this IEnumerable<T> source) =>
default(AggregatorSource<T>);
}
Now an Aggregate
overload for five aggregations looks like this:
public static TResult Aggregate<T, TResult1, TResult2, TResult3, TResult4, TResult5, TResult>(
this IEnumerable<T> source,
Func<AggregatorSource<T>, IAggregator<T, TResult1>> aggregatorSelector1,
Func<AggregatorSource<T>, IAggregator<T, TResult2>> aggregatorSelector2,
Func<AggregatorSource<T>, IAggregator<T, TResult3>> aggregatorSelector3,
Func<AggregatorSource<T>, IAggregator<T, TResult4>> aggregatorSelector4,
Func<AggregatorSource<T>, IAggregator<T, TResult5>> aggregatorSelector5,
Func<TResult1, TResult2, TResult3, TResult4, TResult5, TResult> resultSelector) =>
source.Aggregate(
(Agg1: aggregatorSelector1(source.AsAggregatorSource()),
Agg2: aggregatorSelector2(source.AsAggregatorSource()),
Agg3: aggregatorSelector3(source.AsAggregatorSource()),
Agg4: aggregatorSelector4(source.AsAggregatorSource()),
Agg5: aggregatorSelector5(source.AsAggregatorSource())),
(s, e) =>
{
s.Agg1.Next(e);
s.Agg2.Next(e);
s.Agg3.Next(e);
s.Agg4.Next(e);
s.Agg5.Next(e);
return s;
},
s => resultSelector(s.Agg1.GetResult(),
s.Agg2.GetResult(),
s.Agg3.GetResult(),
s.Agg4.GetResult(),
s.Agg5.GetResult()));
The implementation is quite straightforward. The aggregators become the seeds themselves. Each aggregator maintains its own state and folds each value of the sequence into it. At the end, GetResult
calls are used to string together the final result.
Next comes infrastructure to help create an IAggregator<,>
implementation easily given a seed and functions for the Next
and GetResult
members of the interface:
static partial class Aggregator
{
public static IAggregator<T, TResult> Create<T, TState, TResult>(
TState seed,
Func<TState, T, TState> accumulator,
Func<TState, TResult> resultSelector) =>
new DelegatingAggregator<T, TState, TResult>(seed, accumulator, resultSelector);
sealed class DelegatingAggregator<T, TState, TResult> : IAggregator<T, TResult>
{
TState _state;
readonly Func<TState, T, TState> _accumulator;
readonly Func<TState, TResult> _resultSelector;
public DelegatingAggregator(TState seed, Func<TState, T, TState> accumulator, Func<TState, TResult> resultSelector)
{
_state = seed;
_accumulator = accumulator;
_resultSelector = resultSelector;
}
public void Next(T item) => _state = _accumulator(_state, item);
public TResult GetResult() => _resultSelector(_state);
}
}
With that, here's how the aggregators can be implemented:
partial class Aggregator
{
public static IAggregator<T, int> Count<T>(this AggregatorSource<T> _) =>
Aggregator.Create(0, (int s, T t) => s + 1, s => s);
public static IAggregator<int, int> Sum(this AggregatorSource<int> _) =>
Aggregator.Create(0, (int s, int x) => s + x, s => s);
public static IAggregator<T, T> Some<T>(Func<T, T, T> acc) =>
Aggregator.Create((Count: 0, Result: default(T)),
((int Count, T Result) s, T e) => (s.Count + 1, s.Count > 0 ? acc(s.Result, e) : e),
s => s.Count > 0 ? s.Result : throw new InvalidOperationException());
public static IAggregator<int, int> Min(this AggregatorSource<int> _) => Some<int>(Math.Min);
public static IAggregator<int, int> Max(this AggregatorSource<int> _) => Some<int>(Math.Max);
public static IAggregator<double, double> Min(this AggregatorSource<double> _) => Some<double>(Math.Min);
public static IAggregator<double, double> Max(this AggregatorSource<double> _) => Some<double>(Math.Max);
public static IAggregator<T, TCollection> Collect<T, TCollection>(this AggregatorSource<T> _, TCollection collection)
where TCollection : ICollection<T> =>
Aggregator.Create(collection, (TCollection s, T t) => { s.Add(t); return s; }, s => s);
}
Notice how Min
and Max
reuse Some
. Notice further how the right Min
and Max
overloads will be picked depending on T
, thanks to AggregatorSource<>
. Count
is also agnostic to T
.
We can also now introduce a Count
overload that counts elements satisfying a predicate:
partial class Aggregator
{
public static IAggregator<T, int> Count<T>(this AggregatorSource<T> _, Func<T, bool> predicate) =>
Aggregator.Create(0, (int s, T t) => s + (predicate(t) ? 1 : 0), s => s);
}
and use it like this:
Console.WriteLine(
xs.Aggregate(
a => a.Sum(),
a => a.Count(),
a => a.Min(),
a => a.Max(),
a => a.Count(x => x % 2 == 0),
(sum, count, min, max, evens) => new
{
Sum = sum,
Count = count,
Average = (double) sum / count,
Min = min,
Max = max,
EvenCount = evens
}));
Earlier, we saw Collect
, which can collect elements into any collection. If collecting into a list is a common scenario then one can also introduce a simple helper for that now:
partial class Aggregator
{
public static IAggregator<T, List<T>> List<T>(this AggregatorSource<T> _) =>
_.Collect(new List<T>());
}
And naturally it builds on top of Collect
. This makes the usage really readable and clean:
Console.WriteLine(
xs.Aggregate(
a => a.Sum(),
a => a.Count(),
a => a.Min(),
a => a.Max(),
a => a.List(), // instead of // a => a.Collect(new List<int>()),
(sum, count, min, max, list) => new
{
Sum = sum,
Count = count,
Average = (double) sum / count,
Min = min,
Max = max,
Content = "[" + string.Join(", ", list) + "]",
}));
What's nice is that the calls to Sum
, Count
, Min
and Max
look just like how you would use them in normal LINQ, but here they do their work in conjunction with others and doing so in a single iteration.
Thoughts & feedback on this new design?
This is an interesting design. I like the IAggregator<>
interface as that can be used to implement optimized operators that do not allocate new memory on each iteration (and it would be nice if we did that for the standard operators).
I'm still thinking about your proposed design. I've so far failed to come up with something that dispenses with the AggregatorSource
but preserves an easy way to add aggregators with selectors (I mean .Sum(obj => obj.Property)
). All alternatives I think about are quite ugly. With the AggregatorSource
things become fairly easy:
public static IAggregator<T, int> Sum<T>(this AggregatorSource<T> _, Func<T, int> selector) =>
Aggregator.Create(0, (int s, T x) => s + selector(x), s => s);
It's ugly but it makes things a lot easier.
I had a fairly ugly start as well but after coming up with AggregatorSource<>
, everything just fell nicely into place. Perhaps the name does throw one off but its purpose is to communicate the source type, not the source sequence. This is why it's just a unit/singleton for every T
. Does it still feel ugly then?
Your projected Sum
example is interesting. I think we can even generalize that via a Select
:
public static IAggregator<T1, TResult> Select<T1, T2, TResult>(
this AggregatorSource<T1> source, Func<T1, T2> selector,
Func<AggregatorSource<T2>, IAggregator<T2, TResult>> aggregatorSelector)
{
var aggregator = aggregatorSelector(default(AggregatorSource<T2>));
return Aggregator.Create(default(object),
(object _, T1 t) => { aggregator.Next(selector(t)); return _; },
_ => aggregator.GetResult());
}
So given another aggregator, it injects a projection. You can then even reuse it to implement your Sum
overload:
public static IAggregator<T, int> Sum<T>(this AggregatorSource<T> _, Func<T, int> selector) =>
_.Select(selector, a => a.Sum());
You could even use Select
inline:
Console.WriteLine(
"1,2,3,4,5,6,7,8,9,10"
.Split(',')
.Aggregate(
a => a.Select(int.Parse, aa => aa.Sum()),
a => a.Count(),
(sum, count) => new { Sum = sum, Count = count }));
Since we are composing aggregators now, I wonder if there's another trick left that could make the composition read more like a.Select(int.Parse).Sum()
, as in:
Console.WriteLine(
"1,2,3,4,5,6,7,8,9,10"
.Split(',')
.Aggregate(
a => a.Select(int.Parse).Sum(),
a => a.Count(),
(sum, count) => new { Sum = sum, Count = count }));
:thinking:
Since we are composing aggregators now, I wonder if there's another trick left that could make the composition read more like
a.Select(int.Parse).Sum()
This could be solved by making the AggregatorSource
be a useful object:
public class AggregatorSource<T, TProp> {
public AggregatorSource(Func<T, TProp> getValue) {
GetValue = getValue;
}
public Func<T, TProp> GetValue { get; }
public AggregatorSource<T, TNestedProp> Select<TNestedProp>(Func<TProp, TNestedProp> projection) {
var gv = GetValue;
return new AggregatorSource<T, TNestedProp>(v => projection(gv(v)));
}
}
public static class Aggregator {
public static AggregatorSource<T, T> AsAggregatorSource<T>(this IEnumerable<T> source) =>
new AggregatorSource<T, T>(i => i);
public static IAggregator<T, int> Sum<T>(this AggregatorSource<T, int> source) =>
Aggregator.Create(0, (int s, T x) => s + source.GetValue(x), s => s);
public static IAggregator<T, int> Count<T>(this AggregatorSource<T, T> _) =>
Aggregator.Create(0, (int s, T t) => s + 1, s => s);
}
partial class MoreEnumerable {
public static TResult Aggregate<T, TResult1, TResult2, TResult>(
this IEnumerable<T> source,
Func<AggregatorSource<T, T>, IAggregator<T, TResult1>> aggregatorSelector1,
Func<AggregatorSource<T, T>, IAggregator<T, TResult2>> aggregatorSelector2,
Func<TResult1, TResult2, TResult> resultSelector) =>
source.Aggregate(
(Agg1: aggregatorSelector1(source.AsAggregatorSource()),
Agg2: aggregatorSelector2(source.AsAggregatorSource())
),
(s, e) =>
{
s.Agg1.Next(e);
s.Agg2.Next(e);
return s;
},
s => resultSelector(s.Agg1.GetResult(),
s.Agg2.GetResult())
);
}
And now your last example compiles.
2 questions:
ToList()
to sorce before do the aggregations? So you only iterate source
once.@fsateler Thanks for your feedback, ideas and participating in the design.
making the
AggregatorSource
be a useful object
I agree and I think we're getting closer. More than just Select
, I wanted to find a general way to combine operations for each item of the source, like a Where
for filtering (as LINQ's Count
supports). What follows is my next take on the design of the API.
I've dropped IAggregator<>
for now and redefined the aggregator source as:
interface IAggregatorSource<T>
{
void Connect(Action<T> sink);
}
Each aggregator will be responsible for connecting to the source. The sink is where items will be pushed. Here's some helper code for implementing the interface given a connection handler:
static partial class AggregatorSource
{
public static IAggregatorSource<T> Create<T>(Action<Action<T>> connectionHandler) =>
new DelegatingAggregatorSource<T>(connectionHandler);
sealed class DelegatingAggregatorSource<T> : IAggregatorSource<T>
{
readonly Action<Action<T>> _connectionHandler;
public DelegatingAggregatorSource(Action<Action<T>> connectionHandler) =>
_connectionHandler = connectionHandler;
public void Connect(Action<T> sink) => _connectionHandler(sink);
}
}
Now the interesting part is that we can define Select
and Where
in terms of IAggregatorSource<>
:
partial class AggregatorSource
{
public static IAggregatorSource<U> Select<T, U>(this IAggregatorSource<T> source, Func<T, U> selector) =>
Create<U>(sink => source.Connect(item => sink(selector(item))));
public static IAggregatorSource<T> Where<T>(this IAggregatorSource<T> source, Func<T, bool> predicate) =>
Create<T>(sink => source.Connect(item => { if (predicate(item)) sink(item); }));
}
Aggregators now connect to a source, supplying a handler for each item (what used to be Next
on IAggregator<>
), and return a function for obtaining the final result (what used to be GetResult
on IAggregator<>
):
static partial class Aggregator
{
public static Func<TState> Connect<TState, T>(
this IAggregatorSource<T> source,
TState seed, Func<TState, T, TState> accumulator) =>
source.Connect(seed, accumulator, s => s);
public static Func<TResult> Connect<TState, T, TResult>(
this IAggregatorSource<T> source,
TState seed, Func<TState, T, TState> accumulator,
Func<TState, TResult> resultSelector)
{
var state = seed;
source.Connect(item => state = accumulator(state, item));
return () => resultSelector(state);
}
public static Func<int> Sum(this IAggregatorSource<int> source) =>
source.Connect(0, (s, x) => s + x);
public static Func<int> Count<T>(this IAggregatorSource<T> source) =>
source.Connect(0, (s, _) => s + 1);
public static Func<T> Some<T>(this IAggregatorSource<T> source, Func<T, T, T> accumulator) =>
source.Connect(
(Count: 0, State: default(T)),
(s, e) => (s.Count + 1, s.Count > 0 ? accumulator(s.State, e) : e),
s => s.Count > 0 ? s.State : throw new InvalidOperationException());
public static Func<T> Min<T>(this IAggregatorSource<T> source)
where T : IComparable<T> =>
source.Some((x, y) => x.CompareTo(y) < 0 ? x : y);
public static Func<T> Max<T>(this IAggregatorSource<T> source)
where T : IComparable<T> =>
source.Some((x, y) => x.CompareTo(y) > 0 ? x : y);
public static Func<List<T>> List<T>(this IAggregatorSource<T> source) =>
source.Collect(new List<T>());
public static Func<ISet<T>> Distinct<T>(this IAggregatorSource<T> source, IEqualityComparer<T> comparer = null) =>
source.Collect(new HashSet<T>(comparer));
public static Func<TCollection> Collect<T, TCollection>(this IAggregatorSource<T> source, TCollection collection)
where TCollection : ICollection<T> =>
source.Connect(collection, (TCollection s, T t) => { s.Add(t); return s; }, s => s);
}
Finally, here's the Aggregate
method that puts it together:
static class MoreEnumerable
{
public static TResult Aggregate<T, TResult1, TResult2, TResult3, TResult4, TResult5, TResult>(
this IEnumerable<T> source,
Func<IAggregatorSource<T>, Func<TResult1>> aggregatorConnector1,
Func<IAggregatorSource<T>, Func<TResult2>> aggregatorConnector2,
Func<IAggregatorSource<T>, Func<TResult3>> aggregatorConnector3,
Func<IAggregatorSource<T>, Func<TResult4>> aggregatorConnector4,
Func<IAggregatorSource<T>, Func<TResult5>> aggregatorConnector5,
Func<TResult1, TResult2, TResult3, TResult4, TResult5, TResult> resultSelector)
{
var s1 = new AggregatorSource<T>(); var c1 = aggregatorConnector1(s1);
var s2 = new AggregatorSource<T>(); var c2 = aggregatorConnector2(s2);
var s3 = new AggregatorSource<T>(); var c3 = aggregatorConnector3(s3);
var s4 = new AggregatorSource<T>(); var c4 = aggregatorConnector4(s4);
var s5 = new AggregatorSource<T>(); var c5 = aggregatorConnector5(s5);
foreach (var item in source)
{
s1.Next(item);
s2.Next(item);
s3.Next(item);
s4.Next(item);
s5.Next(item);
}
return resultSelector(c1(), c2(), c3(), c4(), c5());
}
sealed class AggregatorSource<T> : IAggregatorSource<T>
{
Action<T> _sink;
public void Connect(Action<T> sink) => _sink = sink;
public void Next(T item) => _sink(item);
}
}
A push source is created to which every aggregator is connected. The items are then pulled from the source and pushed to each sink, each of which can do its projection, filtering and what not before being folded by the aggregator at the end of the chain.
Here's some usage showing all of the above in action:
var xs = "1,2,3,4,5,6,7,8,9,10".Split(',');
Console.WriteLine(
xs.Select(s => new { Str = s, Num = int.Parse(s) })
.Aggregate(
s => s.Select(e => e.Num).Sum(),
s => s.Count(),
s => s.Select(e => e.Num).Min(),
s => s.Select(e => e.Num).Max(),
s => s.Where(e => e.Str != null) // can't happen but for illustration
.Select(e => e.Str.Length)
.Distinct(),
(sum, count, min, max, lengths) => new
{
Sum = sum,
Count = count,
Avgerage = (double) sum / count,
Min = min,
Max = max,
UniqueLengths
= "[" + string.Join(", ", lengths) + "]",
}
));
An interesting thing to note is that IAggregatorSource<>
with its connect and sink is there to model a push stream; put another way, aggregators are observers and subscribe to a shared iteration. In .NET, we already have IObservable<>
(source) and IObserver<>
(sink) to model that so we don't need any new types or representations. Also, Rx already has Select
, Where
and other operators implemented so we don't need to do anything! The two downsides are:
IObservable<>
-IObserver<>
protocol. That is, most aggregators won't care about errors, completion notice and unsubscribing.I think 1 is better than introducing new types that more or less mean the same thing. As for 2, you can mitigate some of that by building in the simple cases of projections and filtering directly into the aggregators as LINQ does today.
- why simple dont apply
ToList()
to sorce before do the aggregations? So you only iteratesource
once.
@leandromoh Suppose you have a directory of hundreds of CSV files, each with several thousands of lines of data. With ToList
, you will not only waste time & space building a list (multiple re-allocations & copying as it grows dynamically) but you will also waste time traversing the list of hundreds of thousands of records several times (once per type of aggregation).
- how common is it ?
What may be common in my problem space may not be in yours so I don't see how I can answer this objectively. Everyone's mileage will vary.
@atifaziz Interesting new proposal! It adds a bit of complexity but it is indeed much more powerful. I have some reservations on reusing the IObserver<>
abstraction.
OnCompleted
is not used. In general I favor keeping interfaces as minimal as possible.OnError
seems to be actively harmful. When IObservable<>
-IObserver<>
are used it makes sense to propagate errors down the sink chain because the error might be thrown in a separate thread, or where nobody is able to catch it. In our case, Aggregate
is not only synchronous, but it is eagerly evaluated, and thus the point for catching errors is natural: just wrap the call with a try{} catch{}
block. I'm not sure how exception handling would look like when OnError
is (possibly) used. One would have to handle two possible error paths?OTOH, I'm thinking this is already way too similar to Rx. It may well be that what I see as unneeded bloat is just domain complexity I don't see.
But, if reusing IObserver<>
is the way forward, would it be best placed here or in Rx?
OnError
isn't so much for catching an exception here as it is a way for Aggregate
to notify the accumulators of an error, while either iterating the sequence (i.e. during a call to MoveNext
) or when an adjacent accumulator throws. The only thing I can imagine an accumulator doing during OnError
is clean-up and my guess is that that's almost never going to be needed. Same goes for OnComplete
but it may be slightly more useful and I'll get into it another time.
just wrap the call with a try{} catch{} block
So yes, this would still be the case for the user of Aggregate
but OnError
is only for notification to accumulators during the main loop as the pull is converted to a push.
In general I favor keeping interfaces as minimal as possible.
Me too but I think having IAggregateSource<>
with a Connect
method, while simpler, is just confusing as it has nothing to do with aggregation per se. Initially, I had it there as a type inference hack and so struggled to give it a proper name and a real purpose. The refactorings that came with enabling a compositional pipeline over it lead to the observer pattern.
But, if reusing
IObserver<>
is the way forward, would it be best placed here or in Rx?
I'd say it still belongs here because fundamentally we're talking about Aggregate
for sequences and any use of observable/observer is just to model how Aggregate
interacts with the accumulators.
@atifaziz OK, your explanation is convincing. I still have one thing I didn't quite understand correctly:
you can mitigate some of that by building in the simple cases of projections and filtering directly into the aggregators as LINQ does today.
Could you elaborate on this? I did not understand how would this problem would be mitigated.
Could you elaborate on this? I did not understand how would this problem would be mitigated.
@fsateler I wouldn't want to expose Select
& Where
as extensions of IObservable<>
because we'd run the risk of creating conflicts for projects using Rx and LINQ. If someone wants to use those for an accumulation pipeline, as in a.Where(x => x % 2 == 0).Count()
or a.Select(int.Parse).Sum()
, they'll then have to bring in Rx. By offering projection and filtering on overloads of the accumulators, as in a.Count(x => x % 2 == 0)
(what LINQ does too), we can help avoid anyone taking a dependency on Rx for the most common cases.
I wouldn't want to expose
Select
&Where
as extensions ofIObservable<>
because we'd run the risk of creating conflicts for projects using Rx and LINQ.
Actually, come to think of it, there's still all the aggregation methods like Sum
and Count
that extend IObservable<>
too so conflicts and confusion could still reign. Those aggregation methods actually terminate the composition chain by subscribing then aggregating and perhaps this needs to be reflected in the names? Like DoSum
and DoCount
? :thinking:
Actually, come to think of it, there's still all the aggregation methods like
Sum
andCount
that extendIObservable<>
too so conflicts and confusion could still reign
Indeed. That's what I was thinking.
Those aggregation methods actually terminate the composition chain by subscribing then aggregating and perhaps this needs to be reflected in the names? Like DoSum and DoCount? :thinking:
I'm not really a fan of Do[Name]
, since it is very unclear on why would it differ from a non-Do
method.
The aggregation methods provided by Rx are not useful for us, since they will block and try to aggregate instantly. I'm surprised there is no Async overload for all of those operators. Maybe we could provide Async operators in a separate namespace, and then when Rx adds them people would just stop importing the separate namespace.
The aggregation method would look something like this (untested since I'm not on my workstation right now):
static class MoreEnumerable
{
public static TResult Aggregate<T, TResult1, TResult2>(
this IEnumerable<T> source,
Func<IObservable<T>, Task<TResult1>> aggregatorConnector1,
Func<IObservable<T>, Task<TResult2>> aggregatorConnector2,
Func<TResult1, TResult2, TResult> resultSelector)
{
var s1 = new AggregatorSource<T>(); var c1 = aggregatorConnector1(s1);
var s2 = new AggregatorSource<T>(); var c2 = aggregatorConnector2(s2);
foreach (var item in source)
{
s1.Next(item);
s2.Next(item);
}
s1.Complete();
s2.Complete();
if (!c1.IsCompleted) {
throw new InvalidOperationException("Should have completed!");
}
if (!c2.IsCompleted) {
throw new InvalidOperationException("Should have completed!");
}
return resultSelector(c1.Value, c2.Value);
}
sealed class AggregatorSource<T> : IObservable<T>
{
readonly List<IObserver<T>> sinks = new List<IObserver<T>>();
public IDisposable Subscribe(IObserver<T> observer) {
sinks.Add(observer);
return Delegating.Disposable(() => sinks.Remove(observer));
}
public void Next(T item) => sinks.ForEach(s => s.OnNext(item));
public void Complete() => sinks.ForEach(s => s.OnCompleted());
}
}
And the async aggregation methods:
class Aggregator<T, TResult> : IObserver<T> {
private readonly TaskCompletionSource<T> CompletionSource = new TaskCompletionSource<T>();
public Task<T> Task => CompletionSource.Task;
private readonly Func<TResult> GetResult;
private readonly Action<T> AggregatorFunc;
public Aggregator(Action<T> aggregatorFunc, Func<TResult> getResult) {
GetResult = getResult;
AggregatorFunc = aggregatorFunc;
}
public void OnNext(T item) {
AggregatorFunc(item);
}
public void OnCompleted() {
CompletionSource.SetResult(GetResult());
}
public void OnError(Exception e) {
Source.SetException(e);
}
}
public static Task<int> CountAsync(this IObservable<T> src) {
int count = 0;
var aggregator = new Aggregator<T, int>(_ => count++, () => count);
src.Subscribe(aggregator);
return aggregator.Task;
}
The aggregation methods provided by Rx are not useful for us, since they will block and try to aggregate instantly. I'm surprised there is no Async overload for all of those operators.
Are you sure? I can perfectly await on the Sum
by Rx:
var sum = await Observable.Range(1, 10).Sum();
The async bits do muddy the waters conceptually. While one could argue that the aggregated results are future values, that won't be the case if we also decide to do a multi-Scan
version.
Actually, you've just given me another idea! 💡 I'll be back 😆 but the basis of it is that Rx already has Publish
that can be used to run concurrent aggregations:
var xs = Enumerable.Range(1, 10); // aggregate source
var source = xs.ToObservable().Publish(); // share
var sum = new int[1]; // by-ref
var count = new int[1]; // by-ref
source.Sum().Subscribe(a => sum[0] = a);
source.Count().Subscribe(a => count[0] = a);
source.Connect();
// pipeline above should have run in and entirely
// at time of connection so sum and count will
// have their final values
Console.WriteLine($"Sum = {sum[0]}, Count = {count[0]}");
@fsateler So with Aggregate
completely based on and defined in terms of IObservable<>
, we get this:
public static TResult Aggregate<TSource, TResult1, TResult2, TResult3, TResult4, TResult5, TResult>(
this IEnumerable<TSource> source,
Func<IObservable<TSource>, IObservable<TResult1>> aggregatorSelector1,
Func<IObservable<TSource>, IObservable<TResult2>> aggregatorSelector2,
Func<IObservable<TSource>, IObservable<TResult3>> aggregatorSelector3,
Func<IObservable<TSource>, IObservable<TResult4>> aggregatorSelector4,
Func<IObservable<TSource>, IObservable<TResult5>> aggregatorSelector5,
Func<TResult1, TResult2, TResult3, TResult4, TResult5, TResult> resultSelector)
{
var r1 = new TResult1[1]; var s1 = new Subject<TSource>();
var r2 = new TResult2[1]; var s2 = new Subject<TSource>();
var r3 = new TResult3[1]; var s3 = new Subject<TSource>();
var r4 = new TResult4[1]; var s4 = new Subject<TSource>();
var r5 = new TResult5[1]; var s5 = new Subject<TSource>();
using (aggregatorSelector1(s1).Subscribe(Observer.Create((TResult1 r) => r1[0] = r)))
using (aggregatorSelector2(s2).Subscribe(Observer.Create((TResult2 r) => r2[0] = r)))
using (aggregatorSelector3(s3).Subscribe(Observer.Create((TResult3 r) => r3[0] = r)))
using (aggregatorSelector4(s4).Subscribe(Observer.Create((TResult4 r) => r4[0] = r)))
using (aggregatorSelector5(s5).Subscribe(Observer.Create((TResult5 r) => r5[0] = r)))
{
foreach (var item in source)
{
s1.OnNext(item);
s2.OnNext(item);
s3.OnNext(item);
s4.OnNext(item);
s5.OnNext(item);
}
s1.OnCompleted();
s2.OnCompleted();
s3.OnCompleted();
s4.OnCompleted();
s5.OnCompleted();
}
return resultSelector(r1[0], r2[0], r3[0], r4[0], r5[0]);
}
The big benefits are:
The usage is dead simple too and what we've been aiming for since early on:
Console.WriteLine(
"1,2,3,4,5,6,7,8,9,10"
.Split(',')
.Select(s => new { Str = s, Num = int.Parse(s) })
.Aggregate(
s => s.Select(e => e.Num).Sum(),
s => s.Count(),
s => s.Select(e => e.Num).Min(),
s => s.Select(e => e.Num).Max(),
s => s.Select(e => e.Str.Length).Distinct().ToArray(),
(sum, count, min, max, lengths) => new
{
Sum = sum,
Count = count,
Avgerage = (double) sum / count,
Min = min,
Max = max,
UniqueLengths
= "[" + string.Join(", ", lengths) + "]",
}
));
// outputs:
// { Sum = 55, Count = 10, Avgerage = 5.5, Min = 1, Max = 10, UniqueLengths = [1, 2] }
Note the special case of Distinct
that must be combined with ToArray
otherwise you only get back 2 (in fact with Distinct
alone you see that lengths
type is int
instead of int[]
), which is the last length among unique lengths. This may be a little non-obvious so Aggregate
could be made to throw a run-time error if an aggregator publishes more than one result.
A downside to the latest design is that any aggregator not shipping in Rx must be implemented as an observable. While Rx does make this easy, it does mean taking a dependency on it. Perhaps a good compromise is to offer the first approach with accumulators as simple functions (little reuse, aggregators must be implemented locally but very simple) in addition to one based on observables (makes rich composition possible, allows use of projection, filtering and aggregate methods are already part of Rx but slightly more complex).
The aggregation methods provided by Rx are not useful for us, since they will block and try to aggregate instantly. I'm surprised there is no Async overload for all of those operators.
Are you sure? I can perfectly await on the Sum by Rx
Oops, sorry. That's what I get from relying on memory and not testing. Indeed Sum
and the other aggregation methods return an IObservable<>
, and are thus awaitable (as long as Rx is installed, which you should otherwise you don't have Sum
;).
However, not all is well, because:
Note the special case of Distinct that must be combined with ToArray otherwise you only get back 2 (in fact with Distinct alone you see that lengths type is int instead of int[]), which is the last length among unique lengths. This may be a little non-obvious so Aggregate could be made to throw a run-time error if an aggregator publishes more than one result.
I see this as a major drawback. Aggregation methods returning the same type as selector methods is a serious flaw in the Rx API IMO. I wouldn't like to perpetuate that flaw in MoreLINQ (even if we catch it at runtime). Unfortunately, the only way out I see is to go back to Func<IObservable<TSource>, Task<>>
as aggregator selectors, and require adding ToTask
to all selectors:
Console.WriteLine(
"1,2,3,4,5,6,7,8,9,10"
.Split(',')
.Select(s => new { Str = s, Num = int.Parse(s) })
.Aggregate(
s => s.Select(e => e.Num).Sum().ToTask(),
s => s.Count().ToTask(),
s => s.Select(e => e.Num).Min().ToTask(),
s => s.Select(e => e.Num).Max().ToTask(),
s => s.Select(e => e.Str.Length).Distinct().ToList().ToTask(),
(sum, count, min, max, lengths) => new
{
Sum = sum,
Count = count,
Avgerage = (double) sum / count,
Min = min,
Max = max,
UniqueLengths
= "[" + string.Join(", ", lengths) + "]",
}
));
This makes invocations a bit cumbersome.
BTW, there is no need to have one subject per selector:
public static TResult Aggregate<TSource, TResult1, TResult2, TResult3, TResult4, TResult5, TResult>(
this IEnumerable<TSource> source,
Func<IObservable<TSource>, Task<TResult1>> aggregatorSelector1,
Func<IObservable<TSource>, Task<TResult2>> aggregatorSelector2,
Func<IObservable<TSource>, Task<TResult3>> aggregatorSelector3,
Func<IObservable<TSource>, Task<TResult4>> aggregatorSelector4,
Func<IObservable<TSource>, Task<TResult5>> aggregatorSelector5,
Func<TResult1, TResult2, TResult3, TResult4, TResult5, TResult> resultSelector)
{
var s = new Subject<TSource>();
var c1 = aggregatorSelector1(s);
var c2 = aggregatorSelector2(s);
var c3 = aggregatorSelector3(s);
var c4 = aggregatorSelector4(s);
var c5 = aggregatorSelector5(s);
foreach (var item in source)
{
s.OnNext(item);
}
s.OnCompleted();
return resultSelector(c1.Result, c2.Result, c3.Result, c4.Result, c5.Result);
}
as long as Rx is installed, which you should otherwise you don't have
Sum
True although it isn't very hard to implement at all if you really don't wish to take the dependency.
Aggregation methods returning the same type as selector methods is a serious flaw in the Rx API IMO. I wouldn't like to perpetuate that flaw in MoreLINQ (even if we catch it at runtime).
I agree it would be best if types could capture the expectation that a single result should come out at the other end of the chain and while Task<>
does fit the bill (though somewhat feeling like a promise/future API of sorts), there's still one annoying issue. It requires that the aggregator must be combined with ToTask
to satisfy the signature and by doing so, the subscription becomes the responsibility of the observable-to-task mapping function. We can then no longer make run-time behaviour checks for violation of the protocol, that for example the task never observes more than one result. The Rx implementation returns the last observation. If we take the case of Distinct
, it means that you can get the wrong results if you forget to call ToList
or ToArray
. On the other hand, if Aggregate
becomes the observer then it can make run-time checks, like the one you also do to make sure that the task was indeed completed:
if (!c1.IsCompleted) {
throw new InvalidOperationException("Should have completed!");
}
A run-time check is better than a false result. A compile-time constrained API is even better but it requires introducing new types and providing our own implementation of all aggregators (plus filtering and projection).
Just for reference & discussion, here is what those run-time checks would look like:
public static TResult Aggregate<TSource, TResult1, TResult2, TResult3, TResult4, TResult5, TResult>(
this IEnumerable<TSource> source,
Func<IObservable<TSource>, IObservable<TResult1>> aggregatorSelector1,
Func<IObservable<TSource>, IObservable<TResult2>> aggregatorSelector2,
Func<IObservable<TSource>, IObservable<TResult3>> aggregatorSelector3,
Func<IObservable<TSource>, IObservable<TResult4>> aggregatorSelector4,
Func<IObservable<TSource>, IObservable<TResult5>> aggregatorSelector5,
Func<TResult1, TResult2, TResult3, TResult4, TResult5, TResult> resultSelector)
{
var r1 = new (bool, TResult1)[1];
var r2 = new (bool, TResult2)[1];
var r3 = new (bool, TResult3)[1];
var r4 = new (bool, TResult4)[1];
var r5 = new (bool, TResult5)[1];
var subject = new Subject<TSource>();
IDisposable Subscribe<T>(int n, IObservable<T> aggregator, (bool Defined, T)[] r, string paramName) =>
ReferenceEquals(aggregator, subject)
? throw new ArgumentException($"Aggregator cannot be identical to the source.", paramName)
: aggregator.Subscribe(s =>
r[0] = r[0].Defined
? throw new InvalidOperationException($"Aggregator #{n} produced multiple results when only one is allowed.")
: (true, s));
using (Subscribe(1, aggregatorSelector1(subject), r1, nameof(aggregatorSelector1)))
using (Subscribe(2, aggregatorSelector2(subject), r2, nameof(aggregatorSelector2)))
using (Subscribe(3, aggregatorSelector3(subject), r3, nameof(aggregatorSelector3)))
using (Subscribe(4, aggregatorSelector4(subject), r4, nameof(aggregatorSelector4)))
using (Subscribe(5, aggregatorSelector5(subject), r5, nameof(aggregatorSelector5)))
{
foreach (var item in source)
subject.OnNext(item);
subject.OnCompleted();
}
T Get<T>(int n, (bool Defined, T Value) result) =>
!result.Defined
? throw new InvalidOperationException($"Aggregator #{n} has an undefined result.")
: result.Value;
return resultSelector(Get(1, r1[0]), Get(2, r2[0]), Get(3, r3[0]), Get(4, r4[0]), Get(5, r5[0]));
}
Of course, the same can be done with tasks (which helps remove the by-reference/boxing array hack) but it would nonetheless remain an internal implementation detail:
public static TResult Aggregate<TSource, TResult1, TResult2, TResult3, TResult4, TResult5, TResult>(
this IEnumerable<TSource> source,
Func<IObservable<TSource>, IObservable<TResult1>> aggregatorSelector1,
Func<IObservable<TSource>, IObservable<TResult2>> aggregatorSelector2,
Func<IObservable<TSource>, IObservable<TResult3>> aggregatorSelector3,
Func<IObservable<TSource>, IObservable<TResult4>> aggregatorSelector4,
Func<IObservable<TSource>, IObservable<TResult5>> aggregatorSelector5,
Func<TResult1, TResult2, TResult3, TResult4, TResult5, TResult> resultSelector)
{
var r1 = new TaskCompletionSource<TResult1>();
var r2 = new TaskCompletionSource<TResult2>();
var r3 = new TaskCompletionSource<TResult3>();
var r4 = new TaskCompletionSource<TResult4>();
var r5 = new TaskCompletionSource<TResult5>();
var subject = new Subject<TSource>();
IDisposable Subscribe<T>(int n, IObservable<T> aggregator, TaskCompletionSource<T> r, string paramName) =>
ReferenceEquals(aggregator, subject)
? throw new ArgumentException($"Aggregator cannot be identical to the source.", paramName)
: aggregator.Subscribe(s =>
{
if (r.Task.IsCompleted)
throw new InvalidOperationException($"Aggregator #{n} produced multiple results when only one is allowed.");
r.SetResult(s);
});
using (Subscribe(1, aggregatorSelector1(subject), r1, nameof(aggregatorSelector1)))
using (Subscribe(2, aggregatorSelector2(subject), r2, nameof(aggregatorSelector2)))
using (Subscribe(3, aggregatorSelector3(subject), r3, nameof(aggregatorSelector3)))
using (Subscribe(4, aggregatorSelector4(subject), r4, nameof(aggregatorSelector4)))
using (Subscribe(5, aggregatorSelector5(subject), r5, nameof(aggregatorSelector5)))
{
foreach (var item in source)
subject.OnNext(item);
subject.OnCompleted();
}
T Get<T>(int n, Task<T> result) =>
!result.IsCompleted
? throw new InvalidOperationException($"Aggregator #{n} has an undefined result.")
: result.Result;
return resultSelector(Get(1, r1.Task), Get(2, r2.Task), Get(3, r3.Task), Get(4, r4.Task), Get(5, r5.Task));
}
We can externalize the checking by providing a ToTask
extension for IObservable<>
that allows for only a single result but this would come in conflict with a project using Rx and MoreLINQ (sigh). We could give it another name but then that doesn't guarantee that the aggregator selector function will call the right version!
BTW, there is no need to have one subject per selector
You're right. That was a leftover from a limitation of my own implementation of Subject<>
that evolved out of AggregatorSource<>
and which did not initially support multiple subscriptions. I wasn't using the one from Rx for my exploration.
Did this ever make it into MoreLinq? Or anywhere else? Looks like a great feature @atifaziz
@Magnus12 Not yet although I'd still like to see that happen. I don't know if you followed the entire thread, but @fsateler and I were still pondering over the pros and cons of the API design, especially around signatures. I think once that is done (which frankly is and has been the hardest part), the rest is just a question of rolling up the sleeves and getting down to the implementation and tests.
Now that we have introduced an experimental namespace with #209, I can imagine throwing this out in the wild and let real-world field use drive the final decisions rather than leave it hanging here.
Meanwhile, I did some experimentation (see RxScans) that toyed with the idea of “accumulators for observable sources that emit their intermediate results.“ There's a runnable example in there too. The RxScans project shows that if you're willing to convert an enumerable into an observable then you can pretty much get the effect of “multiple aggregations in a single iteration” today.
Did this ever make it into MoreLinq?
@Magnus12 This is now shipping in 3.2. Check out also the examples.
The idea I wish to propose/discuss is best illustrated through examples. Suppose you want to count and sum up the numbers in a sequence. With LINQ's
Count
andSum
, you'd do it like this:Suppose now you want to also get the average:
Unfortunately, using these methods individually means the source ends up being iterated multiple times, once per aggregation. You could use LINQ's
Aggregate
to do multiple aggregations while iterating through the source once:However, as you increase the number of aggregations, some of which may not be trivial, the code becomes less readable and the seeds & accumulators get pulled further apart. By adding
Aggregate
overloads that allow multiple seeds, accumulators and a projector of all results like below:one can write something more natural:
Let's add another two aggregations, like getting the minimum and maximum value in addition to the count, sum and average. With an overload like this:
one can write:
Note how using this approach, the existing
Math.Min
andMath.Max
can be used directly as accumulators. With C# 7's local functions, complex accumulators can be expressed simply, even generically, and reused locally: