morelinq / MoreLINQ

Extensions to LINQ to Objects
https://morelinq.github.io/
Apache License 2.0
3.63k stars 409 forks source link

FallbackIfEmpty overload with fallback factory #293

Open abatishchev opened 7 years ago

abatishchev commented 7 years ago

Hi, Currently FallBackIfEmpty has the following signature:

    public static IEnumerable<T> FallbackIfEmpty<T>(
    this IEnumerable<T> source,
    T fallback)

Does it mean if T is a method call private static T CreateDefault() it would be called anyway, even if source is not empty? I think yes, it would be.

So I propose to add an overload:

    public static IEnumerable<T> FallbackIfEmpty<T>(
    this IEnumerable<T> source,
    Func<T> fallbackFactory)

which would call it if only source is indeed empty.

Are you willing to accept a pr from me once agreed on the principles. Thanks!

fsateler commented 7 years ago

Indeed. In my local projects I have such an overload. I should probably clean it up a bit and post the PR...

abatishchev commented 7 years ago

Cool! Please add. Eager to assist/review if necessary.

atifaziz commented 7 years ago

There's a few and possibly more reusable and interesting ways to solve this.

First of all, the desired effect can be achieved today using the FallbackIfEmpty overload that takes a sequence of fallback values. If you're on the latest and greatest toys like C# 7 then you can use local functions together with Create from System.Interactive to supply the fallback values lazily via an iterator, like this:

var q =
    Enumerable.Empty<int>()
              .FallbackIfEmpty(EnumerableEx.Create(() => { return _(); IEnumerator<int> _()
              {
                  yield return 1;
                  yield return 2;
                  yield return 3;
              }}));

The only downside is that it may be cumbersome to have all the syntactic ceremony if you're just returning one or two fallback values. But before you dismiss it on that basis, keep in mind that the above example is a toy one. In a more complex & realistic example such as the next one, the ceremony takes the back seat:

var rates = Enumerable.Empty<(string Curr, double Rate)>();
var q =
    rates.FallbackIfEmpty(EnumerableEx.Create(() => { return _(); IEnumerator<(string Curr, double Rate)> _()
         {
             const string url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/euro.csv";
             var csv = new WebClient().DownloadString(url);
             var rows =
                from row in Regex.Split(csv, @"\r?\n").Skip(1 /* headers */)
                where row.Length > 0 // skip blank lines
                select row.Split(',').Fold((curr, rate) => (Curr: curr, Rate: rate))
                into row
                select (row.Curr.Trim('"'), double.Parse(row.Rate));
             return rows.GetEnumerator();
         }}));

Then again, you could just turn the whole rates download and parsing into a function:

static IEnumerable<(string Curr, double Rate)> DownloadRates(string url) =>
    from csv in new[] { new WebClient().DownloadString(url) }
    from row in Regex.Split(csv, @"\r?\n").Skip(1)
    where row.Length > 0
    select row.Split(',').Fold((curr, rate) => (Curr: curr, Rate: rate))
    into row
    select (row.Curr.Trim('"'), double.Parse(row.Rate));

and consequently don't need Create:

rates.FallbackIfEmpty(DownloadRates("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/euro.csv"));

But let's suppose that we'd still rather supply value factories. I would then opt for Lazy<T> instead of Func<T> (as proposed in #294). Besides yielding a semantically clearer signature, it also has the technical advantage of evaluating once and possibly taming side-effects. I would also go one step further and solve the problem once for all sequences-taking methods. What if we introduce a new operator that maps a sequence of Lazy<> to their values (lazily), like so?

public static IEnumerable<T> Values<T>(this IEnumerable<Lazy<T>> source) =>
    from v in source select v.Value;

Now that the return value is compatible with our FallbackIfEmpy overload taking a sequence of fallback values, we can just use Values as follows:

var q =
    Enumerable.Empty<int>()
              .FallbackIfEmpty(new[] { new Lazy<int>(() => 123),
                                       new Lazy<int>(() => 456) }.Values());

We could also introduce a (non-extension) params version of Values:

public static IEnumerable<T> Values<T>(params Lazy<T>[] lazies) =>
    lazies.Values();

This gets rid of the explicit array allocation at the call site:

var q =
    Enumerable.Empty<int>()
              .FallbackIfEmpty(MoreEnumerable.Values(
                                   new Lazy<int>(() => 123),
                                   new Lazy<int>(() => 456)));

And if we were to introduce another two helpers:

public static IEnumerable<Lazy<T>> Lazy<T>(params Func<T>[] factories) =>
    factories.Lazy();

public static IEnumerable<Lazy<T>> Lazy<T>(this IEnumerable<Func<T>> factories) =>
    from f in factories select new Lazy<T>(f);

then we get rid of explicit Lazy<> instantiation:

var q =
    Enumerable.Empty<int>()
              .FallbackIfEmpty(MoreEnumerable.Lazy(() => 123, () => 456).Values());

and now it doesn't read so bad. It's also very clear that those fallback values are served lazily.

Bottom line, rather than just think of this in the context of one method like FallbackIfEmpty and adding overloads and tests for just that, we have an opportunity here to add generalizations that may be come handy in many other scenarios.

abatishchev commented 7 years ago

It's a joy to read your hardcore C# :) however few thoughts:

atifaziz commented 7 years ago

@abatishchev Semantically, you don't want eager evaluation of fallback values assuming they're costly so you turn to lazy evaluation. You can choose to do this technically with Func<> or Lazy<> but semantically it's the latter. You're asking for call-by-need.

Lazy<T> has bigger overhead than Func<T> as I understand

Yes it does and it's getting better but it should still dwarf in comparison to the cost of computing the value. If you're worried about the cost of Lazy<> versus Func<> then your fallback values are too cheap. :wink: Perhaps we should explore this over a real example you can share?

And again semantically/methodologically, intend used of Lazy<T> is if not multi-threaded reads (where the overhead comes from) but at least multiple reads

If you don't want/need the cost of thread-safety synchronization then use Lazy​Thread​Safety​Mode.​None.

While intend of Func<T> as a fallback is strictly one-time read.

Wait, but that is Lazy<>.

Also Func<T> can substitute Lazy<T> but less in other way around

public static Func<T> ToFunc<T>(this Lazy<T> lazy) => () => lazy.Value;

What am I missing? :thinking:

So I would not replace an overload accepting Func<T> with an overload accepting Lazy<T> but instead have them side-by-side

We shouldn't boil this down to subjectivity or preferences of the caller where Func<> is for those who say tomayto and Lazy<> for those who say tomahto. I wouldn't add either 😆 because…

And that's my perspective, where the initial issue comes from, how to make this one-time fallback more effective.

And I think I've shown that FallbackIfEmpty is already there today by virtue of accepting a sequence that can be lazy. If you personally prefer Func<> over Lazy<> then just add the following one-liner to your project:

public static IEnumerable<T> Values<T>(params Func<T>[] factories) =>
   from f in factories select f();

and you're done:

xs = xs.FallbackIfEmpty(Values(() => 1, () => 2, () => 3));

You're only downside will be one extra allocation of an enumerable and eventually an enumerator if the fallback values are needed. In practical terms, though, I would put those allocations and costs relative to (supposedly) those of fallback computations.

Do you have a real example of how you came to your issue with FallbackIfEmpty? Did your fallback values come from a configuration in a file on disk or a remote database?

atifaziz commented 7 years ago

Indeed. In my local projects I have such an overload.

@fsateler Can you share an example of how you've ended up needing that overload in your project?

fsateler commented 7 years ago

First of all, sorry for the silence. It's been a busy week and I haven't had time to fully think about @atifaziz concerns. I hope to do that soon though.

Indeed. In my local projects I have such an overload.

@fsateler Can you share an example of how you've ended up needing that overload in your project?

I have used it for example in classes that cache an expensive (that is, db) source:

T GetById(int id)  => LocalCache.Where(i => i.Id == id).FallbackIfEmpty(Repository.GetById).Single();

As you noted in your reply, you could do that using C# 7 and local functions:

T GetById(int id) {
    return LocalCache.Where(i => i.Id == id).FallbackIfEmpty(goToRepo).Single();
    IEnumerable<T> goToRepo() {
        yield return Repository.GetById(id);
    }
}

But I prefer to have that local function in a single place (the FallbackIfEmpty overload) rather than anywhere a single fallback is needed.

PS: I will fully respond to the longer post later, but I'll just quickly note that the Values function you proposed was proposed in #151, so I agree it has value independent of the FallbackIfEmpty overloads.

fsateler commented 7 years ago

I would then opt for Lazy instead of Func (as proposed in #294). Besides yielding a semantically clearer signature, it also has the technical advantage of evaluating once and possibly taming side-effects.

But, is it always desired to evaluate once tame side effects? I don't think so. In particular, if the fallback function queries a mutable source, it may well be desirable to have the function be reevaluated on each enumeration.

If side-effect taming or single-evaluation is desired, one can always use Memoize to "freeze" the sequence. In the case of Func<T> parameters, it is trivial to memoize parameterless functions:

public static Func<T> Memoize<T>(this Func<T> original) {
    var lazy = new Lazy<T>(original);
    return () => lazy.Value;
}

I would also go one step further and solve the problem once for all sequences-taking methods.

I agree with the general direction of your argument, but I disagree that the building block should be Lazy<T>, and instead think that deciding whether side-effects should be repeated or not is orthogonal to creating a sequence to iterate over, and thus should be solved by a separate method.

As I said back when FallbackIfEmpty was implemented, I'm not opposed to having a FromDelegates method and use composition instead of additional overloads. However, I do think that the FallbackIfEmpty case with a few (expensive) elements is common enough that it would be nice to have the convenience overload, just as we have the convenience overloads for a low number of already-known or cheaply-calculated values.

fsateler commented 7 years ago

However, I do think that the FallbackIfEmpty case with a few (expensive) elements is common enough that it would be nice to have the convenience overload

I forgot to add, it is so common we already have a SingleOrFallback (which we declared obsolete once FallbackIfEmpty was in place).

abatishchev commented 7 years ago

I'm sorry for silence too, last week before ​paternity leave as expected got to be a busy one.

I disagree that the building block should be Lazy, and instead think that deciding whether side-effects should be repeated or not is orthogonal to creating a sequence to iterate over, and thus should be solved by a separate method.

Can't agree more. Not just due to small but overhead but first of semantically. Also original LINQ methods are all Funx<T> based. I'm welcoming an overload accepting Lazy<T> but would love to see it an overload but not the only way to use this method. Also like the idea t leverage the composition whenever possible.

Is it always desired to evaluate once tame side effects? I don't think so.

It depends. In other words - we don't know. That's why I would leave that to the consumer to decide the desired behavior. So having two overloads would cover both scenarios.

atifaziz commented 7 years ago

@fsateler Thanks for sharing that GetById example. It always helps to have real cases to discuss. When I look at this (with some edits):

T GetById(int id)  =>
    LocalCache.Where(i => i.Id == id)
              .FallbackIfEmpty(() => Repository.GetById(id)) // correction
              .Single();

then it doesn't look like anything to do with sequences except that you start with one. You're really looking up a single value and failing that, you want to invoke some function to default one. We have ?? just for that purpose and I would have written the above simply as:

T GetById(int id) => LocalCache.SingleOrDefault(i => i.Id == id)
                  ?? Repository.GetById(id);

If you trust your cache to have uniquely identifiable objects then you can also use FirstOrDefault and it'll be faster than a Where in some cases, on average. I don't see the constraint but I'm assuming T is a reference type.

If we had a type like option in F# in the .NET Framework then we could have had something like TryFind in LINQ that returns an Option<> and an OrElse extension for Option<> that would allow substituting a value for the none case:

T GetById(int id) => LocalCache.TryFind(i => i.Id == id) // if this returns None...
                               .OrElse(() => Repository.GetById(id)); // ...then the lambda runs

But since that's not an option (pun, pun :wink:), ?? is the C# hammer for that.

So now I'm still waiting for a good example to drive this. ⏳

I forgot to add, it is so common we already have a SingleOrFallback (which we declared obsolete once FallbackIfEmpty was in place).

As I said back when FallbackIfEmpty was implemented, I'm not opposed to having a FromDelegates method and use composition instead of additional overloads.

Looks like we've already been here before and I should go back and refresh my memory so we don't keep repeating ourselves.

I seem to have also mislead this discussion by mentioning side-effects. It was just pointed out as a side benefit of using Lazy<> and never meant to be the primary motivation. Sticking to models and semantics, call-by-need is best represented and communicated in a signature via Lazy<>. From a functional purity perspective, I think the concern that the fallback can never vary with Lazy<> is an artificial one. I think the best way to move forward is to challenge each other through real examples to drive the design.

atifaziz commented 7 years ago

First of all, sorry for the silence.

I'm sorry for silence too, …

“Apologies accepted, Captain Needa”

@fsateler @abatishchev Guys, this is open source and volunteered time and work so no need to be sorry. 😆

fsateler commented 7 years ago

it doesn't look like anything to do with sequences except that you start with one. You're really looking up a single value and failing that, you want to invoke some function to default one.

Indeed, guilty as charged. In my defense, though, I imagined from the beginning that FallbackIfEmpty was going to be analogous to DefaultIfEmpty:

provide a FallbackIfEmpty method akin to DefaultIfEmpty, and then use Single, First, Last as needed.

The use cases in my mind always were about fetching a single value.

I don't see the constraint but I'm assuming T is a reference type.

In this case yes, it is.

I seem to have also mislead this discussion by mentioning side-effects. It was just pointed out as a side benefit of using Lazy<> and never meant to be the primary motivation.

While not the primary motivation, it would introduce a semantic difference between FallbackIfEmpty(otherSequence) and FallbackIfEmpty(() => GetFallbackValue()), in that the first call could enumerate otherSequence multiple times, but GetFallbackValue can only be ever called once.

From a functional purity perspective, I think the concern that the fallback can never vary with Lazy<> is an artificial one

I don't understand this comment. Could you elaborate?

Sticking to models and semantics, call-by-need is best represented and communicated in a signature via Lazy<>

Is it? Lazy<>'s short description is "Provides support for lazy initialization". Avoiding multiple evaluation is a core part of the class reason to exist, while call-by-need does not imply the same.

abatishchev commented 7 years ago

Regarding real-world scenarios, mine is quite simple: data source (Applier with U-SQL in Azure Data Lake) usually yields multiple rows but may also none. The layer above can't handle empty output since it relies on columns to present so it has to return a row with same columns as usually​ having nulls/default values. Currently this is implemented using if (any) foreach in results yield return else yield return empty. Replacement with LINQ is obvious but would pass code reviews only if were as efficient as possible. With the built-in DefaultIfEmpty I'd be told default row is always created and I have nothing to retort with.

abatishchev commented 7 years ago

Regarding Lazy<T>, I have mixed feeling. On one hand, an overload accepting Func<T> can call new Lazy<T>(func) and this way hide the implementation details, the fact that is uses Lazy<T> to achieve the declared goal of at-most-once evaluation. On other hand, I still feel that Lazy<T> has semantically different purpose: caller accesses Value multiple times (if once, the overhead is not worth it) plus an ability to choose the behavior regarding possible side effects of multi-threading (such as swallowing exceptions if any, etc.). While Func<T> means just dereffered evaluation. In the scope of DefaultIfEmpty it means fallback factory and that's it, return any ? source : default. So it my head the declared goal of the method matches the behavior of the underlying implementation.

fsateler commented 6 years ago

When https://github.com/morelinq/MoreLINQ/pull/333 is merged this can probably be closed, as it would be as simple as src.FallbackIfEmpty(MoreEnumerable.From(() => SomeExpensiveFunc())) .

abatishchev commented 6 years ago

Can it be as simple as src.FallbackIfEmpty(() => SomeExpensiveFunc()) on the surface and then call underneath whatever it's willing to?

atifaziz commented 6 years ago

…it would be as simple as src.FallbackIfEmpty(MoreEnumerable.From(() => SomeExpensiveFunc()))

@fsateler Or simpler yet: src.FallbackIfEmpty(MoreEnumerable.From(SomeExpensiveFunc))

abatishchev commented 6 years ago

Is there a way to drop MoreEnumerable.* from being required to be used?

atifaziz commented 6 years ago

@abatishchev Technically speaking, there's always a way. If you can't bare the current design and wish to just pass the function as a parameter, then I suggest you create a tiny local wrapper to make it easy on your eyes:

public static IEnumerable<T> FallbackIfEmpty<T>(
    this IEnumerable<T> source, Func<T> fallbackFactory) =>
    source.FallbackIfEmpty(MoreEnumerable.From(fallbackFactory));
atifaziz commented 6 years ago

@abatishchev And I should add that it's my bad (poor prioritisation or time management) for not getting back to @fsateler on the last points he raised and why this hasn't moved further along.