morelinq / MoreLINQ

Extensions to LINQ to Objects
https://morelinq.github.io/
Apache License 2.0
3.63k stars 409 forks source link

Adds HasDuplicates #1001

Closed julienasp closed 7 months ago

julienasp commented 1 year ago

Add one new extension methods:

-HasDuplicates, an quick an efficient way to find out if you have any duplicates.

julienasp commented 1 year ago

@atifaziz @leandromoh @Orace if anyone is available for a quick review :)

julienasp commented 1 year ago

At the very least, this should be two separate PRs. HasDuplicates as an operator has nothing to do with ToReadOnlyCollection, so combining them on a single PR is definitely improper.

I can understand why, you would prefer 2 PR, i'll keep that in mind for my future contributions. Thanks!

julienasp commented 1 year ago
source.ScanBy(e => e, e => 0, (count, _, _) => count + 1, comparer)
          .Any(e => e is (_, > 1))

I can understand your concern, but isn't that the MoreLINQ mission to make it simple working with collections? And using DRY? the ScanBy solution works, and is efficient but lacks lisibility and abstraction.

You need to decipher the statement to understand that it does a HasDuplicates

viceroypenguin commented 1 year ago

An even simpler version that should be easy to decipher:

var strings = new[]
{
    "FirstElement",
    "DUPLICATED_STRING",
    "ThirdElement",
    "duplicated_string",
};

var hasDuplicates = strings
  .CountBy(e => e, StringComparer.OrdinalIgnoreCase)
  .Any(x => x.Value > 1);

Console.WriteLine($"Duplicates? {hasDuplicates}");

Or alternatively, not even using MoreLinq:

var strings = new[]
{
    "FirstElement",
    "DUPLICATED_STRING",
    "ThirdElement",
    "duplicated_string",
};

var hasDuplicates = strings
  .GroupBy(e => e, StringComparer.OrdinalIgnoreCase)
  .Any(g => g.Count() > 1);

Console.WriteLine($"Duplicates? {hasDuplicates}");
julienasp commented 1 year ago

An even simpler version that should be easy to decipher:

Still lacks abstraction and breaks DRY as soon as you use it more than once.

I mean the goal [is/should be] to add Abstraction over commonly used LINQ pattern, helping with DRY and adding explicit scenarios in the linq ( extension ) API

a quick google search:

tons of upvotes, tons of way of doing it... some more efficient than others... and yet people are still wondering HOW to do it, because .NET has not yet abstracted it into a pretty abstraction like HasDuplicates.

The goal should not be to figure out how to do a HasDuplicates. It should be to find which nice library like moreLINQ can help me improve my productivity by skipping those google search all together ;)

If you do not agree with the above, you can close that PR.

Have a great day and thanks for the feedback

leandromoh commented 1 year ago

@viceroypenguin the solution using "countBy + any" or "groupBy + any" are simple solutions and may be fine for lot of cases but are naive implementation in terms of performance, because the entire sequence is consumed to check what could be done in a lazy manner, for example, if you find a duplicate for example in the 3rd element, so you must not iterate the rest of elements.

CountBy and GroupBy consumes the entire sequence, what in this scenario would spend unnecessary resources (memory and CPU).

@atifaziz "GroupBy + Count" was a naive alternative for CountBy, but when we added it the rational was tha same: to provide an efficient (therefore not naive) implementation and well tested abstraction for the user to consume.

"ScanBy + Any" does not enumerate unnecessary elements (lazy) but also is hard to read. When we added IndexBy (which is basically a wrapper for ScanBy) the rational was tha same: to hide the hard part to reason about and make a simple and well tested abstraction for the user to consume.

IMHO the HasDuplicates is a good candidate to integrate Morelinq.

@julienasp thanks for your time and contribution. I did some comments in PR. Please, take a look.

Orace commented 1 year ago

Overloads with a selector (like the one first, min, max, etc have) are easy to implement and are a nice addition. I mean s.HasDuplicate(e => e.Id) instead of s.Select(e => e.Id).HasDuplicate().

julienasp commented 1 year ago

Overloads with a selector (like the one first, min, max, etc have) are easy to implement and are a nice addition. I mean s.HasDuplicate(e => e.Id) instead of s.Select(e => e.Id).HasDuplicate().

i added 2 overloads to enable projection

codecov[bot] commented 1 year ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (e98d632) 92.57% compared to head (d5b89f7) 92.60%. Report is 20 commits behind head on master.

:exclamation: Current head d5b89f7 differs from pull request most recent head 8b65e2f. Consider uploading reports for the commit 8b65e2f to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1001 +/- ## ========================================== + Coverage 92.57% 92.60% +0.03% ========================================== Files 113 114 +1 Lines 3422 3437 +15 Branches 1055 1060 +5 ========================================== + Hits 3168 3183 +15 Misses 191 191 Partials 63 63 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

atifaziz commented 1 year ago

I just wanted to drop a short note to say that I don't mean to give this any silent treatment. I have been travelling and will be AFK for most of the next two weeks. If I do get a moment or two, I will try and follow-up. If something like HasDuplicates hasn't landed in .NET for over a decade (based on the quoted ages of the StackOverflow questions), then hopefully a few more weeks won't kill anyone. 😅

Orace commented 1 year ago

Announcement (I will delete this post on request): HasDuplicates was added to EnumerationQuest v1.1.0.

viceroypenguin commented 1 year ago

Announcement (I will delete this post on request): HasDuplicates was added to EnumerationQuest v1.1.0.

@Orace Nice library! I may have to use that sometime...

HasDuplicates has a conflicting name with the one here; would GetHasDuplicates() be better?

Announcement It's also been merged into SuperLinq (thanks to @julienasp for copying his PR over there); to be released with v5.1.0 at the end of the month.

Orace commented 1 year ago

@viceroypenguin, it actually is GetHasDuplicates / AndHasDuplicates.

(hasDuplicate, min, max) = e.GetHasDuplicates().AndMin().AndMax();
(min, max, hasDuplicate) = e.GetMin().AndMax().AndHasDuplicates();
viceroypenguin commented 1 year ago

@viceroypenguin, it actually is GetHasDuplicates / AndHasDuplicates.

(hasDuplicate, min, max) = e.GetHasDuplicates().AndMin().AndMax();
(min, max, hasDuplicate) = e.GetMin().AndMax().AndHasDuplicates();

@Orace Ah, I just saw HasDuplicates as a method in one of the files, so I jumped from that to the idea that the IEnumerable<> extension would be HasDuplicates as well. Apologies! :)

julienasp commented 1 year ago

I just wanted to drop a short note to say that I don't mean to give this any silent treatment. I have been travelling and will be AFK for most of the next two weeks. If I do get a moment or two, I will try and follow-up. If something like HasDuplicates hasn't landed in .NET for over a decade (based on the quoted ages of the StackOverflow questions), then hopefully a few more weeks won't kill anyone. 😅

if you have time :)

atifaziz commented 1 year ago

if you have time :)

I'll try to find some this week.

julienasp commented 10 months ago

if you have time :)

I'll try to find some this week.

friendly reminder 📆

julienasp commented 9 months ago

if you have time :)

I'll try to find some this week.

friendly reminder 📆

Monthly reminder 👯‍♂️

julienasp commented 8 months ago

if you have time :)

I'll try to find some this week.

friendly reminder 📆

Monthly reminder 👯‍♂️

Monthly reminder 🗓️

atifaziz commented 7 months ago

Announcement (I will delete this post on request): HasDuplicates was added to EnumerationQuest v1.1.0.

@Orace Lately there's been lots of advertising going on in the repo. One more won't hurt 😅 so don't worry about deleting it.

julienasp commented 7 months ago

Hi @julienasp. Thanks for the friendly and regular reminders, and especially for your patience while I needed to attend to other things before reverting to this.

LINQ to Objects is missing a few desirable features. This project enhances LINQ to Objects with extra methods, in a manner which keeps to the spirit of LINQ. …, but isn't that the MoreLINQ mission to make it simple working with collections?

That's not the mission. It's to provide general and generic algorithms that operate on sequences and compose well with the rest of the ecosystem of operators to compute a desired result. The more general an operator is, the more reusable it is, and so yes:

And using DRY?

that is the DRY bit. If you have to repeatedly use some lambdas to vary and bend the algorithm to your actual need then that's the acceptable balance of repetition.

I mean the goal [is/should be] to add Abstraction over commonly used LINQ pattern, helping with DRY and adding explicit scenarios in the linq ( extension ) API

Abstractions can exist at several levels, but adding explicit scenarios contradicts that.

The first and foremost goal that remains is to encode unique sequence operators that encapsulate side-effects and which can be composed together. Sometimes, there's more than one way to skin a cat and as we've added more general operators to the library, we've found that some older one could be implemented in terms of newer and more general ones. The rule of thumb has been that if something takes 3 or more compositions and very broadly applicable then it may be worth adding. Choose is one such example, which is essentially Select + Where + Select. We have to also consider the cost of maintaining each addition long after the original contributor has moved on.

the ScanBy solution works, and is efficient but lacks lisibility and abstraction. You need to decipher the statement to understand that it does a HasDuplicates

I understand, but I think it's fair to ask the user to add a simple helper method to their project if it helps them to increase readability and avoid repetition. If you want to avoid doing that across several projects, I'd be happy to see someone maintain such small helpers as an add-on to MoreLINQ.

The goal should not be to figure out how to do a HasDuplicates.

The goal of MoreLINQ is to offer basic building blocks and sometimes you have to figure out how to combine those blocks to get the result you're after. The potential problem with adding it to MoreLINQ is scope creep. If we add HasDuplicates then the next person who wants a variation on it will ask for another version to be added. For example, beyond just knowing whether there are duplicates or not, one may want to:

  • list of all duplicates
  • list a sample of first N duplicates
  • list of duplicates together with their index/position in the sequence

Imagine if everyone wanted to add a version that reads more abstract and readable added then soon you could get a spaghetti of extensions that you need to try and maintain coherently.

That all said, I think we can strike a compromise here by shooting for the most general case of streaming duplicates in a sequence. Then you can combine it with Any() to get the effect of HasDuplicates (albeit at the cost of a small allocation) while also leaving the door open for others to have their variations (like combining with Take to get N samples or with Index to capture positions, etc.). If you're willing to reshape this into an extension called Duplicates then I'd be happy to consider this for the next release.

Otherwise, you can vote and argue on issue dotnet/runtime#30582 for the same into .NET.

"GroupBy + Count" was a naive alternative for CountBy, but when we added it the rational was tha same: to provide an efficient (therefore not naive) implementation and well tested abstraction for the user to consume.

@leandromoh Actually, it was simply added as a parallel to Seq.countBy in F# (see #108). There was no other rationale or thinking behind it besides it would allow C# developers to have the same functionality without taking a dependency on F# libraries.

When we added IndexBy (which is basically a wrapper for ScanBy) the rational was tha same: to hide the hard part to reason about and make a simple and well tested abstraction for the user to consume.

@leandromoh I just want to make sure that we recall things correctly here. We have always tried to see if something can be implemented more generally, and in that vein, IndexBy came before ScanBy. In fact, ScanBy was born out of generalising IndexBy during the PR review. Once the generalization was added, IndexBy was written in its terms. I am going through the same thought process here.

Thanks for the very detailed reply!

I really like your idea, so a Duplicates extension that returns all duplicates based on a predicate, and after that if they want to knows if there's any duplicates they can pair it with IsEmpty or Any

collection.Duplicates(x => x.Property).Any()

I'll work on that this week. Thanks again!

viceroypenguin commented 7 months ago

@atifaziz I'm sorry, but I'm going to disagree with you here.

.Duplicates() does not seem to be a valuable operator, as it does not add anything (other than naming) that couldn't be done by calling .IndexBy(predicate).Where(x => x.Value > 1).Select(x => x.Key).DistinctBy(predicate), .CountBy(predicate).Where(x => x.Value > 1) (depending on whether the writer prefers succinctness or performance/only consuming as many elements as necessary).

A .Duplicates() operator seems overly focused and not generally useful enough, as compared to other operators already present to handle counting.

However, .HasDuplicates() even though it is more focused, is more specifically and commonly useful (again, as evidenced by the number of people looking to know if there are any duplicates in a sequence).

Note that in this comment, the .net bcl person responsible for System.Linq prefers to use .CountBy() as a means of implementing .Duplicates(), and .CountBy() is now available in the bcl, along with .AggregateBy().

For the comments arguing about consuming the entire sequence, I don't believe that this is a justifiable reason to include .Duplicates() when it is trivially implemented using .IndexBy() or .ScanBy().

julienasp commented 7 months ago

Created a new PR here

but the NullArgumentTest are giving me trouble.

atifaziz commented 7 months ago

@atifaziz I'm sorry, but I'm going to disagree with you here.

@viceroypenguin You're entitled to disagree and your mileage may vary. I'm happy with neither adding Duplicates nor HasDuplicates because, as I said, you can achieve the same with ScanBy and ScanBy + Any the few times you need to do detect (and list) duplicates. In case ScanBy reads tricky for others, I'd leave a comment in the source code and move on, but that's me. Frankly, I'd be even happier if people directed more energy towards issue dotnet/runtime#30582 so they can get it as part of .NET proper. 😅

Since @julienasp seems content with the compromise of adding Duplicates instead, I'll let you debate with him and convince him otherwise. You have also added HasDuplicates to your personal fork, so I don't understand what pain you're wanting to help us avoid down the road by preventing a more general and streaming version from being added here.

I can imagine the streaming version being potentially helpful to do a check on data for duplicates (thinking millions of records) before submitting an expensive computation that could, for example, cost money on the cloud. A streaming version can stop as soon as a duplicate has been detected and report it. For those working with small data or are okay with a Boolean answer, they can just compose with Any. I don't see a reason to choose one over the other when you can have both.

Note that in this comment, the .net bcl person responsible for System.Linq prefers to use .CountBy() as a means of implementing .Duplicates(), and .CountBy() is now available in the bcl, along with .AggregateBy().

Unless I misread or missed something, I don't believe he ever expressed his preference in that comment. He just pointed out how people have been using Seq.countBy for the purpose of duplicates detection, which is fine if full consumption of the sequence is not a concern. Also, preferences are subjective.

Ultimately, this seems to be about names like HasDuplicates/Duplicates just being more discoverable for those who may be largely unfamiliar with the operators and how to use them to achieve the same.

atifaziz commented 7 months ago

Superseded by #1037.