Closed julienasp closed 7 months ago
@atifaziz @leandromoh @Orace if anyone is available for a quick review :)
At the very least, this should be two separate PRs.
HasDuplicates
as an operator has nothing to do withToReadOnlyCollection
, so combining them on a single PR is definitely improper.
I can understand why, you would prefer 2 PR, i'll keep that in mind for my future contributions. Thanks!
source.ScanBy(e => e, e => 0, (count, _, _) => count + 1, comparer) .Any(e => e is (_, > 1))
I can understand your concern, but isn't that the MoreLINQ mission to make it simple working with collections? And using DRY? the ScanBy solution works, and is efficient but lacks lisibility and abstraction.
You need to decipher the statement to understand that it does a HasDuplicates
An even simpler version that should be easy to decipher:
var strings = new[]
{
"FirstElement",
"DUPLICATED_STRING",
"ThirdElement",
"duplicated_string",
};
var hasDuplicates = strings
.CountBy(e => e, StringComparer.OrdinalIgnoreCase)
.Any(x => x.Value > 1);
Console.WriteLine($"Duplicates? {hasDuplicates}");
Or alternatively, not even using MoreLinq:
var strings = new[]
{
"FirstElement",
"DUPLICATED_STRING",
"ThirdElement",
"duplicated_string",
};
var hasDuplicates = strings
.GroupBy(e => e, StringComparer.OrdinalIgnoreCase)
.Any(g => g.Count() > 1);
Console.WriteLine($"Duplicates? {hasDuplicates}");
An even simpler version that should be easy to decipher:
Still lacks abstraction and breaks DRY as soon as you use it more than once.
I mean the goal [is/should be] to add Abstraction over commonly used LINQ pattern, helping with DRY and adding explicit scenarios in the linq ( extension ) API
a quick google search:
tons of upvotes, tons of way of doing it... some more efficient than others... and yet people are still wondering HOW to do it, because .NET has not yet abstracted it into a pretty abstraction like HasDuplicates.
The goal should not be to figure out how to do a HasDuplicates. It should be to find which nice library like moreLINQ can help me improve my productivity by skipping those google search all together ;)
If you do not agree with the above, you can close that PR.
Have a great day and thanks for the feedback
@viceroypenguin the solution using "countBy + any" or "groupBy + any" are simple solutions and may be fine for lot of cases but are naive implementation in terms of performance, because the entire sequence is consumed to check what could be done in a lazy manner, for example, if you find a duplicate for example in the 3rd element, so you must not iterate the rest of elements.
CountBy and GroupBy consumes the entire sequence, what in this scenario would spend unnecessary resources (memory and CPU).
@atifaziz "GroupBy + Count" was a naive alternative for CountBy, but when we added it the rational was tha same: to provide an efficient (therefore not naive) implementation and well tested abstraction for the user to consume.
"ScanBy + Any" does not enumerate unnecessary elements (lazy) but also is hard to read. When we added IndexBy (which is basically a wrapper for ScanBy) the rational was tha same: to hide the hard part to reason about and make a simple and well tested abstraction for the user to consume.
IMHO the HasDuplicates
is a good candidate to integrate Morelinq.
@julienasp thanks for your time and contribution. I did some comments in PR. Please, take a look.
Overloads with a selector (like the one first, min, max, etc have) are easy to implement and are a nice addition. I mean s.HasDuplicate(e => e.Id)
instead of s.Select(e => e.Id).HasDuplicate()
.
Overloads with a selector (like the one first, min, max, etc have) are easy to implement and are a nice addition. I mean
s.HasDuplicate(e => e.Id)
instead ofs.Select(e => e.Id).HasDuplicate()
.
i added 2 overloads to enable projection
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
e98d632
) 92.57% compared to head (d5b89f7
) 92.60%. Report is 20 commits behind head on master.:exclamation: Current head d5b89f7 differs from pull request most recent head 8b65e2f. Consider uploading reports for the commit 8b65e2f to get more accurate results
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I just wanted to drop a short note to say that I don't mean to give this any silent treatment. I have been travelling and will be AFK for most of the next two weeks. If I do get a moment or two, I will try and follow-up. If something like HasDuplicates
hasn't landed in .NET for over a decade (based on the quoted ages of the StackOverflow questions), then hopefully a few more weeks won't kill anyone. 😅
Announcement (I will delete this post on request): HasDuplicates
was added to EnumerationQuest v1.1.0.
Announcement (I will delete this post on request):
HasDuplicates
was added to EnumerationQuest v1.1.0.
@Orace Nice library! I may have to use that sometime...
HasDuplicates
has a conflicting name with the one here; would GetHasDuplicates()
be better?
Announcement It's also been merged into SuperLinq (thanks to @julienasp for copying his PR over there); to be released with v5.1.0 at the end of the month.
@viceroypenguin, it actually is GetHasDuplicates
/ AndHasDuplicates
.
(hasDuplicate, min, max) = e.GetHasDuplicates().AndMin().AndMax();
(min, max, hasDuplicate) = e.GetMin().AndMax().AndHasDuplicates();
@viceroypenguin, it actually is
GetHasDuplicates
/AndHasDuplicates
.(hasDuplicate, min, max) = e.GetHasDuplicates().AndMin().AndMax(); (min, max, hasDuplicate) = e.GetMin().AndMax().AndHasDuplicates();
@Orace Ah, I just saw HasDuplicates
as a method in one of the files, so I jumped from that to the idea that the IEnumerable<>
extension would be HasDuplicates
as well. Apologies! :)
I just wanted to drop a short note to say that I don't mean to give this any silent treatment. I have been travelling and will be AFK for most of the next two weeks. If I do get a moment or two, I will try and follow-up. If something like
HasDuplicates
hasn't landed in .NET for over a decade (based on the quoted ages of the StackOverflow questions), then hopefully a few more weeks won't kill anyone. 😅
if you have time :)
if you have time :)
I'll try to find some this week.
if you have time :)
I'll try to find some this week.
friendly reminder 📆
if you have time :)
I'll try to find some this week.
friendly reminder 📆
Monthly reminder 👯♂️
if you have time :)
I'll try to find some this week.
friendly reminder 📆
Monthly reminder 👯♂️
Monthly reminder 🗓️
Announcement (I will delete this post on request):
HasDuplicates
was added to EnumerationQuest v1.1.0.
@Orace Lately there's been lots of advertising going on in the repo. One more won't hurt 😅 so don't worry about deleting it.
Hi @julienasp. Thanks for the friendly and regular reminders, and especially for your patience while I needed to attend to other things before reverting to this.
LINQ to Objects is missing a few desirable features. This project enhances LINQ to Objects with extra methods, in a manner which keeps to the spirit of LINQ. …, but isn't that the MoreLINQ mission to make it simple working with collections?
That's not the mission. It's to provide general and generic algorithms that operate on sequences and compose well with the rest of the ecosystem of operators to compute a desired result. The more general an operator is, the more reusable it is, and so yes:
And using DRY?
that is the DRY bit. If you have to repeatedly use some lambdas to vary and bend the algorithm to your actual need then that's the acceptable balance of repetition.
I mean the goal [is/should be] to add Abstraction over commonly used LINQ pattern, helping with DRY and adding explicit scenarios in the linq ( extension ) API
Abstractions can exist at several levels, but adding explicit scenarios contradicts that.
The first and foremost goal that remains is to encode unique sequence operators that encapsulate side-effects and which can be composed together. Sometimes, there's more than one way to skin a cat and as we've added more general operators to the library, we've found that some older one could be implemented in terms of newer and more general ones. The rule of thumb has been that if something takes 3 or more compositions and very broadly applicable then it may be worth adding.
Choose
is one such example, which is essentiallySelect
+Where
+Select
. We have to also consider the cost of maintaining each addition long after the original contributor has moved on.the ScanBy solution works, and is efficient but lacks lisibility and abstraction. You need to decipher the statement to understand that it does a HasDuplicates
I understand, but I think it's fair to ask the user to add a simple helper method to their project if it helps them to increase readability and avoid repetition. If you want to avoid doing that across several projects, I'd be happy to see someone maintain such small helpers as an add-on to MoreLINQ.
The goal should not be to figure out how to do a HasDuplicates.
The goal of MoreLINQ is to offer basic building blocks and sometimes you have to figure out how to combine those blocks to get the result you're after. The potential problem with adding it to MoreLINQ is scope creep. If we add
HasDuplicates
then the next person who wants a variation on it will ask for another version to be added. For example, beyond just knowing whether there are duplicates or not, one may want to:
- list of all duplicates
- list a sample of first N duplicates
- list of duplicates together with their index/position in the sequence
Imagine if everyone wanted to add a version that reads more abstract and readable added then soon you could get a spaghetti of extensions that you need to try and maintain coherently.
That all said, I think we can strike a compromise here by shooting for the most general case of streaming duplicates in a sequence. Then you can combine it with
Any()
to get the effect ofHasDuplicates
(albeit at the cost of a small allocation) while also leaving the door open for others to have their variations (like combining withTake
to get N samples or withIndex
to capture positions, etc.). If you're willing to reshape this into an extension calledDuplicates
then I'd be happy to consider this for the next release.Otherwise, you can vote and argue on issue dotnet/runtime#30582 for the same into .NET.
"GroupBy + Count" was a naive alternative for CountBy, but when we added it the rational was tha same: to provide an efficient (therefore not naive) implementation and well tested abstraction for the user to consume.
@leandromoh Actually, it was simply added as a parallel to
Seq.countBy
in F# (see #108). There was no other rationale or thinking behind it besides it would allow C# developers to have the same functionality without taking a dependency on F# libraries.When we added IndexBy (which is basically a wrapper for ScanBy) the rational was tha same: to hide the hard part to reason about and make a simple and well tested abstraction for the user to consume.
@leandromoh I just want to make sure that we recall things correctly here. We have always tried to see if something can be implemented more generally, and in that vein,
IndexBy
came beforeScanBy
. In fact,ScanBy
was born out of generalisingIndexBy
during the PR review. Once the generalization was added,IndexBy
was written in its terms. I am going through the same thought process here.
Thanks for the very detailed reply!
I really like your idea, so a Duplicates extension that returns all duplicates based on a predicate, and after that if they want to knows if there's any duplicates they can pair it with IsEmpty or Any
collection.Duplicates(x => x.Property).Any()
I'll work on that this week. Thanks again!
@atifaziz I'm sorry, but I'm going to disagree with you here.
.Duplicates()
does not seem to be a valuable operator, as it does not add anything (other than naming) that couldn't be done by calling .IndexBy(predicate).Where(x => x.Value > 1).Select(x => x.Key).DistinctBy(predicate)
, .CountBy(predicate).Where(x => x.Value > 1)
(depending on whether the writer prefers succinctness or performance/only consuming as many elements as necessary).
A .Duplicates()
operator seems overly focused and not generally useful enough, as compared to other operators already present to handle counting.
However, .HasDuplicates()
even though it is more focused, is more specifically and commonly useful (again, as evidenced by the number of people looking to know if there are any duplicates in a sequence).
Note that in this comment, the .net bcl person responsible for System.Linq prefers to use .CountBy()
as a means of implementing .Duplicates()
, and .CountBy()
is now available in the bcl, along with .AggregateBy()
.
For the comments arguing about consuming the entire sequence, I don't believe that this is a justifiable reason to include .Duplicates()
when it is trivially implemented using .IndexBy()
or .ScanBy()
.
Created a new PR here
but the NullArgumentTest are giving me trouble.
@atifaziz I'm sorry, but I'm going to disagree with you here.
@viceroypenguin You're entitled to disagree and your mileage may vary. I'm happy with neither adding Duplicates
nor HasDuplicates
because, as I said, you can achieve the same with ScanBy
and ScanBy
+ Any
the few times you need to do detect (and list) duplicates. In case ScanBy
reads tricky for others, I'd leave a comment in the source code and move on, but that's me. Frankly, I'd be even happier if people directed more energy towards issue dotnet/runtime#30582 so they can get it as part of .NET proper. 😅
Since @julienasp seems content with the compromise of adding Duplicates
instead, I'll let you debate with him and convince him otherwise. You have also added HasDuplicates
to your personal fork, so I don't understand what pain you're wanting to help us avoid down the road by preventing a more general and streaming version from being added here.
I can imagine the streaming version being potentially helpful to do a check on data for duplicates (thinking millions of records) before submitting an expensive computation that could, for example, cost money on the cloud. A streaming version can stop as soon as a duplicate has been detected and report it. For those working with small data or are okay with a Boolean answer, they can just compose with Any
. I don't see a reason to choose one over the other when you can have both.
Note that in this comment, the .net bcl person responsible for System.Linq prefers to use
.CountBy()
as a means of implementing.Duplicates()
, and.CountBy()
is now available in the bcl, along with.AggregateBy()
.
Unless I misread or missed something, I don't believe he ever expressed his preference in that comment. He just pointed out how people have been using Seq.countBy
for the purpose of duplicates detection, which is fine if full consumption of the sequence is not a concern. Also, preferences are subjective.
Ultimately, this seems to be about names like HasDuplicates
/Duplicates
just being more discoverable for those who may be largely unfamiliar with the operators and how to use them to achieve the same.
Superseded by #1037.
Add one new extension methods:
-HasDuplicates, an quick an efficient way to find out if you have any duplicates.