mgravell / fast-member

Automatically exported from code.google.com/p/fast-member
Apache License 2.0

Feature Request: Support Span<T> Enumerator / Memory<T> in ObjectReader.Create #79

Open joebone opened 4 years ago

joebone commented 4 years ago

In the case where we have massive Bulk Inserts pending (which seems like a very common use case for this library ;) ), we often have to break those into smaller batches, often 100K or fewer, from Lists or arrays of a couple of million elements.

Rather than allocating those intermediate arrays, it would be great to be able to pass a sliced Span<T>, or at the very least a Memory<T>. Internally I see the private source field is an IEnumerator rather than the IEnumerable that the method requires, and GetEnumerator is called at the end.

It would be beneficial to be able to either pass a Span<T> / Memory<T> slice directly, or at least be able to pass an IEnumerator such as Span<T>.GetEnumerator().

Basic example of the intended use case:

private async Task BatchedBulkInsertAsync<T>(T[] arr, string tableName, string[] members, int chunkSize = 100000) {
  var mem = new ReadOnlyMemory<T>(arr);
  for (int i = 0; i < arr.Length; i += chunkSize) {
    var slice = mem.Slice(i, Math.Min(chunkSize, arr.Length - i));
    _logger.LogInformation("Inserting {size} {tableName}: Pos:{i}, Total:{total}", slice.Length, tableName, i, arr.Length);
    using (var bcp = new SqlBulkCopy(_connectionString))
    using (var reader = ObjectReader.Create<T>(slice.Span.GetEnumerator(), members)) { // the desired (not yet existing) overload
      bcp.DestinationTableName = tableName;
      var sw = Stopwatch.StartNew();
      bcp.BulkCopyTimeout = 120;
      await bcp.WriteToServerAsync(reader);
      sw.Stop();
      _logger.LogWarning("Inserted {count} to {tableName} in {elapsed} ms", slice.Length, tableName, sw.Elapsed.TotalMilliseconds);
    }
  }
}
mgravell commented 4 years ago

The notion of reusing the object instance has merit, but overall is unlikely to be a significant overhead in any scenarios that are talking about data access. I'd want to talk about perceived benefit before jumping in there, because then we also get into issues of lifetime management, disposal / recycling, etc. I suspect the complexity and risk outweigh any expected benefit.


We can't take Span<T> / Span<T>.Enumerator, as they are both ref struct types and cannot be stored anywhere in this context (since we need to implement an interface here, we can't consider a custom ref struct implementation). In particular, re:

or at least be able to pass an IEnumerator such as Span<T>.GetEnumerator()

you'll notice that Span<T>.Enumerator is not in fact an IEnumerator, for this same reason.
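[Editor's note: a minimal, self-contained illustration of the ref struct constraint Marc describes; this is not library code. Span<T>.Enumerator can only be iterated on the stack, and the commented-out lines show the conversions the compiler rejects.]

```csharp
using System;

class RefStructDemo
{
    static void Main()
    {
        Span<int> span = stackalloc int[] { 1, 2, 3 };

        // Span<int>.Enumerator is a ref struct: it lives on the stack only.
        var e = span.GetEnumerator();
        while (e.MoveNext())
            Console.WriteLine(e.Current);

        // Neither line below compiles: a ref struct cannot be boxed,
        // stored in a class field, or converted to an interface type.
        // IEnumerator<int> boxed = span.GetEnumerator(); // error CS0306
        // object o = span.GetEnumerator();               // cannot box ref struct
    }
}
```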


Memory<T> is possible, but has significant issues here, as the .Span access should be considered relatively expensive - plus we'd need to slice or index into the span each time we progress the thing.


Overall, I'd say that it is unlikely that we can add huge benefit here. Is there a specific scenario that you're thinking of? Also, if there is something we want to do here, then in any scenario that is "bulk", I'd also be tempted to look past Span<T> / Memory<T>, and think more about ReadOnlySequence<T>, since most "bulk" scenarios usually benefit from discontiguous data support.

joebone commented 4 years ago

First off, thanks for the quick response, wasn't expecting you to be quite so on-the-ball. And to be fair, you may well have a point on the perception of benefit vs the real.

Our particular use case would probably benefit more from the object reuse/pooling on type than from the saved allocations.

The code sample I provided above is not far from the actual implementation we are using, with the distinction that the fantasy line var reader = ObjectReader.Create<T>(slice.Span.GetEnumerator(), members) is actually just implemented as a slice.ToArray().

Yea.

However I had up until this moment completely forgotten about ArraySegment<T>, which is basically what I was looking for, and does implement the required interfaces.
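[Editor's note: since ArraySegment<T> implements IEnumerable<T> (and IList<T>), the batching loop from the opening comment could be written this way today without the slice.ToArray() copy. This is a sketch only; the method name is invented, and arr, tableName, members, _connectionString and _logger are the same assumed variables as in the earlier example.]

```csharp
// Hypothetical adaptation: ArraySegment<T> is an IEnumerable<T>, so it can be
// handed to ObjectReader.Create directly -- no intermediate array allocation.
private async Task BatchedBulkInsertViaSegmentsAsync<T>(
    T[] arr, string tableName, string[] members, int chunkSize = 100000)
{
    for (int i = 0; i < arr.Length; i += chunkSize)
    {
        var segment = new ArraySegment<T>(arr, i, Math.Min(chunkSize, arr.Length - i));
        using (var bcp = new SqlBulkCopy(_connectionString))
        using (var reader = ObjectReader.Create(segment, members))
        {
            bcp.DestinationTableName = tableName;
            await bcp.WriteToServerAsync(reader);
        }
    }
}
```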

So I guess the saving would be avoiding recreating the TypeAccessor and iterating the members each time - admittedly not a lot compared to the catastrophic copies of 100K elements to the LOH, but still.. I'm cheap. If I can avoid paying twice for the same work, I will.

Perhaps a change in scope to the original request then - a Recreate(newSource) or equivalent API that would set the private IEnumerator source?

mgravell commented 4 years ago

That still allocates, though. What is worth considering, then: this is all pretty doable. Consuming Memory<T>-based code can always use TryGetArray, and if unsuccessful, use the array-pool to flatten to an array. Useful?
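[Editor's note: a sketch of the consumption pattern Marc suggests, using MemoryMarshal.TryGetArray and ArrayPool<T>; the helper name and tuple shape are invented for illustration.]

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

static class MemoryFlatten
{
    // If the memory wraps an array, expose it directly; otherwise rent a
    // pooled buffer and copy, instead of allocating a fresh array per batch.
    public static (T[] Array, int Offset, int Count, bool Rented) GetArray<T>(
        ReadOnlyMemory<T> memory)
    {
        if (MemoryMarshal.TryGetArray(memory, out ArraySegment<T> segment))
            return (segment.Array, segment.Offset, segment.Count, false);

        T[] rented = ArrayPool<T>.Shared.Rent(memory.Length);
        memory.Span.CopyTo(rented);
        // Caller must ArrayPool<T>.Shared.Return(rented) when Rented is true.
        return (rented, 0, memory.Length, true);
    }
}
```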


joebone commented 4 years ago

To be sure I'm understanding correctly, are you suggesting a subclass of ObjectReader?

I hadn't actually considered that, but yea, that would allow more control at the cost of flexibility. Following the whole pit-of-success axiom, it would then be useful to include an analyzer/fixer with the NuGet package that advises when the more optimal path can be used (for discoverability purposes).

Or perhaps an internal fast path, like Count on IEnumerable when it is an IList?

Either way, just having the class would definitely benefit us 😊.

I do feel a bit guilty about just crying for features without at least contributing though so maybe I'll branch and have a stab at an implementation myself, if for no other reason than practice.

mgravell commented 4 years ago

Well, since the Create method is static, I was assuming we could make the existing implementation code into one sub-class, the array-specific code into another, add a Create overload that takes an array and optional offset/count, and change the existing Create to check for arrays. With that, even people already using arrays get the new code for free.
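[Editor's note: a rough API shape for what Marc describes; ArrayObjectReader and EnumerableObjectReader are hypothetical names, not types that exist in fast-member.]

```csharp
// Sketch: the existing Create detects arrays and routes them to an
// array-specific subclass, so current array callers get the fast path free;
// a new overload accepts an explicit offset/count window for batching.
public static ObjectReader Create<T>(IEnumerable<T> source, params string[] members)
{
    if (source is T[] array)
        return new ArrayObjectReader<T>(array, 0, array.Length, members); // hypothetical
    return new EnumerableObjectReader<T>(source, members);                // existing path
}

public static ObjectReader Create<T>(T[] source, int offset, int count,
                                     params string[] members)
    => new ArrayObjectReader<T>(source, offset, count, members);
```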


joebone commented 4 years ago

Yea, whatever I was thinking was far more over-engineered than that :p Awesome, sounds great!