kevin-montrose / Cesil

Modern CSV (De)Serializer
MIT License
65 stars 3 forks source link

Alternatives to IEnumerable<DynamicCellValue> for ITypeDescriber.GetCellsForDynamicRow #6

Closed kevin-montrose closed 3 years ago

kevin-montrose commented 4 years ago

This was posed in Overthinking CSV With Cesil: Writing Dynamic Types.

ITypeDescriber.GetCellsForDynamicRow(in WriteContext, object) is invoked once per row when writing dynamic types. While a client could (in theory) reuse the allocated IEnumerable<T> across multiple calls, this is difficult (bordering on impossible) in the general case because when the returned value is no longer in use is impossible to know. As a consequence, the natural way to implement GetCellsForDynamicRow() requires at least one heap allocation per call (in fact, this is what DefaultTypeDescriber does).

Is there an alternative method (or methods) / return type / etc. that would allow for no allocation implementations of GetCellsForDynamicRow's functionality?

benjamin-hodgson commented 4 years ago

paraphrasing a discussion in chat:

I suggested something along the lines of my IRewritable design

int CountCellsForDynamicRow(in WriteContext writeContext, object row);
// copy the cells into the output span
void GetCellsForDynamicRow(in WriteContext writeContext, object row, Span<DynamicCellValue> output);

so the calling code would look something like (pseudocode):

var count = describer.CountCellsForDynamicRow(writeContext, row);
var buffer = ArrayPool.Get(count);
describer.GetCellsForDynamicRow(writeContext, row, buffer.AsSpan().Slice(0, count));

// do something with the buffer

ArrayPool.Release(buffer);

Passing the output buffer as a span means you don't have to worry about it getting stored anywhere, so you can reuse the buffer secure in the knowledge that it's not going to get stomped by client code. You can also do fun stuff like have the span point to stack memory (see post linked above, plus my lib https://github.com/benjamin-hodgson/FixedSizeBuffers), but that won't work if you have to go async.

Kevin pointed out that counting is not necessarily cheap when the DLR is involved. (For my original IRewritable use case, getting the count usually is cheap.) This suggests that it might be better for the library to make a guess at the size of the output span (probably based on the width of rows already written) and retry when it's not big enough.

// copy the cells into the output span and return the count
int GetCellsForDynamicRow(in WriteContext writeContext, object row, Span<DynamicCellValue> output);

Library code:

var buffer = ArrayPool.Get(GuessCellCount());
var count = describer.GetCellsForDynamicRow(writeContext, row, buffer.AsSpan());

if (count > buffer.Length)
{
    // oops, not enough space in the buffer
    ArrayPool.Release(buffer);
    buffer = ArrayPool.Get(count);
    describer.GetCellsForDynamicRow(writeContext, row, buffer.AsSpan().Slice(0, count));
}

// do something with the buffer

ArrayPool.Release(buffer);

This way you only have to make multiple passes in the unlikely case that the row is longer than previous rows. It does make the ITypeDescriber implementation a little bit messier (it has to be prepared for the case that the span isn't big enough).

kevin-montrose commented 4 years ago

I'm planning on trying out this suggestion this weekend, now that #21 is finished.

kevin-montrose commented 4 years ago

First whack on this branch.

Still needs a little more work, in particular it currently guesses 8-cells per-row which is silly. There are a few other allocations that can get nuked along with this change as well, I think.

kevin-montrose commented 4 years ago

@benjamin-hodgson proper change as a PR now

One thing I'm mulling over before merging - I'm pulling the Span<DynamicCellValue> out of an ArrayPool<DynamicCellValue> (now added to Options), but I could take a MemoryPool<DynamicCellValue>. I went with ArrayPool<T> just 'cause that's where my head was after our chat.

I kind of like using MemoryPool<T> since IMemoryOwner<T> is a little nicer to work with. I'd have to give it a proper name on Options I supposed, can't have two MemoryPools. Unlike MemoryPool<char> it's not really possible for a consumer to back the provided Span<DynamicCellValue>s with anything but arrays though.

Any thoughts?

benjamin-hodgson commented 4 years ago

I think the default memory pool is just a wrapper for an array pool anyway.

benjamin-hodgson commented 4 years ago

I can imagine there being a use case for backing a MemoryPool by an arena or some other sort of custom managed allocator

kevin-montrose commented 4 years ago

Yeah, default MemoryPool<T> delegates to ArrayPool<T>. A meaningful difference did occur to me though, a MemoryPool<T> lets you serve multiple requests out of a single array, whereas ArrayPool<T> you have to hand back a unique array on each call (until they're returned anyway).

kevin-montrose commented 4 years ago

24 has now been merged into vNext

benjamin-hodgson commented 4 years ago

@kevin-montrose The default MemoryPool allocates the IMemoryOwner when you call Rent. https://github.com/dotnet/runtime/blob/3155ca2dfa20e11949f24e439b54ba1816be3c22/src/libraries/System.Memory/src/System/Buffers/ArrayMemoryPool.cs#L22

kevin-montrose commented 4 years ago

@benjamin-hodgson yeah, that's not great. The interface is more flexible, but the more I look at the implementation the more I'm tempted to actually implement a custom MemoryPool<T>...

kevin-montrose commented 3 years ago

This has shipped.