kevin-montrose / Cesil

Modern CSV (De)Serializer
MIT License
65 stars 3 forks source link

Is there anything missing from IReader(Async) and IWriter(Async)? #11

Open kevin-montrose opened 4 years ago

kevin-montrose commented 4 years ago

This was posed in Overthinking CSV With Cesil: A “Modern” Interface and repeated in Overthinking CSV With Cesil: Reading Known Types.

Basically, is there anything missing from IReader, IWriter, or their async equivalents?

CesilUtils also exposes helpers for most of those methods, so failings in it may also imply failings in the interfaces.

This is an Open Question, which means:

[A]s part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Tornhoof commented 4 years ago

A few questions (about the Async interfaces, I didn't look much into the implementations):

Isn't this missing the cancellationtoken and if yes, you probably need that special attribute [EnumeratorCancellation]?

https://github.com/kevin-montrose/Cesil/blob/cfd387fd0d64e51a14ffbd692658ff6fc8a6bd15/Cesil/Interface/IAsyncReader.cs#L22

  1. Why the class constraint? to make the IL gen easier? https://github.com/kevin-montrose/Cesil/blob/cfd387fd0d64e51a14ffbd692658ff6fc8a6bd15/Cesil/Interface/IAsyncReader.cs#L32-L33

  2. ref in async is a bit strange, do you need the ref for your 'If need be, row will be initialized before'? An actual async implementation (with awaits) can't implement that interface because 'async methods can't have ref parameters' https://github.com/kevin-montrose/Cesil/blob/cfd387fd0d64e51a14ffbd692658ff6fc8a6bd15/Cesil/Interface/IAsyncReader.cs#L65

  3. As for IAsyncWriter<TRow>, did you consider returning the rowcount of written rows, might be useful for logging.

  4. The name cancel for the cancellationToken is interesting, I've seen cancellationToken and token but never cancel I think.

kevin-montrose commented 4 years ago
  1. Isn't this missing the cancellationtoken and if yes, you probably need that special attribute [EnumeratorCancellation]?

This is kind of a tricky bit of async enumerators, but I believe not taking a CancellationToken is correct here.

We're returning an IAsyncEnumerable, CancellationToken is used by a IAsyncEnumerator. You use [EnumeratorCancellation] to ensure that compiler generated code knows where it should be pulling a token from when it implicitly calls GetAsyncEnumerator(...). But none of this code is compiler generated, and there is no call (implicit or otherwise) to GetAsyncEnumerator()

Put another way, EnumerateAsync() doesn't actually do any work or start any tasks - work is deferred until the first call to IAsyncEnumerator<T>.MoveNextAsync().

If you're not going to explicit invoke GetAsyncEnumerator(), .NET has WithCancellation to bind a CancellationToken to an async enumerable.

  1. Why the class constraint? to make the IL gen easier?

Ah, that is to prevent a footgun - if TCollection is a value type, you're actually passing a copy and any mutation isn't going to be reflected in the original value.

I suppose by only passing the collection around internally by ref we can avoid breaking value type collections, though it'd still be up to the client to make sure they always look at the returned, probably modified, copy.

It really is a question of if supporting value type collections is worth extra complexity (there may be a tiny perf hit due to the extra indirection, but not enough to matter IMO). I'm kind of leaning towards "yes, remove the : class constraint". What do you think?

  1. ref in async is a bit strange, do you need the ref for your 'If need be, row will be initialized before'? An actual async implementation (with awaits) can't implement that interface because 'async methods can't have ref parameters'.

TryReadWithReuseAsync has a ref parameter to ease reusing a row, which is kind of necessary if there's a value type TRow. It's also a good symmetry with the sync version.

You're technically correct that an async method cannot have ref parameters, but an interface cannot require a method be implemented with the async keyword. What IAsyncReader<TRow> demands is that the method returns ValueTask<TRow> and has the given parameters. This is a subtle distinction, but important.

The way I implement TryReadWithReuseAsync is to attempt to complete synchronously (without blocking, of course), but if Cesil cannot complete to then 1) make sure a row is available (which may write to the ref param) 2) invoke a truly async implementation, passing the known to be initialized row in as a plain parameter. This means that whenever TryReadWithReuseAsync returned, regardless of whether it returns a pending or completed task, all the writes to the ref parameter (though not necessarily the value it points to) have completed.

You didn't ask, but it sort of naturally follows: "why doesn't TryReadAsync() have an out parameter then?". It's becauseTryReadWithReuseAsync can legally fail to use the ref'd row but TryReadAsync must put it's value in the out param, and that isn't always possible if TRow is provided via a constructor that takes parameters.

  1. As for IAsyncWriter, did you consider returning the rowcount of written rows, might be useful for logging.

That's a good idea, any of the "write multiple" methods really should return a written row count.

  1. The name cancel for the cancellationToken is interesting, I've seen cancellationToken and token but never cancel I think.

That is also a good point, fortunately I have some tests around parameter names... so cancellationToken can be easily enforced everywhere.

Tornhoof commented 4 years ago

Thank you for your detailed answer and I think it dawned on me, the consumer never ever implements these interfaces, he only consumes them, right? I looked at these interfaces, I looked at your examples and somehow assumed for some advanced usage the user is required to have their own implementations of the interfaces, but looking through more of your examples, I realized that I'm on the wrong track.

1. This is kind of a tricky bit of async enumerators, but I believe not taking a CancellationToken is correct here.

After looking at your implementation, you return a custom type for it AsyncEnumerable and at @mgravell's excellent blog post about async iterators https://blog.marcgravell.com/2020/05/the-anatomy-of-async-iterators-aka.html (your case is near the end, SomeSourceEnumerable), I think you're correct for your implementation. My assumption with the CancellationToken is only true if someone else would need to implement that interface.

2. I'm kind of leaning towards "yes, remove the : class constraint". What do you think?

I think it would be useful to remove the constraint, the only collection type I commonly use which is a struct is ImmutableArray<T>. I can't think of any other common collection which is a struct otherwise.

3. The way I implement TryReadWithReuseAsync is to attempt to complete synchronously (without blocking, of course)

Yes, my point is mood if only you implement the interface, because while the sync completion pattern for ValueTask via IsCompletedSuccessfully is neat, I would not want anyone else to implement that as there are certain complexities there.

You didn't ask, but it sort of naturally follows: "why doesn't TryReadAsync() have an out parameter then?".

Yeah I saw that and just assumed it's about symmetry with your other Try methods. I have to admit I'm not a real fan of changing the TryXXX(... out var y) pattern to ref, simple because if read TryXXX I automatically assume out simply because it's the normal pattern. I expect that the Reuse variety of your methods are for the advanced users who care much about allocations etc. and so the visibility of that change is probably limited.

And that reminds me of something I forgot about in IAsyncWriter, I think your IAsyncWriter interface could profit from ValueTask WriteCommentAsync(ReadOnlyMemory<char> comment, CancellationToken cancel = default). For quite a few Async implementations (most notably Stream.WriteAsync), ReadOnlyMemory overloads are added to improve allocations. If you want to be fancy you can use default interface implementation for that

    ValueTask WriteCommentAsync(string comment, CancellationToken cancel = default) =>
        WriteCommentAsync(comment.AsMemory(), cancel);
    ValueTask WriteCommentAsync(ReadOnlyMemory<char> comment, CancellationToken cancel = default);

Apropos, I think you can safely increase the TFM of your cesil.csproj to 3.1, as 3.0 is out of maintenance

kevin-montrose commented 4 years ago

Thank you for your detailed answer and I think it dawned on me, the consumer never ever implements these interfaces, he only consumes them, right?

That is correct, the interfaces are only meant to be implemented by Cesil code.

I think it would be useful to remove the constraint, the only collection type I commonly use which is a struct is ImmutableArray. I can't think of any other common collection which is a struct otherwise.

Immutable collections are actually even trickier, to add to them you have to capture the return value but since there's no common interface for the immutable collections it's not really possible to do that generically (that I have found anyway, would love to be proven wrong there). Instead you have to pass the builders (like ImmutableArray<T>.Builder which all implement ICollection<T>.

Cesil does have code to detect that a known immutable collection has been passed, and throw with advice to pass a builder instead.

Still planning to explore removing the class constraint, of course.

I expect that the Reuse variety of your methods are for the advanced users who care much about allocations etc. and so the visibility of that change is probably limited.

Yeah, they're threading a narrow use case: you really care about allocations, but don't want to go so far as to explicit implement object pooling as part of a custom InstanceProvider. Since the compiler will poke you about the different between ref and out (you can't TryXXX(ref var result), for example) I think misuse will be tricky.

And that reminds me of something I forgot about in IAsyncWriter, I think your IAsyncWriter interface could profit from ValueTask WriteCommentAsync(ReadOnlyMemory comment, CancellationToken cancel = default). For quite a few Async implementations (most notably Stream.WriteAsync), ReadOnlyMemory overloads are added to improve allocations.

That's a good idea, Cesil should support ReadOnlyMemory<char> on IAsyncWriter and ReadOnlySpan<char> on IWriter for comments. Writing loads of comments is... unusual, but no reason it should force an allocation.

mgravell commented 4 years ago

I can't think of any other common collection which is a struct otherwise.

If we're just talking about consuming: ArraySegment<T>, [ReadOnly]Memory<T> and possibly [ReadOnly]Span<T>; but for read/write: not so much. And indeed, for immutable things there is no one pattern that works for everything. For protobuf-net I had to add a ton of code to support all the common immutable types.

Tornhoof commented 4 years ago

Ah yes, I forgot ArraySegment, the ugly stepchild of array that existed for aeons but was pretty much unused until the whole ArrayPool and Span/Memory story started. I unfortunately don't know any good way for Immutable Collections support either, for ImmutableList, ImmutableArray, Immutable[Sorted]Dictionary you can use .Empty.AddRange, but that again is not supported for Sets.

Cesil does have code to detect that a known immutable collection has been passed, and throw with advice to pass a builder instead.

I personally think that's a nice solution for that problem.

kevin-montrose commented 4 years ago

There's now a PR making WriteAll(Async) return the number of written rows: #13.

Needs a bit more testing before getting merged in, but it's in progress.

kevin-montrose commented 4 years ago

I spent some time exploring removing the class constraint from TCollection on the reader interfaces, and there's actually a pretty big stumbling block. If IAsyncReader<TRow>.ReadAllAsync can't complete synchronously it has to make a copy of the given TCollection - it can't just stash a ref to it somewhere (it might be on the stack, and the stack frame is about to go bye-bye for one).

Since TCollection is mutable by definition, that's a giant issue (and would be a humongous one if it happened to work in the synchronous completion cases by accident). I think I have to keep the class restriction at least on IAsyncReader<TRow>.ReadAllAsync(...).

None of the above applies to IReader<TRow>.ReadAll(...), but I'm now inclined to keep it there too. The symmetry is desirable for one, but I also worry that it'd frustrate clients converting sync paths to async ones. Thoughts?

Tornhoof commented 4 years ago

The symmetry is desirable for one, but I also worry that it'd frustrate clients converting sync paths to async ones. Thoughts? I agree, keeping it both is preferable then.

kevin-montrose commented 4 years ago

New comment overloads have been implemented in #16 .

I think that covers all feedback to date on this issue, with the exception of moving from .NET Core 3.0 to 3.1. That is planned, I'm just going to wait until after the (already drafted) blog posts that explicitly refer to 3.0 are all published.

I'm going to spend some time getting all PRs merged into vNext in the near future.