vibe-d / vibe-inet

Internet standards related base functionality (URL, HTML encoding, ...)
MIT License
1 stars 1 forks source link

Add a MultipartEntity definition for vibe.inet.webform. #2

Open vnayar opened 5 months ago

vnayar commented 5 months ago

This is an initial attempt to implement a standards-compliant multipart/form-data request.

See https://github.com/vibe-d/vibe-http/issues/32

The initial commit here is preliminary and will require iteration, due to a lack of familiarity on the author's part of Vibe's internal conventions and patterns.

s-ludwig commented 5 months ago

One thing that is sub optimal about the way of defining the whole multipart entity up-front is that it inevitably requires a number of dynamic memory allocations. In high-throughput scenarios, this has repeatedly turned out to be a serious bottleneck or resulting in pseudo memory leaks with the GC implementation.

A way to mitigate this would be to change the entity/part types from class to template structs with appropriate factory functions:

auto multiPartEntity(HEADER_MAP, PARTS)(HEADER_MAP headers, PARTS parts) {
    return MultipartEntity!(HEADER_MAP, PARTS)(headers, parts);
}

auto multiPartEntity(PARTS)(string content_type, PARTS parts) {
    auto headers = only(tuple("Content-Type", content_type));
    return MultipartEntity!(typeof(headers), PARTS)(headers, parts);
}

auto formPart(string name, string value) {
    auto headers = only(...);
    return MultipartEntityPart!(typeof(headers), string)(headers, value);
}

auto formPart(string name, NativePath file_path) {
    // ...
}

struct MultipartEntity(HEADER_MAP, PARTS) {
    HEADER_MAP headers;
    PARTS parts;
    string boundary;
}

struct MultipartEntityPart(HEADER_MAP, VALUE) {
    HEADER_MAP headers;
    VALUE value;
}
vnayar commented 5 months ago

One thing that is sub optimal about the way of defining the whole multipart entity up-front is that it inevitably requires a number of dynamic memory allocations. In high-throughput scenarios, this has repeatedly turned out to be a serious bottleneck or resulting in pseudo memory leaks with the GC implementation.

I wasn't fully aware of that. Originally I was hoping the use of classes would help make it easier to pass the object by reference and avoid copies, but wasn't aware that dynamic allocations are so much more expensive. What is it about the GC that makes them so costly?

Also, why use templates for the attribute types?

struct MultipartEntityPart(HEADER_MAP, VALUE) {
    HEADER_MAP headers;
    VALUE value;
}

Don't concrete types make it easier to utilize the values when writing them to a connection, such as in vibe-http, client.d? https://github.com/vnayar/vibe-http/blob/multipart-form-data-entities-rfc-2388/source/vibe/http/client.d#L909

Maybe I'm confused about the actual status of types like InetHeaders and InputStream. These are the types utilized within vibe, right?

The MultipartEntityPart body type of SumType!(string, InterfaceProxy!InputStream) was selected to permit a body to be chosen either by giving a fixed value (a common case with <input> and <select> tags in HTML forms, but also to chose an available file, or to choose data that a service already has in memory (e.g. a server calling an API endpoint, and submitting data that it got from elsewhere, such as by extracting an email or generating a file in memory using a library.)

s-ludwig commented 5 months ago

One thing that is sub optimal about the way of defining the whole multipart entity up-front is that it inevitably requires a number of dynamic memory allocations. In high-throughput scenarios, this has repeatedly turned out to be a serious bottleneck or resulting in pseudo memory leaks with the GC implementation.

I wasn't fully aware of that. Originally I was hoping the use of classes would help make it easier to pass the object by reference and avoid copies, but wasn't aware that dynamic allocations are so much more expensive. What is it about the GC that makes them so costly?

There are two escalation stages with the current GC. The first one happens because the collection stage uses a stop-the-world approach and allocations require taking a global GC lock. In situations with high GC pressure or usage frequency, this can nullify any multi-threading performance gains.

The second stage happens when for some reason (probably memory pool fragmentation, but maybe also due to the way the GC selects a pool for a new allocation, I haven't looked close enough into the implementation) the GC cannot consistently reuse memory for new allocations. This then leads to infinite grown of allocated pools, which is a de-facto memory leak. I've observed this being triggered both in vibe.d web application contexts and in our photo application as soon as rapid allocations for more than a few bytes happen from multiple threads at once. This happens rather rarely, but when it does it's hard to track down.

Also, why use templates for the attribute types?

The idea is that it allows the user to avoid any heap/GC allocations whenever possible. So for example a static array or an only() range can be passed instead of a dynamic array. This can also be used on combinations, such as allocating default headers once in a dynamic array and then doing .join(only("Some-Header", value)) to dynamically append a header without having to allocate and without passing a lot of by-value memory.

Don't concrete types make it easier to utilize the values when writing them to a connection, such as in vibe-http, client.d? https://github.com/vnayar/vibe-http/blob/multipart-form-data-entities-rfc-2388/source/vibe/http/client.d#L909

Maybe I'm confused about the actual status of types like InetHeaders and InputStream. These are the types utilized within vibe, right?

The MultipartEntityPart body type of SumType!(string, InterfaceProxy!InputStream) was selected to permit a body to be chosen either by giving a fixed value (a common case with <input> and <select> tags in HTML forms, but also to chose an available file, or to choose data that a service already has in memory (e.g. a server calling an API endpoint, and submitting data that it got from elsewhere, such as by extracting an email or generating a file in memory using a library.)

I forgot to add the proper template constraints in my example. For HEADER_MAP this would be something like isStringMap!HEADER_MAP and for PARTS it could be isInputRange!PARTS && isMultipartBodyType(ElementType!PARTS). isMultipartBodyType could then accept string, const(ubyte)[], an isInputStream!T type, or a SumType!(...) of such types (possibly also Algebraic or TaggedUnion, if there is a common interface). Since both have an input range interface, the code to handle them shouldn't be much different from the code with concrete types.

Without an example this does become a bit more complicated to understand from the user's point of view, unfortunately, but having the high-level overloads for the common cases should mitigate that hopefully.

vnayar commented 2 months ago

I forgot to add the proper template constraints in my example. For HEADER_MAP this would be something like isStringMap!HEADER_MAP and for PARTS it could be isInputRange!PARTS && isMultipartBodyType(ElementType!PARTS). isMultipartBodyType could then accept string, const(ubyte)[], an isInputStream!T type, or a SumType!(...) of such types (possibly also Algebraic or TaggedUnion, if there is a common interface). Since both have an input range interface, the code to handle them shouldn't be much different from the code with concrete types.

In this particular case, I believe we won't be able to treat the parts as a range interface, because they lack a common element type. For example, a form might consist of 3 fields, a text input, a checkbox, and a multi-file upload. Thus a statement like isMultipartBodyType(ElementType!PARTS) wouldn't work, because there is no single element type for PARTS. I don't know if it's possible to create at compile-time a new SumType from the original part types. Is that what is needed? Perhaps as a stop-gap, we can simply make the returned type a SumType of all the types that can be returned by a helper function like auto formPart(string name, NativePath file_path) or auto formPart(string name, string value).

According to the spec, the default Content-Type of a multipart/form-data field is text/plain. So checkboxes, options, as well as text fields should all use text/plain (although technically they don't have to) https://datatracker.ietf.org/doc/html/rfc2388#section-3 File upload seems to be the only mentioned case where the Content-Type is explicitly set, and it can be something like application/pdf or multipart/mixed and represent a set of files.

It's a bit of a shame that there's no MIME RFC2046 library that already exists that could be used for this purpose. (Hmm, perhaps I should begin there?)

s-ludwig commented 2 months ago

Perhaps as a stop-gap, we can simply make the returned type a SumType of all the types that can be returned by a helper function like auto formPart(string name, NativePath file_path) or auto formPart(string name, string value)

Yeah, I'd say it makes sense to go from there. For the internal representation we should have that anyway, as Variant should really be a last-resort solution due to its less than ideal overall characteristics (GC allocations, virtual function calls, no nothrow/@safe).

I'd say that with changing to SumType and making the tests pass we can merge this and make further API refinements as separate steps.

vnayar commented 2 months ago

Perhaps as a stop-gap, we can simply make the returned type a SumType of all the types that can be returned by a helper function like auto formPart(string name, NativePath file_path) or auto formPart(string name, string value)

Yeah, I'd say it makes sense to go from there. For the internal representation we should have that anyway, as Variant should really be a last-resort solution due to its less than ideal overall characteristics (GC allocations, virtual function calls, no nothrow/@safe).

I'd say that with changing to SumType and making the tests pass we can merge this and make further API refinements as separate steps.

I ran into a problem with this approach. Because the Header type is templated, and the stream type that a file can be in is also templated (inside FormFile(InputStreamT)), there's no way to make a single SumType, because to do so would require knowing the HeaderT and InputStreamT values in advance.

The only way I can think of to make this work is to provide the HeaderT and InputStreamT types as template-parameters to MultipartEntity, and this would force all the different parts of the multipart form to use the same header/stream types. Only Variant permits mixing and matching of header/stream types.

However, because the HeaderT and InputStreamT is sometimes determined by factory methods, and mixing would no longer be allowed, it would essentially mean only a single type is permitted, and there's no point offering it as a template-parameter. At this point, one might as well hard-code InetHeaderMap for HeaderT and InterfaceProxy!InputStream for InputStreamT.

How important is it to offer arbitrary range types in the interface? If it's very important, then maybe Variant should be kept. If it's not important, then the header and file stream types should probably be locked in and the body could then be SumType.

vnayar commented 2 months ago

@s-ludwig So what do you think about the problem of using SumType?

In a nutshell, if I have a list/range of Multiparts (which can be the body of a multipart entity), then I need a type for that variable. However, if I provide a type, then it has to have a concrete header and body type. But this means it cannot support ranges, which themselves have no concrete type. E.g. you can't have a concrete range of different types of ranges. You would either have to lock-down supported range types, like InterfaceProxy!InputStream, or use Variant. Which is better?

s-ludwig commented 2 months ago

Sorry for replying late, I actually started to write a reply two weeks ago but then got sidetracked. But now actually, after re-reading everything, I think that what I meant in my first comment was actually to have the PARTS template parameter to be variadic and just forgot about it when you brought up the range type issue and didn't see it in the code.

So it would look like this: auto multiPartEntity(HEADER_MAP, PARTS...)(HEADER_MAP headers, PARTS parts)

Parts can then be given as multiple individually typed arguments. Each argument could be a range of parts, or a single part. The user could choose the range element type to be whatever is needed. Although it may not actually be necessary at all, we could also support using SumType as range elements and match accordingly, so that the user has maximum flexibility here to represent the parts list.

vnayar commented 1 month ago

Sorry for replying late, I actually started to write a reply two weeks ago but then got sidetracked. But now actually, after re-reading everything, I think that what I meant in my first comment was actually to have the PARTS template parameter to be variadic and just forgot about it when you brought up the range type issue and didn't see it in the code.

So it would look like this: auto multiPartEntity(HEADER_MAP, PARTS...)(HEADER_MAP headers, PARTS parts)

Parts can then be given as multiple individually typed arguments. Each argument could be a range of parts, or a single part. The user could choose the range element type to be whatever is needed. Although it may not actually be necessary at all, we could also support using SumType as range elements and match accordingly, so that the user has maximum flexibility here to represent the parts list.

I originally had tried something like this, and the code is still commented out, but it would look something like this:

/// Returns a MultipartEntity with Content-Type "multipart/form-data" from parts as a compile-time sequence.
auto multipartFormData(MultipartR...)(MultipartR parts) {
    string boundaryStr = createBoundaryString();
    auto headers = only(
            tuple("Content-Type", "multipart/form-data; boundary=" ~ boundaryStr));
    return MultipartEntity!(typeof(headers))(headers: headers, bodyValue: parts, boundary: boundaryStr);
}

In this case, the compiler expands the variadic parameter parts, and you end up with an error like this:

source/vibe/inet/webform.d(794,42): Error: template `vibe.inet.webform.MultipartEntity!(OnlyResult!(Tuple!(string, string))).MultipartEntity.__ctor` is not callable using argument types `!()(OnlyResult!(Tuple!(string, string)), MultipartEntity!(OnlyResult!(Tuple!(string, string))), MultipartEntity!(OnlyResult!(Tuple!(string, string), Tuple!(string, string), Tuple!(string, string))), MultipartEntity!(OnlyResult!(Tuple!(string, string), Tuple!(string, string))), string)`

Shortening the header types as H and body type as B, the error is:

source/vibe/inet/webform.d(794,42): Error: template `vibe.inet.webform.MultipartEntity!(H).MultipartEntity.__ctor` is not callable using argument types `!()(H, B1, B2, B3, string)`

In essence, the same problem remains. A concrete type that can fit all the diverse body types is still needed. And those body types might wrap a file or other arbitrary range. Thus, the body variable, if it does not wish to use Variant, would still need a SumType that encompasses all the possible body content types that are permitted: string, ranges, inputstreams, etc.

Even Variant doesn't solve the problem, because when sending the actual form on an HTTP connection, you need a way to write the contents of the multipart parts, and without a type, that's not really an option. The only ideas I have right now would be to support InputProxy!InputStream and InputRangeObject, these are at least concrete types one can compile against.

vnayar commented 1 month ago

I added a new commit that hopefully makes things easier to reason about this. I got rid of Variant and replaced it with SumType, and the code compiles with the minimal tests running as well. To do this, I simply got rid of all the template parameters.

Starting from something that's working, maybe it would be easier to specify, one at a time, what template parameters should exist.

s-ludwig commented 1 month ago

return MultipartEntity!(typeof(headers))(headers: headers, bodyValue: parts, boundary: boundaryStr);

This should be something like this:

return MultipartEntity!(typeof(headers), MultipartR)(headers: headers, bodyValue: parts, boundary: boundaryStr);

The named argument initialization probably cannot be used, yet, due to backwards compatibility and I'm not sure how well it plays with tuple typed fields, but in any case, the MultipartR types need to be explicitly passed to the MultipartEntity template type.

edit: The MultiPartEntity declaration would look like this:

struct MultipartEntity(HEADER_MAP, PARTS...) {
    HEADER_MAP headers;
    PARTS parts;
    string boundary;
}
vnayar commented 1 month ago

edit: The MultiPartEntity declaration would look like this:

struct MultipartEntity(HEADER_MAP, PARTS...) {
    HEADER_MAP headers;
    PARTS parts;
    string boundary;
}

Wouldn't that make it impossible to declare an object in practice? Imagine trying to a assign to parts. I could try something like parts = [ MultiPartEntity!(InetHeaderMap, string), MultiPartEntity!(InetHeaderMap, File) ], but these two items contain no common type, there could be no type implied for parts or even the literal value being assigned to it.

I don't think this level of desired flexibility can be accomplished without Variant. One either needs to sharply limit the types that a MultiPartEntity body can have, or use Variant. Variant is also only a half-solution, because one still needs to know how to write the contents of the body to a Stream. So it does matter whether it's a File, Stream, etc.

s-ludwig commented 1 month ago

The nice thing is that the user can prepare the parts however it fits best:

// functional style:

auto entity = multiPartEntity(header,
    formPart("user", "peter"),
    formPart("file", NativePath("foo.dat"));

auto entity = multiPartEntity(header,
    formPart("user", "peter"),
    paths.map!(p => formPart("file", p)));

// imperative style:

typeof(formPart("", NativePart.init))[] files;
foreach (path; paths)
    files ~= formPart("file", path);
auto entity = multiPartEntity(header,
    formPart("user", "peter"),
    files);

alias MyPartTypes = SumType!(typeof(formPart("", "")), typeof(formPart("", NativePath.init));
MyPartTypes[] parts;
parts ~= MyPartTypes(formPart("foo", "bar"));
parts ~= MyPartTypes(formPart("foo", NativePath("bar.dat")));
auto entity = multiPartEntity(header, parts);

The necessity for typeof when going the imperative route isn't very nice, but we could pre-define some type aliases for the two simplest cases, if that style of initialization is deemed important enough. However, I think that just using separate parameters for different types will work for almost all practical use cases, so that the necessity for SumType (or Variant) should be quite a niche case in practice.