Closed LukeButters closed 6 years ago
I have made a pull request which changes MultiPartInputStream to be faster. The jetty-util tests pass; however, when it gets to the HTTP Client Transports tests the VM seems to exit. I don't call System.exit, so I am not sure why this is happening on my machine.
Hopefully we can work together to get this in.
@janbartel another MultiPartInputStream issue to review.
"another MultiPartInputStream issue to review." @joakime @janbartel is the multipart parser being rewritten?
@LukeButtersFunnelback it is our intention to rewrite the parser, but we are definitely behind in our schedule for that.
@janbartel can we look at the PR again, perhaps it is prudent to merge that pending the rewrite?
At this point I won't push to have this accepted, or fix that issue where a file is used for requests that are 1 byte too small or too big (I forget the details).
For whoever ends up writing the new parser, it might be worth looking at what I did here to try to come up with a fast parser. Alternatively, develop the parser and remind me to have a look at it so I can suggest speed-ups.
@olamy I'm assigning you to this issue. Can you coordinate with @gregw regarding a rewrite of the multipart stuff - we want to do it async.
Has any progress been made on the new multipart parser? Is it possible to have Jetty use a different class to process multipart requests?
@LukeButtersFunnelback apologies, but this work has not yet been resourced. It is definitely something we intend to do, but we have just not yet found the priority to do so. Coming up with a performant async multipart parser will need 3 or 4 days of clear thinking time, especially as we want it to be usable by reactive streams.
Unfortunately the class is not pluggable.
So we will try to look at this during March. If we do not make progress, then we will reconsider #1028 as an interim measure.
I've started work on this as my background task.
Currently my plan is to cut/paste/modify our HttpParser for the parsing of the headers of each part, then add a Boyer-Moore search mechanism to efficiently find the boundaries in the text. The complications are that this parser needs to deal with partial content, and that we prefer parsers to be fully consuming so that we don't have to start buffer fills from non-zero positions.
@lachlan-roberts have a read of the background on this one as I will get you to review my work on it.
If you are copying the HttpParser, it might be worth mentioning that Jetty is relatively slow at parsing headers (compared to something like Jackson parsing JSON when given a byte array); I didn't look into why it is slower than expected. If I had to guess, the issue is likely reading one byte from the stream at a time, which is slower than reading a chunk. This will make your code complicated :(
Actually, a straight BM search might not be the best algorithm. We need to digest the info in https://arxiv.org/pdf/1012.2547v1.pdf and pick a good algorithm that is not too difficult to implement.
Our needles (aka boundary patterns) are typically on the order of 32 bytes, but can be shorter or longer. Our haystacks (aka search text) are in a 256-character alphabet, but probably with a heavy bias towards <60 of those characters. Size can be anything from a few tens of bytes to several GBs.
I think the frequency of large uploads is sufficient that we can invest some effort in preprocessing our needle.
I'm currently leaning towards using the Horspool algorithm, as it appears to be generally good and looks pretty simple to implement (any of the more complex algorithms would best be supported with a library).
We still need to work out how to handle the residue characters in a buffer of a partial part... it is probably worthwhile scanning them for --, as if that is not there, then they can be considered content and consumed.
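To make the preprocessing idea concrete, here is an illustrative sketch (not Jetty code; the class name is hypothetical) of the Horspool bad-character table mentioned above. For each byte value it records how far the search window can be shifted when that byte is the last byte of the current window:

```java
public class HorspoolShift
{
    public static int[] buildShift(byte[] pattern)
    {
        int[] shift = new int[256];
        // A byte that does not occur in the pattern allows a full-length shift.
        java.util.Arrays.fill(shift, pattern.length);
        // Bytes occurring earlier in the pattern allow smaller shifts;
        // the final pattern byte is deliberately excluded from the table.
        for (int i = 0; i < pattern.length - 1; i++)
            shift[pattern[i] & 0xFF] = pattern.length - 1 - i;
        return shift;
    }
}
```

For a typical 32-byte boundary most of the 256 byte values never occur in the pattern, so the common case shifts the window a full 32 bytes at once, which is where the speed-up over a brute-force scan comes from.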
@LukeButtersFunnelback I'm not sure it is a fair comparison, looking at HTTP vs JSON parsing.
Firstly, HTTP is a cantankerous, old-fashioned protocol with lots of legacy ifs and buts and an overly complex specification for such simple semantics. JSON is a much more regular format with a good, clear, simple specification.
Secondly Jetty's HttpParser is doing a lot more than just parsing the HTTP message as it is also interpreting the meaning of the main header fields and tokenising their keys and many values.
Jetty's HTTP parser certainly doesn't read one byte from the stream at a time. It works on chunks of bytes and uses efficient Trie lookups both to parse fields and to identify known fields, both from the protocol (eg "Connection: close") and from a cache for the connection (eg "User-Agent: someStringThatIsTheSameInEveryRequest"). A JSON parser will just be creating a Map<String,String>, and much post-processing would be needed to achieve the same semantic handling that is provided by the HttpParser.
Note also that for multipart, much of the complexity will be able to be taken out. Besides, I doubt that the expense of multipart is parsing the few headers at the start of each part; rather it is looking for the boundary within the possibly very large part content. This is currently done with a brute-force character-by-character search. Using a BM-based search algorithm is going to greatly improve this: just consider that a 32-byte boundary in a document with 64 frequently used characters will have at least a 50% chance of a bad-character match, allowing the pattern to be shifted by 32 bytes and 31 comparisons skipped.
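The Trie lookup described above can be sketched as follows (a simplified illustration, not Jetty's actual Trie classes): a byte-wise trie over ASCII keys lets the parser look up a known field name directly from the raw buffer, one array index per byte, with no String allocation and no hashing:

```java
public class ByteTrie
{
    private static class Node
    {
        final Node[] next = new Node[128]; // one slot per ASCII byte
        String value;                      // non-null only at a complete key
    }

    private final Node root = new Node();

    public void put(String key, String value)
    {
        Node node = root;
        for (int i = 0; i < key.length(); i++)
        {
            int c = key.charAt(i) & 0x7F;
            if (node.next[c] == null)
                node.next[c] = new Node();
            node = node.next[c];
        }
        node.value = value;
    }

    // Walk the trie directly over the unparsed bytes of the buffer.
    public String get(byte[] data, int offset, int length)
    {
        Node node = root;
        for (int i = 0; i < length; i++)
        {
            node = node.next[data[offset + i] & 0x7F];
            if (node == null)
                return null;
        }
        return node.value;
    }
}
```

Jetty's real implementation adds details omitted here (case-insensitive matching, cached whole header lines), but the principle is the same.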
@janbartel Do you know why MultiPartInputStreamParser is in jetty-util rather than jetty-http? It is the only thing in jetty-util that requires the servlet-api.
I'm inclined to create the new parser in jetty-http, as then we can internally reuse classes like HttpFields
@janbartel never mind, I don't think we can use any http classes, as while similar, the multipart RFCs do things a little differently (eg %-encoded rather than \-escaped values). Best to re-implement.
@lachlan-roberts Perhaps you can work on implementing this:

public interface SearchPattern
{
    /**
     * @param pattern The pattern to search for.
     * @return A SearchPattern instance for the search pattern
     */
    static SearchPattern compile(String pattern) { return null; }

    /**
     * Search for a complete match of the pattern within the data.
     * @param data The data in which to search. The data may be arbitrary binary data,
     *             but the pattern will always be {@link StandardCharsets#US_ASCII} encoded.
     * @param offset The offset within the data at which to start the search
     * @param length The length of the data to search
     * @return The index within the data array at which the first instance of the pattern begins, or -1 if not found
     */
    public int match(byte[] data, int offset, int length);

    /**
     * Search for a partial match of the pattern at the end of the data.
     * @param data The data in which to search. The data may be arbitrary binary data,
     *             but the pattern will always be {@link StandardCharsets#US_ASCII} encoded.
     * @param offset The offset within the data at which to start the search
     * @param length The length of the data to search
     * @return The length of the partial pattern matched, or -1 for no match.
     */
    public int endsPartiallyWith(byte[] data, int offset, int length);

    /**
     * Search for a partial match of the pattern at the start of the data.
     * @param data The data in which to search. The data may be arbitrary binary data,
     *             but the pattern will always be {@link StandardCharsets#US_ASCII} encoded.
     * @param offset The offset within the data at which to start the search
     * @param length The length of the data to search
     * @param matched The length of the partial pattern already matched
     * @return The length of the partial pattern matched, or -1 for no match.
     */
    public int startsPartiallyWith(byte[] data, int offset, int length, int matched);
}
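To illustrate, here is a hypothetical Horspool-based implementation of this interface (a sketch only, not the code that was eventually merged; the class name and the partial-match semantics, which the javadoc leaves somewhat open, are assumptions):

```java
public class HorspoolSearchPattern
{
    private final byte[] pattern;
    private final int[] shift = new int[256];

    public static HorspoolSearchPattern compile(String pattern)
    {
        return new HorspoolSearchPattern(
            pattern.getBytes(java.nio.charset.StandardCharsets.US_ASCII));
    }

    private HorspoolSearchPattern(byte[] pattern)
    {
        this.pattern = pattern;
        // Bad-character table: bytes absent from the pattern permit a
        // full-length shift; the final pattern byte is excluded.
        java.util.Arrays.fill(shift, pattern.length);
        for (int i = 0; i < pattern.length - 1; i++)
            shift[pattern[i] & 0xFF] = pattern.length - 1 - i;
    }

    /** Full match: index of the first occurrence within the data, or -1. */
    public int match(byte[] data, int offset, int length)
    {
        int i = offset;
        int last = offset + length - pattern.length;
        while (i <= last)
        {
            int j = pattern.length - 1;
            while (j >= 0 && pattern[j] == data[i + j])
                j--;
            if (j < 0)
                return i;
            // Shift by the table entry for the byte under the pattern's tail.
            i += shift[data[i + pattern.length - 1] & 0xFF];
        }
        return -1;
    }

    /** Longest proper prefix of the pattern ending the data, or -1 if none. */
    public int endsPartiallyWith(byte[] data, int offset, int length)
    {
        for (int len = Math.min(pattern.length - 1, length); len > 0; len--)
        {
            boolean matched = true;
            for (int j = 0; j < len && matched; j++)
                matched = data[offset + length - len + j] == pattern[j];
            if (matched)
                return len;
        }
        return -1;
    }

    /** Continue a partial match at the start of new data. */
    public int startsPartiallyWith(byte[] data, int offset, int length, int matched)
    {
        int remaining = pattern.length - matched;
        if (length < remaining)
            remaining = length; // the new data may again end mid-pattern
        for (int j = 0; j < remaining; j++)
            if (data[offset + j] != pattern[matched + j])
                return -1;
        return matched + remaining;
    }
}
```

The partial-match methods are what let the parser handle a boundary that straddles two buffer fills: scan the tail of one buffer with endsPartiallyWith, then resume with startsPartiallyWith on the next.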
@lachlan-roberts I'm using your SearchPattern to build the new MultiPartParser in branch https://github.com/eclipse/jetty.project/tree/jetty-9.4.x-1027-Multipart I've got it working for the preamble and the first fields and am now up to the point of parsing the first part content. Do you want to give this a go? You'll need to pull from my branch.
@lachlan-roberts I've just added some test cases for part content to the unit test (see 1352be4308b87b10c04c8c04175e23e34d449f6a)
@LukeButtersFunnelback the changes for the faster MultiPart parser have been merged into the main branch, you can test out the changes by setting the multipart/form-data compliance mode to RFC7578 in the server.ini file.
Um, n00b question: I don't actually know the literal line that would need to be added. Also, I have Jetty set up programmatically; should I be calling something on my org.eclipse.jetty.server.Server object instead?
These lines should be in server.ini; they just need to be uncommented and changed to RFC7578.
## multipart/form-data compliance mode of: LEGACY(slow), RFC7578(fast)
# jetty.httpConfig.multiPartFormDataCompliance=LEGACY
Otherwise you should be able to change it by calling setMultiPartFormDataCompliance(MultiPartFormDataCompliance.RFC7578) on the org.eclipse.jetty.server.HttpConfiguration class.
You can get a HttpConfiguration instance from the HttpConnectionFactory in the connector. So to set the compliance mode to RFC7578 you can do something like this.
connector.getConnectionFactory(HttpConnectionFactory.class).getHttpConfiguration().setMultiPartFormDataCompliance(MultiPartFormDataCompliance.RFC7578);
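Putting the pieces together, a fully programmatic setup might look something like the following sketch (assuming Jetty 9.4, a single connector, and illustrative variable names):

```java
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.MultiPartFormDataCompliance;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class FastMultiPartServer
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        // Select the fast RFC7578 multipart parser on the HTTP configuration.
        HttpConfiguration httpConfig = new HttpConfiguration();
        httpConfig.setMultiPartFormDataCompliance(MultiPartFormDataCompliance.RFC7578);

        ServerConnector connector =
            new ServerConnector(server, new HttpConnectionFactory(httpConfig));
        connector.setPort(8080);
        server.addConnector(connector);

        server.start();
        server.join();
    }
}
```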
@LukeButtersFunnelback any chance to test this?
Not yet, but I hope to do it today or tomorrow. (The day that you told me this was ready I was going away; I have returned today, but I am not sure if I can get to it as I need to respond to a bunch of emails.)
OK, so I have done some comparisons:
Before this change: ~1700 per second
After the change: ~2330 per second
Compared to non-multipart: ~3400 per second.
I am sending a multipart request which has a very small JSON part (less than 100 bytes) as the first part and some content of 10KB-1MB as the second part. For the non-multipart case I move the data that is in the JSON into the HTTP headers.
It is faster, thanks!!!!
Although it would be nice to be able to get close to what can be done for non-multipart.
I have done some analysis and found that a MultiException is created in org.eclipse.jetty.http.MultiPartFormInputStream.deleteParts() (line 450). Creating exceptions is very expensive because of the JVM's need to fill in the stack trace. I think the exception should only be made if we actually have an exception to give to the MultiException. I can re-run the checks once that change is in and try to work out if anything else can be done to help it run fast.
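The lazy-creation idea suggested here could look something like this generic sketch (plain Java for illustration, not Jetty's actual MultiException; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class LazyErrors
{
    // Allocated only once the first error is recorded, so the common
    // error-free path creates no objects and fills in no stack traces.
    private List<Throwable> errors;

    public void add(Throwable error)
    {
        if (errors == null)
            errors = new ArrayList<>();
        errors.add(error);
    }

    public boolean isEmpty()
    {
        return errors == null || errors.isEmpty();
    }

    public void ifExceptionThrowRuntime()
    {
        if (errors == null)
            return; // fast path: nothing recorded, nothing constructed
        RuntimeException combined = new RuntimeException("multiple errors");
        for (Throwable t : errors)
            combined.addSuppressed(t);
        throw combined;
    }
}
```

The point is that the expensive exception object (and its stack-trace fill-in) only exists on the path where something actually went wrong.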
With multipart, would it be possible (once the request is parsed) to have access to the byte arrays each part is written to, rather than reading from the InputStream of each part?
Hi Luke, thanks for testing. Do you have an example of your test data so I can test with that too?
We can look into lazily creating the exception in deleteParts.
Not sure about accessing the byte array - for some requests the data is written to a file so there is no byte array.
@LukeButtersFunnelback I think one of the changes I'd like to make in the next round of the servlet spec is better access to the content of Part - including both raw bytes or characters, without using a blocking API.
However, I'm not sure a simple byte[] getBytes() method is sufficient, as that array could be very large.
Short answer is I probably shouldn't give you an example of the data I am currently using; however, I think the data is not actually that important in this test.
What is it that you are looking for in the data? Do you want to see how I format the multipart request?
Yes, I think that the exception should be created only when it is going to be thrown; perhaps the class needs to be rethought. Do you believe that it is worth doing? If not, would you like me to clone and hack some changes to give you an idea of the speed-up that would be gained? @janbartel @lachlan-roberts @gregw
What I would like to avoid is:
1. Copying the data into a new byte[], because that is surprisingly expensive. (Do you know about larray, where you can allocate off-heap memory far more cheaply than the JVM? Perhaps this is an option that could be added.)
2. If the ByteArrayInputStream's buffer is too small it will make a new byte[] and copy the data over to the new one; this can be avoided if the initial buffer is the correct size, which maybe we can work out from the headers in each part.
I think neither is easy to solve, and neither is going to get multipart to be as fast as non-multipart.

The unnecessary exception was removed in https://github.com/eclipse/jetty.project/pull/2582 .
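The presizing point can be sketched like this (plain Java for illustration; the idea of taking a size hint from the part's headers is the suggestion above, here passed in as a contentLength parameter, and the class name is hypothetical):

```java
import java.io.ByteArrayOutputStream;

public class PartBuffers
{
    // Fallback when a part carries no usable size hint in its headers.
    private static final int DEFAULT_SIZE = 4096;

    // ByteArrayOutputStream grows by allocating a larger byte[] and copying
    // the old contents across each time it fills; presizing the buffer from
    // a Content-Length style hint avoids those grow-and-copy cycles.
    public static ByteArrayOutputStream forPart(long contentLength)
    {
        if (contentLength > 0 && contentLength <= Integer.MAX_VALUE)
            return new ByteArrayOutputStream((int) contentLength);
        return new ByteArrayOutputStream(DEFAULT_SIZE);
    }
}
```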
Thanks for working on this.
Hi, I have an API which accepts a JSON part and a file part. I noticed that with multipart requests for a document of about 50KB, the total time to push in some number of documents was about 26% longer compared to not using multipart. Looking at the hotspots, I noticed a lot of time was spent in MultiPartInputStreamParser.
It would be nice if this was faster; I would certainly benefit from that class being faster.