jetty / jetty.project

Eclipse Jetty® - Web Container & Clients - supports HTTP/2, HTTP/1.1, HTTP/1.0, websocket, servlets, and more
https://eclipse.dev/jetty

MultiPartInputStreamParser is slow for largish files #1027

Closed LukeButters closed 6 years ago

LukeButters commented 8 years ago

Hi, I have an API which accepts a JSON part and a file part. I noticed that with multipart requests for documents of about 50KB, the total time to push in a number of documents was about 26% longer compared to not using multipart. Looking at the hotspots, I noticed a lot of time was spent in MultiPartInputStreamParser.

It would be nice if this were faster; I would certainly benefit from that class being faster.

LukeButters commented 8 years ago

I have made a pull request which changes MultiPartInputStream to be faster. The jetty-util tests pass; however, when it gets to the HTTP Client Transports tests the VM seems to exit. I don't call System.exit, so I am not sure why this is happening on my machine. Hopefully we can work together to get this in.

joakime commented 7 years ago

@janbartel another MultiPartInputStream issue to review.

LukeButters commented 7 years ago

"another MultiPartInputStream issue to review." @joakime @janbartel is the multipart parser being rewritten?

gregw commented 7 years ago

@LukeButtersFunnelback it is our intention to rewrite the parser, but we are definitely behind in our schedule for that.

@janbartel can we look at the PR again, perhaps it is prudent to merge that pending the rewrite?

LukeButters commented 7 years ago

At this point I won't push to have this accepted, or fix that issue where a file is used for requests that are 1 byte too small or too big (I forget the details).

For whoever ends up writing the new parser, it might be worth looking at what I did here to try to come up with a fast parser. Alternatively, develop the parser and remind me to have a look at it so I can suggest speed-ups.

janbartel commented 6 years ago

@olamy I'm assigning you to this issue. Can you coordinate with @gregw regarding a rewrite of the multipart stuff - we want to do it async.

LukeButters commented 6 years ago

Has any progress been made on the new multipart parser? Is it possible to have Jetty use a different class to process multipart requests?

gregw commented 6 years ago

@LukeButtersFunnelback apologies, but this work has not yet been resourced. It is definitely something we intend to do, but we have just not yet found the priority to do so. Coming up with a performant async multipart parser will need 3 or 4 days of clear thinking time, especially as we want it to be usable by reactive streams.

Unfortunately the class is not pluggable.

So we will try to look at this during March. If we do not make progress, then we will reconsider #1028 as an interim measure.

gregw commented 6 years ago

I've started work on this as my background task.

Currently my plan is to cut/paste/modify our HttpParser for parsing the headers of each part, then add a Boyer-Moore search mechanism to efficiently find the boundaries in the text. The complications are that this parser needs to deal with partial content, and that we prefer parsers to be fully consuming so that we don't have to start buffer fills from non-zero positions.

gregw commented 6 years ago

@lachlan-roberts have a read of the background on this one as I will get you to review my work on it.

LukeButters commented 6 years ago

If you are copying the HttpParser, it might be worth mentioning that Jetty is relatively slow at parsing headers (compared to something like Jackson parsing JSON when given a byte array); I didn't look into why it is slower than expected. If I had to guess, the issue is likely reading one byte from the stream at a time, which is slower than reading a chunk. This will make your code complicated :(

gregw commented 6 years ago

Actually, a straight BM search might not be the best algorithm. We need to digest the info in https://arxiv.org/pdf/1012.2547v1.pdf and pick a good algorithm that is not too difficult to implement.

Our needles (aka boundary patterns) are typically on the order of 32 bytes, but can be shorter or longer. Our haystacks (aka search text) are in a 256-character alphabet, but probably with a heavy bias towards <60 of those characters. Size can be anything from a few tens of bytes to several GBs.

I think the frequency of large uploads is sufficient that we can invest some effort in preprocessing our needle.

gregw commented 6 years ago

I'm currently leaning towards using the Horspool algorithm, as it appears to be generally good and looks pretty simple to implement (any of the more complex algorithms would best be supported with a library).

We still need to work out how to handle the residue characters in a buffer of a partial part... it is probably worthwhile scanning them for --, as if that is not there, then they can be considered content and consumed.

gregw commented 6 years ago

@LukeButtersFunnelback I'm not sure it is a fair comparison looking at HTTP vs JSON parsing.

Firstly, HTTP is a cantankerous, old-fashioned protocol with lots of legacy ifs and buts and an overly complex specification for such simple semantics. JSON is a much more regular format with a good, clear, simple specification.
Secondly, Jetty's HttpParser is doing a lot more than just parsing the HTTP message, as it is also interpreting the meaning of the main header fields and tokenising their keys and many values. Jetty's HTTP parser certainly doesn't read a byte from the stream at a time. It works on chunks of bytes and uses efficient Trie lookups both to parse fields and to identify known fields, both from the protocol (eg "Connection: close") and from a cache for the connection (eg "User-Agent: someStringThatIsTheSameInEveryRequest"). A JSON parser will just be creating a Map<String,String>, and much post-processing would be needed to achieve the same semantic handling that is provided by the HttpParser.

Note also that for multipart, much of the complexity can be taken out. Besides, I doubt that the expense of multipart is parsing the few headers at the start of each part; rather it is looking for the boundary within the possibly very large part content. This is currently done with a brute-force character-by-character search. Using a BM-based search algorithm is going to greatly improve this: just consider that a 32-byte boundary in a document with 64 frequently used characters will have at least a 50% chance of a bad-character mismatch, allowing the pattern to be shifted by 32 bytes and 31 comparisons skipped.

gregw commented 6 years ago

@janbartel Do you know why MultiPartInputStreamParser is in jetty-util rather than jetty-http? It is the only thing in jetty-util that requires the servlet-api. I'm inclined to create the new parser in jetty-http, as then we can internally reuse classes like HttpFields.

gregw commented 6 years ago

@janbartel never mind, I don't think we can use any http classes, as while similar, the multipart RFCs do things a little differently (eg %-encoded rather than \-escaped values). Best to re-implement.

gregw commented 6 years ago

@lachlan-roberts Perhaps you can work on implementing this:

    public interface SearchPattern
    {
        /**
         * @param pattern The pattern to search for.
         * @return A SearchPattern instance for the given pattern
         */
        static SearchPattern compile(String pattern) { return null; }

        /**
         * Search for a complete match of the pattern within the data.
         * @param data The data in which to search. The data may be arbitrary binary data,
         * but the pattern will always be {@link StandardCharsets#US_ASCII} encoded.
         * @param offset The offset within the data at which to start the search
         * @param length The length of the data to search
         * @return The index within the data array of the first instance of the pattern, or -1 if not found
         */
        public int match(byte[] data, int offset, int length);

        /**
         * Search for a partial match of the pattern at the end of the data.
         * @param data The data in which to search. The data may be arbitrary binary data,
         * but the pattern will always be {@link StandardCharsets#US_ASCII} encoded.
         * @param offset The offset within the data at which to start the search
         * @param length The length of the data to search
         * @return The length of the partial pattern matched, or -1 for no match.
         */
        public int endsPartiallyWith(byte[] data, int offset, int length);

        /**
         * Search for a partial match of the pattern at the start of the data.
         * @param data The data in which to search. The data may be arbitrary binary data,
         * but the pattern will always be {@link StandardCharsets#US_ASCII} encoded.
         * @param offset The offset within the data at which to start the search
         * @param length The length of the data to search
         * @param matched The length of the partial pattern already matched
         * @return The length of the partial pattern matched, or -1 for no match.
         */
        public int startsPartiallyWith(byte[] data, int offset, int length, int matched);
    }

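
A minimal Horspool-style implementation of the match method proposed above might look like the following sketch (the class name is made up for illustration, and this is not Jetty's actual implementation; it only shows the bad-character shift idea discussed in this thread):

```java
import java.nio.charset.StandardCharsets;

// Sketch of a Horspool bad-character search for an ASCII boundary pattern.
class HorspoolSearch
{
    private final byte[] pattern;
    private final int[] shift = new int[256]; // bad-character shift table

    HorspoolSearch(String pattern)
    {
        this.pattern = pattern.getBytes(StandardCharsets.US_ASCII);
        // A byte not occurring in the pattern lets us skip the full pattern length.
        for (int i = 0; i < 256; i++)
            shift[i] = this.pattern.length;
        // Bytes that do occur shift by their distance from the last position.
        for (int i = 0; i < this.pattern.length - 1; i++)
            shift[this.pattern[i] & 0xff] = this.pattern.length - 1 - i;
    }

    /** @return index of the first full match within [offset, offset+length), or -1 */
    int match(byte[] data, int offset, int length)
    {
        int last = pattern.length - 1;
        int end = offset + length - pattern.length;
        int i = offset;
        while (i <= end)
        {
            // Compare right-to-left from the end of the pattern.
            int j = last;
            while (j >= 0 && data[i + j] == pattern[j])
                j--;
            if (j < 0)
                return i;
            // Shift by the bad-character rule on the byte under the pattern's last position.
            i += shift[data[i + last] & 0xff];
        }
        return -1;
    }
}
```

With a 32-byte boundary over mostly text content, the common case is a full 32-byte shift per probe, which is where the speed-up over the character-by-character scan comes from.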
gregw commented 6 years ago

@lachlan-roberts I'm using your SearchPattern to build the new MultiPartParser in branch https://github.com/eclipse/jetty.project/tree/jetty-9.4.x-1027-Multipart I've got it working for the preamble and the first fields and am now up to the point of parsing the first part content. Do you want to give this a go? You'll need to pull from my branch.

gregw commented 6 years ago

@lachlan-roberts I've just added some test cases for part content to the unit test (see 1352be4308b87b10c04c8c04175e23e34d449f6a)

lachlan-roberts commented 6 years ago

@LukeButtersFunnelback the changes for the faster MultiPart parser have been merged into the main branch. You can test out the changes by setting the multipart/form-data compliance mode to RFC7578 in the server.ini file.

LukeButters commented 6 years ago

Um, n00b question: I don't actually know the literal line that would need to be added. Also, I have Jetty set up programmatically; should I be calling something on my org.eclipse.jetty.server.Server object instead?

lachlan-roberts commented 6 years ago

These lines should be in server.ini; they just need to be uncommented and LEGACY changed to RFC7578.

## multipart/form-data compliance mode of: LEGACY(slow), RFC7578(fast)
# jetty.httpConfig.multiPartFormDataCompliance=LEGACY

Otherwise you should be able to change it by calling setMultiPartFormDataCompliance(MultiPartFormDataCompliance.RFC7578) on the org.eclipse.jetty.server.HttpConfiguration class.

lachlan-roberts commented 6 years ago

You can get a HttpConfiguration instance from the HttpConnectionFactory in the connector. So to set the compliance mode to RFC7578 you can do something like this:

    connector.getConnectionFactory(HttpConnectionFactory.class).getHttpConfiguration().setMultiPartFormDataCompliance(MultiPartFormDataCompliance.RFC7578);
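
For an embedded server, the calls described above can be wired together roughly as follows (a sketch against the jetty-server 9.4 API; the class name and port are arbitrary choices for illustration):

```java
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.MultiPartFormDataCompliance;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;

public class FastMultiPartServer
{
    public static void main(String[] args) throws Exception
    {
        Server server = new Server();

        // Opt in to the RFC7578 (fast) multipart parser on this connector's config.
        HttpConfiguration config = new HttpConfiguration();
        config.setMultiPartFormDataCompliance(MultiPartFormDataCompliance.RFC7578);

        ServerConnector connector = new ServerConnector(server, new HttpConnectionFactory(config));
        connector.setPort(8080);
        server.addConnector(connector);

        server.start();
        server.join();
    }
}
```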

lachlan-roberts commented 6 years ago

@LukeButtersFunnelback any chance to test this?

LukeButters commented 6 years ago

Not yet, but I hope to do it today or tomorrow. (The day you told me this was ready I was going away; I returned today, but I am not sure if I can get to it as I need to respond to a bunch of emails.)

LukeButters commented 6 years ago

OK, so I have done some comparisons:

Before this change: ~1700 per second
After the change: ~2330 per second
Compared to non-multipart: ~3400 per second

I am sending a multipart request which has a very small JSON part (less than 100 bytes) in the first part and some content of 10KB-1MB in the second part. For the non-multipart case I move the data that is in the JSON to the HTTP headers.

It is faster, thanks!!!!

Although it would be nice to be able to get close to what can be done for non-multipart.

I have done some analysis and found that a MultiException is created in org.eclipse.jetty.http.MultiPartFormInputStream.deleteParts() L450. Creating exceptions is very expensive because of the JVM's need to fill in the stack trace. I think the exception should only be made if we actually have an exception to give to the MultiException. I can re-run the checks once that change is in and try to work out if anything else can be done to help it run fast.
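
The lazy-creation fix suggested here could be sketched like this (MultiPartFormInputStream and MultiException are Jetty's classes; the Part interface and helper below are stand-ins to show the pattern, not Jetty code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: accumulate failures lazily instead of pre-allocating an exception
// on every call (stack-trace capture is the expensive part of construction).
class LazyDeleter
{
    interface Part { void delete() throws Exception; }

    static void deleteParts(List<Part> parts) throws Exception
    {
        List<Throwable> failures = null; // allocate only on first failure
        for (Part p : parts)
        {
            try
            {
                p.delete();
            }
            catch (Throwable t)
            {
                if (failures == null)
                    failures = new ArrayList<>();
                failures.add(t);
            }
        }
        // The happy path now constructs no exception at all.
        if (failures != null)
        {
            Exception e = new Exception("error deleting parts");
            for (Throwable t : failures)
                e.addSuppressed(t);
            throw e;
        }
    }
}
```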

With multipart, would it be possible to (once the request is parsed) have access to the byte arrays each part is written to, rather than reading from the InputStream of each part?

janbartel commented 6 years ago

Hi Luke, thanks for testing. Do you have an example of your test data so we can test with that too?

We can look into lazily creating the exception in deleteParts.

Not sure about accessing the byte array - for some requests the data is written to a file so there is no byte array.

gregw commented 6 years ago

@LukeButtersFunnelback I think one of the changes I'd like to make in the next round of the servlet spec is better access to the content of Part - including both raw bytes or characters, without using a blocking API.

However, I'm not sure a simple byte[] getBytes() method is sufficient as that array could be very large.

LukeButters commented 6 years ago

The short answer is I probably shouldn't give you an example of the data I am currently using; however, I think the data is not actually that important in this test.

What is it that you are looking for in the data? Do you want to see how I format the multipart request?

Yes, I think the exception should be created only when it is going to be thrown; perhaps the class needs to be rethought. Do you believe it is worth doing? If not, would you like me to clone and hack some changes to give you an idea of the speed-up that would be gained? @janbartel @lachlan-roberts @gregw

What I would like to avoid is:

LukeButters commented 6 years ago

The unnecessary exception was removed in https://github.com/eclipse/jetty.project/pull/2582 .

Thanks for working on this.