vncorenlp / VnCoreNLP

A Vietnamese natural language processing toolkit (NAACL 2018)

Form too large error with VncoreNLPServer #25

Closed Avi197 closed 4 years ago

Avi197 commented 4 years ago

I got this problem when tokenizing data with the vncorenlp Python wrapper. It works fine until it reaches one particular line in the data file.

org.eclipse.jetty.http.BadMessageException: 400: Unable to parse form content
    at org.eclipse.jetty.server.Request.getParameters(Request.java:380)
    at org.eclipse.jetty.server.Request.getParameter(Request.java:1021)
    at javax.servlet.ServletRequestWrapper.getParameter(ServletRequestWrapper.java:194)
    at spark.Request.queryParams(Request.java:283)
    at spark.http.matching.RequestWrapper.queryParams(RequestWrapper.java:141)
    at vncorenlp.VnCoreNLPServer.handle(VnCoreNLPServer.java:247)
    at vncorenlp.VnCoreNLPServer.lambda$3(VnCoreNLPServer.java:184)
    at spark.ResponseTransformerRouteImpl$1.handle(ResponseTransformerRouteImpl.java:47)
    at spark.http.matching.Routes.execute(Routes.java:61)
    at spark.http.matching.MatcherFilter.doFilter(MatcherFilter.java:130)
    at spark.embeddedserver.jetty.JettyHandler.doHandle(JettyHandler.java:50)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1568)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:530)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalStateException: Form too large: 250544 > 200000
    at org.eclipse.jetty.server.Request.extractFormParameters(Request.java:523)
    at org.eclipse.jetty.server.Request.extractContentParameters(Request.java:461)
    at org.eclipse.jetty.server.Request.getParameters(Request.java:376)

Best regards

datquocnguyen commented 4 years ago

If you are using only VnCoreNLP's word segmenter, you should set annotators="wseg". You might also use the second option (running without a service).
Here is an example of using VnCoreNLP's word segmenter to pre-process raw input texts.

Avi197 commented 4 years ago

I did try both options, with and without the service, and both return the "Unable to parse form content" error. Without the service, I use this line:

    with VnCoreNLP(vncorenlp_file, annotators="wseg", max_heap_size='-Xmx4g') as vncorenlp:

Without the service, it returns

AssertionError: 400: Unable to parse form content

With the service, it returns a more specific error

org.eclipse.jetty.http.BadMessageException: 400: Unable to parse form content

java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalStateException: Form too large: 250544 > 200000

I process the data line by line, so my guess is that one line is a bit too big? It works fine until the script reaches that line.
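One way to check that guess is to estimate how large each line becomes once it is form-encoded, since the "Form too large" message compares the request body size against a limit of 200000. The sketch below is only an approximation under assumptions: it assumes the wrapper sends the text as a URL-encoded form field, and the field name `text` is a hypothetical placeholder, not taken from the wrapper's source. Note that non-ASCII characters grow considerably when percent-encoded, so a Vietnamese line can exceed the limit with far fewer than 200000 characters.

```python
from urllib.parse import urlencode

# Limit reported in the stack trace ("Form too large: 250544 > 200000").
FORM_LIMIT = 200_000


def encoded_form_size(text, field="text"):
    """Approximate size in bytes of a form-encoded body holding `text`.

    `field` is a hypothetical parameter name used for illustration only;
    the wrapper's real request fields are not shown in this thread.
    """
    return len(urlencode({field: text}).encode("ascii"))


def oversized_lines(lines, limit=FORM_LIMIT):
    """Yield (line_number, encoded_size) for lines that would exceed `limit`."""
    for number, line in enumerate(lines, start=1):
        size = encoded_form_size(line)
        if size > limit:
            yield number, size
```

Running `oversized_lines` over the lines of the data file should point at the line that triggers the error, if the assumption about the request format holds.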

tienthanhdhcn commented 4 years ago

Have you downloaded the model and put it in the same folder as the jar file?

Avi197 commented 4 years ago

Yes. As I said above, it works fine until the script reaches a specific line in the data file, then it returns the error

java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalStateException: Form too large: 250544 > 200000

datquocnguyen commented 4 years ago

Then you might want to split this file into multiple smaller ones, and concatenate them later after performing word segmentation. Or you can use the original version of RDRSegmenter.
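The split-then-concatenate suggestion can be sketched in Python as follows. This is a minimal illustration, not the wrapper's own behaviour: `split_for_form_limit` and `segment_in_chunks` are hypothetical helper names, the `segment` argument stands in for whatever segmentation call you use (for example the wrapper's tokenizer), and the default limit of 150000 is an assumed safety margin below the 200000 reported in the error. Chunks are cut at whitespace, so a single whitespace-free token larger than the limit would still be sent whole, and cutting mid-sentence may affect segmentation near chunk boundaries; splitting at sentence boundaries would be safer where that is practical.

```python
from urllib.parse import quote_plus


def split_for_form_limit(text, limit=150_000):
    """Split `text` at whitespace into chunks whose URL-encoded size stays within `limit`."""
    chunks, current, current_size = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space, which is encoded as '+'.
        size = len(quote_plus(word)) + 1
        if current and current_size + size > limit:
            chunks.append(" ".join(current))
            current, current_size = [], 0
        current.append(word)
        current_size += size
    if current:
        chunks.append(" ".join(current))
    return chunks


def segment_in_chunks(text, segment, limit=150_000):
    """Apply `segment` to each chunk and concatenate the resulting lists.

    `segment` is any callable that takes a string and returns a list
    (for example a list of tokenized sentences).
    """
    results = []
    for chunk in split_for_form_limit(text, limit):
        results.extend(segment(chunk))
    return results
```

With the Python wrapper, `segment` could be the tokenizer method of an open `VnCoreNLP` instance, so that each oversized line is processed in several smaller requests and the outputs are joined afterwards.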

tienthanhdhcn commented 4 years ago

I think the problem is with Jetty and the Python wrapper. Your input text is probably too long, and you should split it into smaller chunks. If it is a wrapper-related problem, I refer you to https://github.com/dnanhkhoa/python-vncorenlp.

datquocnguyen commented 4 years ago

Closed! It's because this issue is related to the wrapper, not VnCoreNLP itself.