Closed sundbp closed 6 years ago
Hi @sundbp, thank you for the report. This seems like an issue in our Posix NIO implementation. We'll take a look at it. Can you provide the app that you are trying to run so we can reproduce?
I'll see if I can cut it down to a small repro case with instructions.
A cut down version that replicates it can be found here: https://gitlab.com/sundbp/native-image-grpc-test
The README has instructions for how to reproduce. Please let me know if you have issues reproducing it!
@sundbp thank you for providing a cut down version and for the detailed steps to reproduce the issue. We'll run it and let you know what we find.
Did you manage to reproduce from that example repo? Thanks
I was able to reproduce, thank you! Looking at the log output of ./create-exe.sh I see a lot of warnings of the form:
RecomputeFieldValue.FieldOffset automatic substitution failed. The automatic substitution registration was attempted because a call to sun.misc.Unsafe.objectFieldOffset(Field) was detected in the static initializer of com.google.protobuf.UnsafeUtil. Add a RecomputeFieldValue.FieldOffset manual substitution for com.google.protobuf.UnsafeUtil.
These are generated by SubstrateVM and are quite serious. (As a side note: we should increase the visibility of these kinds of messages, especially when they are intermingled with log messages from the static initializers of the library that we are trying to build, as happens with the Netty DEBUG log messages.) The reason for these messages is that com.google.protobuf.UnsafeUtil uses the Java unsafe API, and this needs a special configuration to work on SubstrateVM. Initially I thought that this might be the reason for the failure: writing at a wrong offset results in memory corruption. (By the way, we will publish an article next week that includes details about the use of unsafe on SubstrateVM.) However, after fixing these I still get the java.io.IOException: Bad address, so I guess the problem is really in our PosixJavaNIOSubstitutions implementation.
Great - thanks for the progress report.
Question: Could I have done anything myself to fix the UnsafeUtil errors/warnings? Short of understanding all of SVM :) I'd love to learn enough to sort these kinds of things out for myself going forward.
Definitely, we have a mechanism that allows you to do that, although the end goal is to automate the process of detecting and patching unsafe operations. We have a blog article scheduled for next week that goes into details.
Excellent - looking forward to reading the blog article! (and to following up on what goes wrong with PosixJavaNio!)
I've now read the blog post - great! (referring to this post: https://medium.com/graalvm/instant-netty-startup-using-graalvm-native-image-generation-ed6f14ff7692)
@cstancu would you mind making a PR with the unsafe substitutions you made for my example repo? I'd love to see if it matches up with how I'd do it after reading the blog post.
Actually the substitutions that I used were quite simple. For protobuf I cheated and essentially disabled unsafe memory access by just returning null from com.google.protobuf.UnsafeUtil.getUnsafe(). Protobuf is built with a failsafe mechanism for when unsafe memory access is not available on a platform. For JCTools it's just the same substitution as in the article. Here is the entire substitution file:
import com.oracle.svm.core.annotate.Alias;
import com.oracle.svm.core.annotate.RecomputeFieldValue;
import com.oracle.svm.core.annotate.RecomputeFieldValue.Kind;
import com.oracle.svm.core.annotate.Substitute;
import com.oracle.svm.core.annotate.TargetClass;

@TargetClass(className = "com.google.protobuf.UnsafeUtil")
final class Target_com_google_protobuf_UnsafeUtil {
    @Substitute
    static sun.misc.Unsafe getUnsafe() {
        return null;
    }
}

@TargetClass(className = "io.netty.util.internal.shaded.org.jctools.util.UnsafeRefArrayAccess")
final class Target_io_netty_util_internal_shaded_org_jctools_util_UnsafeRefArrayAccess {
    @Alias @RecomputeFieldValue(kind = Kind.ArrayIndexShift, declClass = Object[].class) //
    public static int REF_ELEMENT_SHIFT;
}

public class SubstrateVMSubstitutions {
}
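Returning null from getUnsafe() works because protobuf guards every unsafe fast path behind an availability check. A minimal, self-contained sketch of that failsafe pattern (the class and method names below are illustrative stand-ins, not protobuf's actual internals):

```java
// Simplified sketch of the "failsafe" pattern described above: if the
// platform (or a substitution) reports no Unsafe, fall back to plain,
// bounds-checked array access. Names are illustrative, not protobuf's
// real internals.
public class UnsafeFallbackDemo {
    // Stands in for com.google.protobuf.UnsafeUtil.getUnsafe() returning
    // null under the SubstrateVM substitution shown above.
    static Object getUnsafe() {
        return null;
    }

    static final boolean HAS_UNSAFE = getUnsafe() != null;

    // Reads a byte either via the Unsafe fast path or via the safe fallback.
    static byte getByte(byte[] data, int index) {
        if (HAS_UNSAFE) {
            throw new IllegalStateException("unreachable in this sketch");
        }
        return data[index]; // safe fallback path
    }

    public static void main(String[] args) {
        byte[] data = {10, 20, 30};
        System.out.println(getByte(data, 1)); // fallback path prints 20
    }
}
```

With the substitution in place, HAS_UNSAFE is false and every access goes through the bounds-checked fallback, which is why disabling unsafe entirely is safe here.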
Ah! Nifty to disable it all! Thanks.
@cstancu I'm struggling to get a build going with those substitutions included. If I use graal as distributed in 1.0.0-rc1 I don't have the --report-unsupported-elements-at-runtime option available and the build fails with:
Caused by: java.lang.NoClassDefFoundError: io/netty/handler/ssl/ReferenceCountedOpenSslServerContext$OpenSslSniHostnameMatcher
at java.lang.Class.getDeclaringClass0(Native Method)
at java.lang.Class.getDeclaringClass(Class.java:1235)
at java.lang.Class.getEnclosingClass(Class.java:1277)
at jdk.vm.ci.hotspot.HotSpotResolvedObjectTypeImpl.getEnclosingType(HotSpotResolvedObjectTypeImpl.java:894)
at jdk.vm.ci.hotspot.HotSpotResolvedObjectTypeImpl.getEnclosingType(HotSpotResolvedObjectTypeImpl.java:58)
at com.oracle.graal.pointsto.meta.AnalysisType.getEnclosingType(AnalysisType.java:917)
at com.oracle.svm.hosted.analysis.Inflation.checkType(Inflation.java:138)
at java.lang.Iterable.forEach(Iterable.java:75)
at java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
at com.oracle.svm.hosted.analysis.Inflation.checkObjectGraph(Inflation.java:123)
at com.oracle.graal.pointsto.BigBang.checkObjectGraph(BigBang.java:580)
at com.oracle.graal.pointsto.BigBang.finish(BigBang.java:552)
at com.oracle.svm.hosted.NativeImageGenerator.doRun(NativeImageGenerator.java:653)
at com.oracle.svm.hosted.NativeImageGenerator.lambda$run$0(NativeImageGenerator.java:381)
at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
If I try to use native-image from git (after installing the svm.jar from that build), using the --report-unsupported-elements-at-runtime option, I get:
error: Error encountered while parsing io.netty.util.internal.InternalThreadLocalMap.expandIndexedVariableTableAndSet(int, java.lang.Object)
Parsing context:
parsing io.netty.util.internal.InternalThreadLocalMap.setIndexedVariable(InternalThreadLocalMap.java:297)
parsing io.netty.util.concurrent.FastThreadLocal.registerCleaner(FastThreadLocal.java:159)
parsing io.netty.util.concurrent.FastThreadLocal.get(FastThreadLocal.java:147)
parsing io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:319)
parsing io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
parsing io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
parsing io.netty.handler.ssl.ReferenceCountedOpenSslContext.toBIO(ReferenceCountedOpenSslContext.java:797)
parsing io.netty.handler.ssl.ReferenceCountedOpenSslContext.setKeyMaterial(ReferenceCountedOpenSslContext.java:715)
parsing io.netty.handler.ssl.ReferenceCountedOpenSslClientContext.newSessionContext(ReferenceCountedOpenSslClientContext.java:100)
parsing io.netty.handler.ssl.ReferenceCountedOpenSslClientContext.<init>(ReferenceCountedOpenSslClientContext.java:63)
parsing io.netty.handler.ssl.SslContext.newClientContextInternal(SslContext.java:781)
parsing io.netty.handler.ssl.SslContextBuilder.build(SslContextBuilder.java:446)
parsing io.grpc.netty.NettyChannelBuilder$NettyTransportFactory$DefaultNettyTransportCreationParamsFilterFactory.<init>(NettyChannelBuilder.java:558)
parsing io.grpc.netty.NettyChannelBuilder$NettyTransportFactory$DefaultNettyTransportCreationParamsFilterFactory.<init>(NettyChannelBuilder.java:551)
parsing io.grpc.netty.NettyChannelBuilder$NettyTransportFactory.<init>(NettyChannelBuilder.java:489)
parsing io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:337)
parsing io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:393)
parsing grpc_test.client.main$create_client.invokeStatic(main.clj:17)
parsing grpc_test.client.main$create_client.invoke(main.clj:15)
parsing clojure.tools.logging.impl$fn__544$G__528__551.invoke(impl.clj:25)
parsing grpc_test.client.main$_main$fn__693.invoke(main.clj:72)
parsing clojure.lang.AFn.run(AFn.java:22)
parsing com.oracle.svm.core.jdk.RuntimeSupport.executeHooks(RuntimeSupport.java:106)
parsing com.oracle.svm.core.jdk.RuntimeSupport.executeStartupHooks(RuntimeSupport.java:83)
parsing com.oracle.svm.core.JavaMainWrapper.run(JavaMainWrapper.java:170)
parsing Lcom/oracle/svm/core/code/CEntryPointCallStubs;.com_002eoracle_002esvm_002ecore_002eJavaMainWrapper_002erun_0028int_002corg_002egraalvm_002enativeimage_002ec_002etype_002eCCharPointerPointer_0029(generated:0)
Original error: org.graalvm.compiler.graph.GraalGraphError: java.lang.AbstractMethodError: Method com/oracle/graal/pointsto/nodes/AnalysisArraysCopyOfNode.length()Lorg/graalvm/compiler/nodes/ValueNode; is abstract
at node: 24|ArrayLength
at org.graalvm.compiler.phases.common.CanonicalizerPhase$Instance.tryCanonicalize(CanonicalizerPhase.java:338)
at org.graalvm.compiler.phases.common.CanonicalizerPhase$Instance.processNode(CanonicalizerPhase.java:260)
at org.graalvm.compiler.phases.common.CanonicalizerPhase$Instance.processWorkSet(CanonicalizerPhase.java:239)
at org.graalvm.compiler.phases.common.CanonicalizerPhase$Instance.run(CanonicalizerPhase.java:209)
at org.graalvm.compiler.phases.common.CanonicalizerPhase.run(CanonicalizerPhase.java:123)
at org.graalvm.compiler.phases.common.CanonicalizerPhase.run(CanonicalizerPhase.java:64)
at org.graalvm.compiler.phases.BasePhase.apply(BasePhase.java:195)
at org.graalvm.compiler.phases.BasePhase.apply(BasePhase.java:137)
at com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.parse(MethodTypeFlowBuilder.java:205)
at com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.apply(MethodTypeFlowBuilder.java:319)
at com.oracle.graal.pointsto.flow.MethodTypeFlow.doParse(MethodTypeFlow.java:308)
at com.oracle.graal.pointsto.flow.MethodTypeFlow.ensureParsed(MethodTypeFlow.java:298)
at com.oracle.graal.pointsto.flow.MethodTypeFlow.addContext(MethodTypeFlow.java:105)
at com.oracle.graal.pointsto.flow.SpecialInvokeTypeFlow.onObservedUpdate(InvokeTypeFlow.java:419)
at com.oracle.graal.pointsto.flow.TypeFlow.notifyObservers(TypeFlow.java:345)
at com.oracle.graal.pointsto.flow.TypeFlow.update(TypeFlow.java:387)
at com.oracle.graal.pointsto.BigBang$2.run(BigBang.java:498)
at com.oracle.graal.pointsto.util.CompletionExecutor.lambda$execute$0(CompletionExecutor.java:172)
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.lang.AbstractMethodError: Method com/oracle/graal/pointsto/nodes/AnalysisArraysCopyOfNode.length()Lorg/graalvm/compiler/nodes/ValueNode; is abstract
at com.oracle.graal.pointsto.nodes.AnalysisArraysCopyOfNode.length(AnalysisArraysCopyOfNode.java)
at org.graalvm.compiler.nodes.util.GraphUtil.arrayLength(GraphUtil.java:677)
at org.graalvm.compiler.nodes.java.ArrayLengthNode.readArrayLength(ArrayLengthNode.java:122)
at org.graalvm.compiler.nodes.java.ArrayLengthNode.canonical(ArrayLengthNode.java:86)
at org.graalvm.compiler.nodes.java.ArrayLengthNode.canonical(ArrayLengthNode.java:51)
at org.graalvm.compiler.graph.spi.Canonicalizable$Unary.canonical(Canonicalizable.java:97)
at org.graalvm.compiler.phases.common.CanonicalizerPhase$Instance.tryCanonicalize(CanonicalizerPhase.java:333)
... 22 more
Is that somehow related to compiling with the graal-sdk.jar from one version (1.0.0-rc1) but using native-image from another?
I haven't worked out how to use the git-built graal for both building my jar and doing the native image, I'm afraid :(
I've pushed an update to the repo including these new changes (using 1.0.0-rc1): https://gitlab.com/sundbp/native-image-grpc-test
How did you build it including the substitutions?
Realized that error was because of a recent commit to svm a few days ago. Rolled back graal git to commit 1b109ea1c9ed59239c5e266ee69ba7350fd9ce37. Then I'm able to build including the substitutions.
I've pushed updates to the reproduction repo at https://gitlab.com/sundbp/native-image-grpc-test
Like you describe, I still get the same error as before in PosixJavaNIOSubstitutions.
Dumping packets, we can see for the non-native-image case:
Native image case:
Looks like the initial handshake takes place in both sessions, but then in the native-image case it ends up closing down straight away. That seems consistent with the stack trace, which roughly says that the connection was opened, then as part of negotiating the protocol (at the gRPC level) it does a write+flush, which triggers a write to the socket and it breaks.
Added a printout of the args passed to writev before it crashes:
System.out.println("Address is " + String.format("0x%X", address) + " with len " + len + ". fdo is " + fdo.valid() + " and fd is " + fd);
gives:
Address is 0x7FE91A602DB0 with len 2. fdo is true and fd is 27
So that doesn't obviously look "bad".
One thing that is a little suspicious to me is that fd always seems to be 27 (regardless of whether I open some other files, or run some nc -lk etc. on the same machine). I'm not sure if that is normal or not.
Starting to get to the end of my ideas of things to look at.
Happy to keep digging if anyone has any input on sensible next steps!
Another experiment - check lsof of the process at three points: before I've created the gRPC client, after I've created it but before I've sent any API request over it, and after it has crashed.
Before creating the gRPC client:
» lsof -p (pgrep cli-client) 14:32:33
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cli-clien 20537 sundbp cwd DIR 1,4 640 8604128240 /Users/sundbp/dev/native-image-grpc-test
cli-clien 20537 sundbp txt REG 1,4 34982976 8607168028 /Users/sundbp/dev/native-image-grpc-test/cli-client
cli-clien 20537 sundbp txt REG 1,4 841440 8604675997 /usr/lib/dyld
cli-clien 20537 sundbp txt REG 1,4 1175293952 8604684784 /private/var/db/dyld/dyld_shared_cache_x86_64h
cli-clien 20537 sundbp 0u CHR 16,10 0t832405 8425 /dev/ttys010
cli-clien 20537 sundbp 1u CHR 16,10 0t832405 8425 /dev/ttys010
cli-clien 20537 sundbp 2u CHR 16,10 0t832405 8425 /dev/ttys010
After creating client, but before API call:
» lsof -p (pgrep cli-client) 14:35:33
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cli-clien 20537 sundbp cwd DIR 1,4 640 8604128240 /Users/sundbp/dev/native-image-grpc-test
cli-clien 20537 sundbp txt REG 1,4 34982976 8607168028 /Users/sundbp/dev/native-image-grpc-test/cli-client
cli-clien 20537 sundbp txt REG 1,4 841440 8604675997 /usr/lib/dyld
cli-clien 20537 sundbp txt REG 1,4 1175293952 8604684784 /private/var/db/dyld/dyld_shared_cache_x86_64h
cli-clien 20537 sundbp 0u CHR 16,10 0t832437 8425 /dev/ttys010
cli-clien 20537 sundbp 1u CHR 16,10 0t832437 8425 /dev/ttys010
cli-clien 20537 sundbp 2u CHR 16,10 0t832437 8425 /dev/ttys010
cli-clien 20537 sundbp 3 PIPE 0xf348d64c968cceb7 16384 ->0xf348d64c968cd6f7
cli-clien 20537 sundbp 4 PIPE 0xf348d64c968cd6f7 16384 ->0xf348d64c968cceb7
cli-clien 20537 sundbp 5u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 6 PIPE 0xf348d64c968cb837 16384 ->0xf348d64c968cc5b7
cli-clien 20537 sundbp 7 PIPE 0xf348d64c968cc5b7 16384 ->0xf348d64c968cb837
cli-clien 20537 sundbp 8u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 9 PIPE 0xf348d64c968ccf77 16384 ->0xf348d64caea635b7
cli-clien 20537 sundbp 10 PIPE 0xf348d64caea635b7 16384 ->0xf348d64c968ccf77
cli-clien 20537 sundbp 11u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 12 PIPE 0xf348d64caea64877 16384 ->0xf348d64caea62a77
cli-clien 20537 sundbp 13 PIPE 0xf348d64caea62a77 16384 ->0xf348d64caea64877
cli-clien 20537 sundbp 14u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 15 PIPE 0xf348d64caea63c77 16384 ->0xf348d64caea62ef7
cli-clien 20537 sundbp 16 PIPE 0xf348d64caea62ef7 16384 ->0xf348d64caea63c77
cli-clien 20537 sundbp 17u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 18 PIPE 0xf348d64caea647b7 16384 ->0xf348d64caea62fb7
cli-clien 20537 sundbp 19 PIPE 0xf348d64caea62fb7 16384 ->0xf348d64caea647b7
cli-clien 20537 sundbp 20u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 21 PIPE 0xf348d64caea64577 16384 ->0xf348d64caea64db7
cli-clien 20537 sundbp 22 PIPE 0xf348d64caea64db7 16384 ->0xf348d64caea64577
cli-clien 20537 sundbp 23u KQUEUE count=0, state=0x8
cli-clien 20537 sundbp 24 PIPE 0xf348d64c9a770037 16384 ->0xf348d64c9a76ff77
cli-clien 20537 sundbp 25 PIPE 0xf348d64c9a76ff77 16384 ->0xf348d64c9a770037
cli-clien 20537 sundbp 26u KQUEUE count=0, state=0x8
After crash (before process exits):
» lsof -p (pgrep cli-client) 14:36:25
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cli-clien 20537 sundbp cwd DIR 1,4 640 8604128240 /Users/sundbp/dev/native-image-grpc-test
cli-clien 20537 sundbp txt REG 1,4 34982976 8607168028 /Users/sundbp/dev/native-image-grpc-test/cli-client
cli-clien 20537 sundbp txt REG 1,4 841440 8604675997 /usr/lib/dyld
cli-clien 20537 sundbp txt REG 1,4 1175293952 8604684784 /private/var/db/dyld/dyld_shared_cache_x86_64h
cli-clien 20537 sundbp 0u CHR 16,10 0t839452 8425 /dev/ttys010
cli-clien 20537 sundbp 1u CHR 16,10 0t839452 8425 /dev/ttys010
cli-clien 20537 sundbp 2u CHR 16,10 0t839452 8425 /dev/ttys010
cli-clien 20537 sundbp 28u unix 0xf348d64c84366b57 0t0 ->(none)
Not making me that much wiser, but at least I see how 27 is the "next fd in line" (and obviously FDs are per-process, not global, as I temporarily was thinking).
Given the EFAULT errno (indicated by "Bad address"), it seems relatively conclusive to me that it's the pointers passed via the iovec to writev that are the problem.
If I follow the path from netty to SocketChannelImpl to IOUtil to SocketDispatcher and finally to the substitutions, then it seems the most relevant bit is here: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/ch/IOUtil.java/#148
That's where ByteBuffers are translated to the IOVecWrapper where the native memory lives.
The SocketDispatcher after that just passes it on to writev0: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/ch/SocketDispatcher.java/#51
So the final bit of code I can see is the WordFactory.pointer(address) call to get the iovec that is passed on to writev.
WordFactory.pointer seems to use WordBoxFactory.box with the address that we passed in. This must be implemented in BoxFactoryImpl in compiler/src/org.graalvm.compiler.word/src/org/graalvm/compiler/word/Word.java if I understand correctly, and this uses HostedWord.boxLong with the address, which just returns a new HostedWord(address).
It doesn't strike me that this should have any impact on the pointer finally ending up in the writev call.
So I'm at the end of my road here I think - input most welcome! @cstancu @Peter-B-Kessler
A quick check that the iovec address doesn't get altered:
System.out.println("Address is " + String.format("0x%X", address) + " with len " + len + ". fdo is " + fdo.valid() + " and fd is " + fd);
System.out.println("the iov null check: " + iov.isNull() + " and the raw value: " + String.format("0x%X", iov.rawValue()));
gives:
Address is 0x7FEBA7F001E0 with len 2. fdo is true and fd is 27
the iov null check: false and the raw value: 0x7FEBA7F001E0
Where are your System.out.printlns shown just above? Are they in Target_sun_nio_ch_FileDispatcherImpl.writev0(FileDescriptor, long, int)?
Is the address/iov a valid address? Is it from a new IOVecWrapper, or reusing a cached one? I assume the Deallocator has not already gone off, or the address would be 0. If you follow the address, does it look like a (C) list of struct iovec* of the right length?
Can you follow the arguments through SocketDispatcher.writev(FileDescriptor, long, int) to Target_sun_nio_ch_FileDispatcherImpl.writev0(FileDescriptor, long, int)? That is, making sure that address and iov have the same (valid) value. They should.
From there we go directly to Uio.writev(int, iovec, int), which just calls the C writev function on your platform.
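For reference when following that address: on an LP64 POSIX platform, struct iovec is two native words (an 8-byte iov_base pointer followed by an 8-byte iov_len), so the address handed to writev should point at len such 16-byte entries. A sketch of walking that layout, using a ByteBuffer to stand in for the native memory since plain Java cannot dereference a raw address:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of walking an LP64 struct iovec array: each entry is 16 bytes,
// an 8-byte iov_base pointer followed by an 8-byte iov_len. A ByteBuffer
// stands in for the native memory at `address`; real inspection would
// need native code or Unsafe.
public class IovecWalkDemo {
    static final int IOVEC_SIZE = 16; // 8 (iov_base) + 8 (iov_len) on LP64

    // Returns "base/len" strings for each entry, the shape writev(2) sees.
    static String[] walk(ByteBuffer iovMemory, int len) {
        String[] out = new String[len];
        for (int i = 0; i < len; i++) {
            long base = iovMemory.getLong(i * IOVEC_SIZE);
            long iovLen = iovMemory.getLong(i * IOVEC_SIZE + 8);
            out[i] = String.format("0x%X/%d", base, iovLen);
        }
        return out;
    }

    public static void main(String[] args) {
        // Fabricate two entries with plausible values, as writev would see them.
        ByteBuffer mem = ByteBuffer.allocate(2 * IOVEC_SIZE).order(ByteOrder.nativeOrder());
        mem.putLong(0, 0x7FE91A602DB0L).putLong(8, 64L);    // entry 0
        mem.putLong(16, 0x7FE91A602E00L).putLong(24, 128L); // entry 1
        for (String e : walk(mem, 2)) {
            System.out.println(e);
        }
    }
}
```

EFAULT would mean either the entry array itself or one of the iov_base pointers lies outside the process's mapped address space.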
What is the returned result from Uio.writev(int, iovec, int)? Is it an indication of success (>0), or an error code (<0)? See PosixJavaNIOSubstitutions.convertLongReturnVal(long, boolean) for details on the interpretation of the result.
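The interpretation convertLongReturnVal performs can be modeled in plain Java. This is a simplified sketch of the usual JDK convention (non-negative results pass through, EAGAIN/EINTR map to the special IOStatus codes, anything else becomes an IOException), not SubstrateVM's actual code; the errno constants are the common Linux values:

```java
import java.io.IOException;

// Simplified model of how a raw writev(2) return value is interpreted:
// non-negative results pass through; -1 consults errno, mapping EAGAIN
// and EINTR to the JDK's special IOStatus codes and anything else
// (e.g. EFAULT = 14, "Bad address") to an IOException.
public class ReturnValDemo {
    static final int EAGAIN = 11, EINTR = 4, EFAULT = 14; // common Linux values
    static final long IOS_UNAVAILABLE = -2, IOS_INTERRUPTED = -3;

    static long convertLongReturnVal(long result, int errno) throws IOException {
        if (result >= 0) return result;
        if (errno == EAGAIN) return IOS_UNAVAILABLE; // would block
        if (errno == EINTR) return IOS_INTERRUPTED;  // interrupted syscall
        throw new IOException(strerror(errno));      // the path seen in this issue
    }

    // Tiny stand-in for strerror(3).
    static String strerror(int errno) {
        return errno == EFAULT ? "Bad address" : "errno " + errno;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(convertLongReturnVal(42, 0)); // success passes through
        try {
            convertLongReturnVal(-1, EFAULT);
        } catch (IOException e) {
            System.out.println("IOException: " + e.getMessage());
        }
    }
}
```

Under this convention a "Bad address" IOException implies writev itself returned -1 with errno 14, which is what the traces later in the thread confirm.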
Thanks for your diagnosis of this problem.
Yes, more exactly line 1355 in substratevm/src/com.oracle.svm.core.posix/src/com/oracle/svm/core/posix/PosixJavaNIOSubstitutions.java.
I'm not quite sure how to investigate the address/iov. How would I attempt to print out the contents of the memory at that address?
I'm not a Java developer and I'm not using any IDE/debugger - happy for a brief introduction to whatever tooling you'd use here!
The relevant code for setting up the IOVecWrapper must be lines 110 to 148 here: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/sun/nio/ch/IOUtil.java/#110 Without some runtime info I can't determine whether it's cached or not.
I can't seem to find the source for SocketDispatcher anywhere on the graal side, so I'm not sure how I'd modify it to be able to observe the values passing through there.
Uio.writev definitely returns n < 0, as we hit the case where we call throwIOExceptionWithLastError() on line 305 (in PosixJavaNioSubstitutions). That's where it indicates we got errno EFAULT -> "Bad address".
Sounds like my current tooling may need an upgrade to dig a level deeper..
Since you are comfortable with modifying the source and rebuilding, but not debugging, for "tooling" I would recommend adding tracing code.
There is nothing for SocketDispatcher on the graal side. We use the straight http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/556b17038b5c/src/solaris/classes/sun/nio/ch/SocketDispatcher.java, which just trampolines to the corresponding native methods of FileDispatcherImpl (e.g., http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/556b17038b5c/src/solaris/native/sun/nio/ch/FileDispatcherImpl.c#l128), which we do substitute.
What do you get if you replace Target_sun_nio_ch_FileDispatcherImpl.writev0(FileDescriptor, long, int) with
@Substitute
private static long writev0(FileDescriptor fdo, long address, int len) throws IOException {
    final Log trace = Log.log().string("[Target_sun_nio_ch_FileDispatcherImpl.writev0:");
    int fd = fdval(fdo);
    trace.string(" fd: ").signed(fd);
    iovec iov = WordFactory.pointer(address);
    trace.string(" iov: ").hex(iov);
    final SignedWord writevResult = writev(fd, iov, len);
    trace.string(" writevResult: ").signed(writevResult);
    trace.string(" errno: ").signed(Errno.errno());
    final long result = convertLongReturnVal(writevResult, false);
    trace.string(" result: ").signed(result);
    trace.string("]").newline();
    return result;
}
(with the usual caveats about code written in a post, not in an IDE. :-) Log is our low-level printf-debugging mechanism, from when we did not have System.out.println working.
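For readers without the SubstrateVM sources handy, the fluent call shape of Log can be mimicked in plain Java. This is only an illustrative stand-in that appends to a StringBuilder; the real com.oracle.svm.core.log.Log writes through low-level I/O so it works before System.out does:

```java
// Plain-Java sketch of the fluent Log tracing style used above. This
// mimics the call shape (string/signed/hex/newline) but just appends to
// a StringBuilder; it is not SubstrateVM's actual implementation.
public class TraceLogDemo {
    private final StringBuilder out = new StringBuilder();

    public TraceLogDemo string(String s) { out.append(s); return this; }
    public TraceLogDemo signed(long v)   { out.append(v); return this; }
    public TraceLogDemo hex(long v)      { out.append("0x").append(Long.toHexString(v)); return this; }
    public TraceLogDemo newline()        { out.append('\n'); return this; }

    public String contents() { return out.toString(); }

    public static void main(String[] args) {
        TraceLogDemo trace = new TraceLogDemo();
        trace.string("[writev0:").string(" fd: ").signed(27)
             .string(" iov: ").hex(0x7fc3bec00c20L).string("]").newline();
        System.out.print(trace.contents());
    }
}
```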
If writevResult is < 0, then I suspect you will never get past the call to convertLongReturnVal because you will get an IOException thrown at you. (You could catch the exception, log it, and rethrow it. But that would not tell you anything new.)
Rummaging through the fields of iov as if it were a C struct iovec* would be more work. We do have Uio.iovec, which would let you get at the iov_base and iov_len fields, but nothing that would help you find out which of the addresses is bad (the struct iovec* in the C array, or the base addresses in the struct iovecs themselves). As you say, the assembling of the iov happens in IOUtil.write, with the storage allocated (and freed!) in IOVecWrapper.
@Peter-B-Kessler I ended up first sick and then busy - sorry about delay.
I ran through that trace:
[Target_sun_nio_ch_FileDispatcherImpl.writev0: fd: 27 iov: 0x7fc3bec00c20 writevResult: -1 errno: 14Exception in thread "main" java.lang.reflect.InvocationTargetException
...
The call to convertLongReturnVal is where the exception is thrown on unclean returns, so it's natural we don't see the result: part of the trace.
As expected, it shows nothing I didn't already know from the other traces etc. that I've done.
When I say I'm not using a debugger, that doesn't mean I wouldn't be happy to give it a whirl. It's just not part of my standard day-to-day setup when I deal with Clojure. If you have suggestions for what to use and rough guidelines for inspecting the iovec struct, I'd be happy to give it a go.
It does strike me that that's what we'll need to get to the next step of understanding why we get the bad address - to then be able to work out why/when/how it ends up "bad" (or whether it starts "bad").
@Peter-B-Kessler happy to dig more - incl debuggers etc. If you have the time to briefly describe what you'd do next I'm happy to try to run with it.
I seem to have run into the exact same issue with the latest master build of substratevm:
Aug 02, 2018 5:49:17 PM io.vertx.core.net.impl.ConnectionBase
SEVERE: Unhandled exception
java.io.IOException: Bad address
at java.lang.Throwable.<init>(Throwable.java:265)
at java.lang.Exception.<init>(Exception.java:66)
at java.io.IOException.<init>(IOException.java:58)
at com.oracle.svm.core.posix.PosixJavaNIOSubstitutions.throwIOExceptionWithLastError(PosixJavaNIOSubstitutions.java:236)
at com.oracle.svm.core.posix.PosixJavaNIOSubstitutions.convertLongReturnVal(PosixJavaNIOSubstitutions.java:304)
at com.oracle.svm.core.posix.PosixJavaNIOSubstitutions.convertLongReturnVal(PosixJavaNIOSubstitutions.java:285)
at com.oracle.svm.core.posix.PosixJavaNIOSubstitutions$Target_sun_nio_ch_FileDispatcherImpl.writev0(PosixJavaNIOSubstitutions.java:1401)
at sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
at sun.nio.ch.IOUtil.write(IOUtil.java:148)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:504)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:418)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.vertx.core.net.impl.ConnectionBase.endReadAndFlush(ConnectionBase.java:83)
at io.vertx.core.net.impl.VertxHandler$$Lambda$968/35805262.run(Unknown Source)
at io.vertx.core.impl.ContextImpl.lambda$wrapTask$2(ContextImpl.java:337)
at io.vertx.core.impl.ContextImpl$$Lambda$665/1290664511.run(Unknown Source)
at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:195)
at io.vertx.core.net.impl.VertxHandler.channelReadComplete(VertxHandler.java:136)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:398)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:380)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelReadComplete(AbstractChannelHandlerContext.java:373)
at io.netty.handler.codec.ByteToMessageDecoder.channelReadComplete(ByteToMessageDecoder.java:324)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:398)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:380)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelReadComplete(AbstractChannelHandlerContext.java:373)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelReadComplete(DefaultChannelPipeline.java:1364)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:398)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:380)
at io.netty.channel.DefaultChannelPipeline.fireChannelReadComplete(DefaultChannelPipeline.java:941)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:146)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
at com.oracle.svm.core.posix.thread.PosixJavaThreads.pthreadStartRoutine(PosixJavaThreads.java:238)
Does this failure happen on every call to sun.nio.ch.SocketChannelImpl.write (or methods called from there), or just some of them? That is: am I looking for an intermittent failure or a solid failure? I guess if it were a solid failure someone would have presented a reproducing test case.
Does it only happen when sun.nio.ch.IOUtil.write
is called when ByteBuffer[] srcs
has more than one element? It looks like maybe there is not a lot of visibility in the netty
levels.
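The multi-element case asked about above can be exercised from plain JDK code: a gathering write through FileChannel.write(ByteBuffer[]) goes through the same sun.nio.ch.IOUtil.write path, which builds one struct iovec per buffer and calls writev(2) on the native side. A minimal sketch (file-backed rather than socket-backed, but the same gathering path):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class GatherDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("gather", ".bin");
        // Two source buffers: the multi-element case under discussion.
        ByteBuffer[] srcs = {
            ByteBuffer.wrap("hello ".getBytes(StandardCharsets.US_ASCII)),
            ByteBuffer.wrap("world".getBytes(StandardCharsets.US_ASCII)),
        };
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            // Gathering write: the JDK builds one iovec per buffer
            // and hands the array to writev(2).
            long written = ch.write(srcs);
            System.out.println(written); // 11
        }
        System.out.println(new String(Files.readAllBytes(tmp), StandardCharsets.US_ASCII));
        Files.delete(tmp);
    }
}
```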
I can try to write some iovec verification code. I can only add tracing code to our substitutions. That will not fix the problem, but it might give us some idea about what is wrong.
Has anyone who hits this run with -XX:+PrintGC -XX:+PrintGCTimeStamps to see if this failure follows closely after a garbage collection? This is just general paranoia on my part, what with raw pointers flying around. They should all point to DirectBuffer memory (that is, not moved by an inopportune collection), and they should not be freed by Cleaners, but if we could rule that out it would be good.
In my case it's 100% reproducible: I start an http server, and every time I connect to it I get this error. I can try to make the test smaller, but otherwise I can also give you the jar if it helps?
I haven't tried to exercise the function through other code paths than the one through netty that's shown in my reproduction project mentioned elsewhere in this issue. For that reproduction case it's 100% reproducible as far as I can tell.
I've managed to reproduce this with a very small vertx use-case, where it only happens if I use the Chunked HTTP encoding.
Awesome. Can you share it in a public git repository?
Cool - and setting setChunked(false) instead makes it pass, right?
Sure: https://github.com/FroMage/native-vertx-chunked-fail
When I build an image of native-vertx-chunked-fail I see
$ mx native-image -g --report-unsupported-elements-at-runtime -Dio.netty.noUnsafe=true -H:EnableURLProtocols=http -Djava.net.preferIPv4Stack=true -H:ReflectionConfigurationFiles=../../native-vertx-chunked-fail-master/reflect.json -jar ../../native-vertx-chunked-fail-master/target/native-vertx-chunked-fail-0.0.1-SNAPSHOT.jar
classlist: 5,876.85 ms
(cap): 1,838.74 ms
setup: 4,043.60 ms
Warning: RecomputeFieldValue.ArrayIndexScale automatic substitution failed. The automatic substitution registration was attempted because a call to sun.misc.Unsafe.arrayIndexScale(Class) was detected in the static initializer of io.netty.util.internal.shaded.org.jctools.util.UnsafeRefArrayAccess. Detailed failure reason(s): Could not determine the field where the value produced by the call to sun.misc.Unsafe.arrayIndexScale(Class) for the array index scale computation is stored. The call is not directly followed by a field store or by a sign extend node followed directly by a field store.
(typeflow): 28,721.37 ms
(objects): 18,540.57 ms
(features): 209.08 ms
analysis: 48,554.12 ms
universe: 1,475.05 ms
(parse): 9,714.22 ms
(inline): 6,294.66 ms
(compile): 56,843.34 ms
compile: 75,483.35 ms
image: 6,615.42 ms
debuginfo: 12,386.15 ms
write: 8,850.38 ms
[total]: 163,432.16 ms
and, needless to say, that warning about the missing RecomputeFieldValue.ArrayIndexScale substitution is worrying, since I now know that unsafe access of arrays is happening in this code.
I am going to hand this over to our RecomputeFieldValue guy, @cstancu, to see if he has any wise words to say.
A nit on the otherwise exemplary README.txt file: The lines
- ./native-vertx-chunked-fail-0.0.1-SNAPSHOT
Load http://localhost:9000/
made me (just a VM engineer) think that "Load http://localhost:9000/" was output from the application, so I stared at it not doing anything until I thought to take a browser and point it at that URL.
But I can reliably reproduce the problem. I get (at least) two of them per refresh in the browser.
I added some tracing to Target_sun_nio_ch_FileDispatcherImpl.writev0(FileDescriptor, long, int). That is the first place where the call chain enters the VM. Here is what I see:
[Target_sun_nio_ch_FileDispatcherImpl.writev0: fd: 34 iov: 0x7fffe0000970 len: 5 sizeof(struct iovec): 16
First-level of the iovec[]:
&iovec[ 0 ]: 0x7fffe0000970 .iov_base: 0x7ffff4601010 .iov_len: 47
&iovec[ 1 ]: 0x7fffe0000980 .iov_base: 0x7ffff4603010 .iov_len: 4
&iovec[ 2 ]: 0x7fffe0000990 .iov_base: 0x7ffff4603020 .iov_len: 16
&iovec[ 3 ]: 0x7fffe00009a0 .iov_base: 0x7fbf6802a400 .iov_len: 2
&iovec[ 4 ]: 0x7fffe00009b0 .iov_base: 0x7fbf6802a420 .iov_len: 5
Second-level of the iovec[]:
&iovec[ 0 ]: 0x7fffe0000970 .iov_base: 0x7ffff4601010 .iov_len: 47
&.iov_base[ 0 ]: 0x7ffff4601010 .iov_base[ 0 ]: 0x48 = H
&.iov_base[ 1 ]: 0x7ffff4601011 .iov_base[ 1 ]: 0x54 = T
&.iov_base[ 2 ]: 0x7ffff4601012 .iov_base[ 2 ]: 0x54 = T
&.iov_base[ 3 ]: 0x7ffff4601013 .iov_base[ 3 ]: 0x50 = P
&.iov_base[ 4 ]: 0x7ffff4601014 .iov_base[ 4 ]: 0x2f = /
&.iov_base[ 5 ]: 0x7ffff4601015 .iov_base[ 5 ]: 0x31 = 1
&.iov_base[ 6 ]: 0x7ffff4601016 .iov_base[ 6 ]: 0x2e = .
&.iov_base[ 7 ]: 0x7ffff4601017 .iov_base[ 7 ]: 0x31 = 1
&.iov_base[ 8 ]: 0x7ffff4601018 .iov_base[ 8 ]: 0x20 =
&.iov_base[ 9 ]: 0x7ffff4601019 .iov_base[ 9 ]: 0x32 = 2
&.iov_base[ 10 ]: 0x7ffff460101a .iov_base[ 10 ]: 0x30 = 0
&.iov_base[ 11 ]: 0x7ffff460101b .iov_base[ 11 ]: 0x30 = 0
&.iov_base[ 12 ]: 0x7ffff460101c .iov_base[ 12 ]: 0x20 =
&.iov_base[ 13 ]: 0x7ffff460101d .iov_base[ 13 ]: 0x4f = O
&.iov_base[ 14 ]: 0x7ffff460101e .iov_base[ 14 ]: 0x4b = K
&.iov_base[ 15 ]: 0x7ffff460101f .iov_base[ 15 ]: 0xd =
&.iov_base[ 16 ]: 0x7ffff4601020 .iov_base[ 16 ]: 0xa =
&.iov_base[ 17 ]: 0x7ffff4601021 .iov_base[ 17 ]: 0x54 = T
&.iov_base[ 18 ]: 0x7ffff4601022 .iov_base[ 18 ]: 0x72 = r
&.iov_base[ 19 ]: 0x7ffff4601023 .iov_base[ 19 ]: 0x61 = a
&.iov_base[ 20 ]: 0x7ffff4601024 .iov_base[ 20 ]: 0x6e = n
&.iov_base[ 21 ]: 0x7ffff4601025 .iov_base[ 21 ]: 0x73 = s
&.iov_base[ 22 ]: 0x7ffff4601026 .iov_base[ 22 ]: 0x66 = f
&.iov_base[ 23 ]: 0x7ffff4601027 .iov_base[ 23 ]: 0x65 = e
&.iov_base[ 24 ]: 0x7ffff4601028 .iov_base[ 24 ]: 0x72 = r
&.iov_base[ 25 ]: 0x7ffff4601029 .iov_base[ 25 ]: 0x2d = -
&.iov_base[ 26 ]: 0x7ffff460102a .iov_base[ 26 ]: 0x45 = E
&.iov_base[ 27 ]: 0x7ffff460102b .iov_base[ 27 ]: 0x6e = n
&.iov_base[ 28 ]: 0x7ffff460102c .iov_base[ 28 ]: 0x63 = c
&.iov_base[ 29 ]: 0x7ffff460102d .iov_base[ 29 ]: 0x6f = o
&.iov_base[ 30 ]: 0x7ffff460102e .iov_base[ 30 ]: 0x64 = d
&.iov_base[ 31 ]: 0x7ffff460102f .iov_base[ 31 ]: 0x69 = i
&.iov_base[ 32 ]: 0x7ffff4601030 .iov_base[ 32 ]: 0x6e = n
&.iov_base[ 33 ]: 0x7ffff4601031 .iov_base[ 33 ]: 0x67 = g
&.iov_base[ 34 ]: 0x7ffff4601032 .iov_base[ 34 ]: 0x3a = :
&.iov_base[ 35 ]: 0x7ffff4601033 .iov_base[ 35 ]: 0x20 =
&.iov_base[ 36 ]: 0x7ffff4601034 .iov_base[ 36 ]: 0x63 = c
&.iov_base[ 37 ]: 0x7ffff4601035 .iov_base[ 37 ]: 0x68 = h
&.iov_base[ 38 ]: 0x7ffff4601036 .iov_base[ 38 ]: 0x75 = u
&.iov_base[ 39 ]: 0x7ffff4601037 .iov_base[ 39 ]: 0x6e = n
&.iov_base[ 40 ]: 0x7ffff4601038 .iov_base[ 40 ]: 0x6b = k
&.iov_base[ 41 ]: 0x7ffff4601039 .iov_base[ 41 ]: 0x65 = e
&.iov_base[ 42 ]: 0x7ffff460103a .iov_base[ 42 ]: 0x64 = d
&.iov_base[ 43 ]: 0x7ffff460103b .iov_base[ 43 ]: 0xd =
&.iov_base[ 44 ]: 0x7ffff460103c .iov_base[ 44 ]: 0xa =
&.iov_base[ 45 ]: 0x7ffff460103d .iov_base[ 45 ]: 0xd =
&.iov_base[ 46 ]: 0x7ffff460103e .iov_base[ 46 ]: 0xa =
&iovec[ 1 ]: 0x7fffe0000980 .iov_base: 0x7ffff4603010 .iov_len: 4
&.iov_base[ 0 ]: 0x7ffff4603010 .iov_base[ 0 ]: 0x31 = 1
&.iov_base[ 1 ]: 0x7ffff4603011 .iov_base[ 1 ]: 0x30 = 0
&.iov_base[ 2 ]: 0x7ffff4603012 .iov_base[ 2 ]: 0xd =
&.iov_base[ 3 ]: 0x7ffff4603013 .iov_base[ 3 ]: 0xa =
&iovec[ 2 ]: 0x7fffe0000990 .iov_base: 0x7ffff4603020 .iov_len: 16
&.iov_base[ 0 ]: 0x7ffff4603020 .iov_base[ 0 ]: 0x48 = H
&.iov_base[ 1 ]: 0x7ffff4603021 .iov_base[ 1 ]: 0x69 = i
&.iov_base[ 2 ]: 0x7ffff4603022 .iov_base[ 2 ]: 0x20 =
&.iov_base[ 3 ]: 0x7ffff4603023 .iov_base[ 3 ]: 0x66 = f
&.iov_base[ 4 ]: 0x7ffff4603024 .iov_base[ 4 ]: 0x72 = r
&.iov_base[ 5 ]: 0x7ffff4603025 .iov_base[ 5 ]: 0x6f = o
&.iov_base[ 6 ]: 0x7ffff4603026 .iov_base[ 6 ]: 0x6d = m
&.iov_base[ 7 ]: 0x7ffff4603027 .iov_base[ 7 ]: 0x20 =
&.iov_base[ 8 ]: 0x7ffff4603028 .iov_base[ 8 ]: 0x62 = b
&.iov_base[ 9 ]: 0x7ffff4603029 .iov_base[ 9 ]: 0x75 = u
&.iov_base[ 10 ]: 0x7ffff460302a .iov_base[ 10 ]: 0x66 = f
&.iov_base[ 11 ]: 0x7ffff460302b .iov_base[ 11 ]: 0x66 = f
&.iov_base[ 12 ]: 0x7ffff460302c .iov_base[ 12 ]: 0x65 = e
&.iov_base[ 13 ]: 0x7ffff460302d .iov_base[ 13 ]: 0x72 = r
&.iov_base[ 14 ]: 0x7ffff460302e .iov_base[ 14 ]: 0x20 =
&.iov_base[ 15 ]: 0x7ffff460302f .iov_base[ 15 ]: 0x32 = 2
&iovec[ 3 ]: 0x7fffe00009a0 .iov_base: 0x7fbf6802a400 .iov_len: 2
&.iov_base[ 0 ]: 0x7fbf6802a400
Thread 5 "ntloop-thread-1" received signal SIGSEGV, Segmentation fault.
That is, I have been handed an iovec[5]. The top-level structure seems okay: there are 5 elements, they are correctly laid out in memory (16 bytes each), and they have iov_base and iov_len fields that seem plausible. (But I know nothing about what an iovec should look like: the iov_base fields look like addresses and the iov_len fields look like lengths.)
Looking at the contents of the iov_base fields fails. The first 3 seem fine. (But again, I have no idea what they should contain: the first looks like some kind of response header ending in an empty line; the second might be the length (as a hex string!?) of the next iovec; and the third is the contents of the response.) Then things go pear-shaped. iovec[3].iov_base points to unmapped memory, fully justifying writev returning an error code of "Bad address".
Just for fun (?) I used gdb to look at the iovec.iov_base fields:
(gdb) x/47c 0x7ffff4601010
0x7ffff4601010: 'H' 'T' 'T' 'P' '/' '1' '.' '1'
0x7ffff4601018: ' ' '2' '0' '0' ' ' 'O' 'K' '\r'
0x7ffff4601020: '\n' 'T' 'r' 'a' 'n' 's' 'f' 'e'
0x7ffff4601028: 'r' '-' 'E' 'n' 'c' 'o' 'd' 'i'
0x7ffff4601030: 'n' 'g' ':' ' ' 'c' 'h' 'u' 'n'
0x7ffff4601038: 'k' 'e' 'd' '\r' '\n' '\r' '\n'
(gdb) x/4c 0x7ffff4603010
0x7ffff4603010: '1' '0' '\r' '\n'
(gdb) x/16c 0x7ffff4603020
0x7ffff4603020: 'H' 'i' ' ' 'f' 'r' 'o' 'm' ' '
0x7ffff4603028: 'b' 'u' 'f' 'f' 'e' 'r' ' ' '2'
(gdb) x/2c 0x7fbf6802a400
0x7fbf6802a400: Cannot access memory at address 0x7fbf6802a400
(gdb) x/5c 0x7fbf6802a420
0x7fbf6802a420: Cannot access memory at address 0x7fbf6802a420
Not much news there. Not only is iovec[3].iov_base a bad address, but so is iovec[4].iov_base.
I am hoping that the problem is in the RecomputeFieldValue.ArrayIndexScale warning, and that @cstancu can figure it out.
I added some tracing to Target_sun_misc_Unsafe.allocateMemory that shows me
[Target_sun_misc_Unsafe.allocateMemory: bytes: 16777216 result: 0x7ffff45ff010]
[Target_sun_misc_Unsafe.allocateMemory: bytes: 80 result: 0x7fffe0000970]
[Target_sun_nio_ch_FileDispatcherImpl.writev0: fd: 34 iov: 0x7fffe0000970 len: 5 sizeof(struct iovec): 16
First-level of the iovec[]:
&iovec[ 0 ]: 0x7fffe0000970 .iov_base: 0x7ffff4601010 .iov_len: 47
&iovec[ 1 ]: 0x7fffe0000980 .iov_base: 0x7ffff4603010 .iov_len: 4
&iovec[ 2 ]: 0x7fffe0000990 .iov_base: 0x7ffff4603020 .iov_len: 16
&iovec[ 3 ]: 0x7fffe00009a0 .iov_base: 0x7f57bc2c47f0 .iov_len: 2
&iovec[ 4 ]: 0x7fffe00009b0 .iov_base: 0x7f57bc2c4810 .iov_len: 5
The iovec[] fits exactly in the 80 bytes at 0x7fffe0000970. The first 3 iov_bases are in the middle of the 16MB (!) at 0x7ffff45ff010. 0x7f57bc2c47f0 and 0x7f57bc2c4810 point to unallocated memory.
I still do not know whose fault that is. Should there be 5 iovecs in the response, or is 3 the right answer?
I was hoping that https://github.com/oracle/graal/issues/400#issuecomment-410298492 meant I could set setChunked(false) and watch what happens on a successful run. But if I do that I get
SEVERE: Unhandled exception
java.lang.IllegalStateException: You must set the Content-Length header to be the total size of the message body BEFORE sending any data if you are not using HTTP chunked encoding.
I am obviously out of my depth.
Sorry about the lack of a working version, I just pushed it:
vertx.createHttpServer().requestHandler(req -> {
    // this doesn't work
    req.response().setChunked(true);
    // this alternative works
    // req.response().putHeader("Content-Length", "16");
    req.response().write(Buffer.buffer("Hi from buffer 2"));
    req.response().end();
}).listen(9000);
Basically, if you're not chunking, you have to set the HTTP Content-Length header instead.
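The wire-level difference between the two modes, as a rough plain-Java sketch (the response text here is illustrative, not taken from the repro):

```java
public class FixedLengthFraming {
    public static void main(String[] args) {
        String body = "Hi from buffer 2";
        // With Content-Length framing the body is written as-is: no hex
        // chunk-size line, no per-chunk CRLF, no terminating zero chunk.
        String response = "HTTP/1.1 200 OK\r\n"
                + "Content-Length: " + body.length() + "\r\n"
                + "\r\n"
                + body;
        System.out.println(response.contains("Content-Length: 16")); // true
        System.out.println(response.contains("\r\n0\r\n"));          // false
    }
}
```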
I was able to get rid of the unsafe warning by following https://medium.com/graalvm/instant-netty-startup-using-graalvm-native-image-generation-ed6f14ff7692 and adding this substitution:
import com.oracle.svm.core.annotate.Alias;
import com.oracle.svm.core.annotate.RecomputeFieldValue;
import com.oracle.svm.core.annotate.RecomputeFieldValue.Kind;
import com.oracle.svm.core.annotate.TargetClass;

@TargetClass(className = "io.netty.util.internal.shaded.org.jctools.util.UnsafeRefArrayAccess")
final class Target_io_netty_util_internal_shaded_org_jctools_util_UnsafeRefArrayAccess {
    @Alias @RecomputeFieldValue(kind = Kind.ArrayIndexShift, declClass = Object[].class)
    public static int REF_ELEMENT_SHIFT;
}
However, that does not appear related to this particular bad address issue, because it does not fix it.
The underlying problem is that the netty code constructs DirectByteBuffer instances in static initializers. We run the static initializers during image building, so we currently build an image with DirectByteBuffer instances whose address() method returns the address of memory in the image builder, which is long gone by the time we are running the image. When those DirectByteBuffer instances are used, if we are lucky we get a segfault. (If we are unlucky we write to memory. Or read from memory: that would be an issue, too.)
I will add a detector so the problem source code will be identified and we will not construct images with DirectByteBuffer instances in their image heaps. That will not fix the underlying problems in netty.
Thanks. Do you have any pointer to the netty code that does this? I only saw places where netty would do that only if -Dio.netty.noUnsafe=true was missing, which is not the case here.
Also, I assume you guys must have a pretty damn good reason to run static init code during compilation and not at run time, right? Why do you do that? It sounds very dangerous to me, since those static init blocks are expected to be called at run time under Java semantics. There must be a ton of code that does IO in static blocks, and it must surely break if that is not done at run time. I'm sure you get that question a lot; sorry if it's documented somewhere and I didn't see it, but perhaps you can point me to it?
That we run static initializers during image building is documented as a limitation.
There are lots of reasons to want to run static initializers during image building (initialization order, faster startup of the generated image, etc.). We are working on a mechanism to allow static initializers to be delayed until runtime, but it is not ready quite yet.
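On a regular JVM the two patterns below behave identically; the point is only to show the shape of the hazard and of the suggested workaround ("write your own initialization methods and call them explicitly"). All class and method names here are made up for illustration:

```java
import java.nio.ByteBuffer;

// Hazard: a static initializer that allocates native memory. Under
// native-image this runs in the image builder, so the buffer's native
// address would belong to the builder process, not the generated binary.
class EagerHolder {
    static final ByteBuffer BUF = ByteBuffer.allocateDirect(16);
}

// Workaround: allocate from an explicit initialization method that is
// only ever called at run time.
class LazyHolder {
    private static ByteBuffer buf;

    static synchronized ByteBuffer buffer() {
        if (buf == null) {
            buf = ByteBuffer.allocateDirect(16); // runs in the final process
        }
        return buf;
    }
}

public class StaticInitDemo {
    public static void main(String[] args) {
        System.out.println(EagerHolder.BUF.isDirect());
        System.out.println(LazyHolder.buffer().capacity());
    }
}
```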
Thanks for the explanation. I haven't run any kind of study on how static init is typically written across Maven Central, but 20 years of writing Java code tells me to expect a significant percentage (>20%) of init code that can only work at runtime, so I strongly suspect that the new mechanism to run static init at run time will be extremely useful in getting most apps running on Graal.
OK, I found it: it was in io.netty.handler.codec.http.HttpObjectEncoder that we had static direct buffers for the last chunks. I was able to fix them with this substitution:
import static io.netty.buffer.Unpooled.buffer;
import static io.netty.handler.codec.http.HttpConstants.CR;
import static io.netty.handler.codec.http.HttpConstants.LF;

import com.oracle.svm.core.annotate.Alias;
import com.oracle.svm.core.annotate.RecomputeFieldValue;
import com.oracle.svm.core.annotate.RecomputeFieldValue.Kind;
import com.oracle.svm.core.annotate.TargetClass;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import jdk.vm.ci.meta.MetaAccessProvider;
import jdk.vm.ci.meta.ResolvedJavaField;

@TargetClass(className = "io.netty.handler.codec.http.HttpObjectEncoder")
final class Target_io_netty_handler_codec_http_HttpObjectEncoder {
    @Alias @RecomputeFieldValue(kind = Kind.Custom, declClass = Recomputer1.class)
    private static ByteBuf CRLF_BUF;
    @Alias @RecomputeFieldValue(kind = Kind.Custom, declClass = Recomputer2.class)
    private static ByteBuf ZERO_CRLF_CRLF_BUF;
}

final class Recomputer1 implements RecomputeFieldValue.CustomFieldValueComputer {
    @Override
    public Object compute(MetaAccessProvider metaAccess, ResolvedJavaField original,
            ResolvedJavaField annotated, Object receiver) {
        return Unpooled.unreleasableBuffer(buffer(2).writeByte(CR).writeByte(LF));
    }
}

final class Recomputer2 implements RecomputeFieldValue.CustomFieldValueComputer {
    private static final byte[] ZERO_CRLF_CRLF = { '0', CR, LF, CR, LF };

    @Override
    public Object compute(MetaAccessProvider metaAccess, ResolvedJavaField original,
            ResolvedJavaField annotated, Object receiver) {
        return Unpooled.unreleasableBuffer(buffer(ZERO_CRLF_CRLF.length)
                .writeBytes(ZERO_CRLF_CRLF));
    }
}
I think I see
(A) io.netty.handler.codec.http.HttpObjectEncoder.CRLF_BUF = unreleasableBuffer(directBuffer(2).writeByte(CR).writeByte(LF));
public static ByteBuf io.netty.buffer.Unpooled.directBuffer(int initialCapacity) {
return ALLOC.directBuffer(initialCapacity);
}
ByteBuf io.netty.buffer.ByteBufAllocator.directBuffer(int initialCapacity);
public ByteBuf io.netty.buffer.AbstractByteBufAllocator.directBuffer(int initialCapacity) {
return directBuffer(initialCapacity, DEFAULT_MAX_CAPACITY);
}
public ByteBuf io.netty.buffer.AbstractByteBufAllocator.directBuffer(int initialCapacity, int maxCapacity) {
....
return newDirectBuffer(initialCapacity, maxCapacity);
}
(B) protected abstract ByteBuf io.netty.buffer.AbstractByteBufAllocator.newDirectBuffer(int initialCapacity, int maxCapacity);
(C1) protected ByteBuf io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(int initialCapacity, int maxCapacity) {
....
buf = PlatformDependent.hasUnsafe() ?
UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity) :
new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity);
}
(D) public io.netty.buffer.UnpooledDirectByteBuf.UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
....
(E) setByteBuffer(ByteBuffer.allocateDirect(initialCapacity));
}
(C2) protected ByteBuf io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(int initialCapacity, int maxCapacity) {
....
buf = new InstrumentedUnpooledDirectByteBuf(this, initialCapacity, maxCapacity);
}
InstrumentedUnpooledDirectByteBuf(UnpooledByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
super(alloc, initialCapacity, maxCapacity);
}
(D) public io.netty.buffer.UnpooledDirectByteBuf.UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity) {
....
(E) setByteBuffer(ByteBuffer.allocateDirect(initialCapacity));
}
Starting from the static initialization during image building of (A) io.netty.handler.codec.http.HttpObjectEncoder.CRLF_BUF, we call down to the abstract method (B) io.netty.buffer.AbstractByteBufAllocator.newDirectBuffer(int initialCapacity, int maxCapacity), which could be either (C1) io.netty.buffer.PooledByteBufAllocator.newDirectBuffer or (C2) io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer, but both of those end up calling down to (D) the constructor io.netty.buffer.UnpooledDirectByteBuf.UnpooledDirectByteBuf(ByteBufAllocator alloc, int initialCapacity, int maxCapacity), which calls (E) ByteBuffer.allocateDirect(initialCapacity), which allocates a java.nio.DirectByteBuffer.
Adding a little more tracing to App.main() to print out CRLF_BUF.toString() and CRLF_BUF.isDirect(), I see
io.netty.handler.codec.http.HttpObjectEncoder.CRLF_BUF: UnreleasableByteBuf(UnpooledByteBufAllocator$InstrumentedUnpooledDirectByteBuf(ridx: 0, widx: 2, cap: 2)) isDirect: true
I think that CRLF_BUF was a DirectByteBuffer that was created during image building (e.g., with a long address field pointing to malloc'ed memory in the image builder). If we are lucky, that address will segfault at runtime.
I do not think we can allow DirectByteBuffer instances created during image building to persist into the heap at runtime. I do not want to do this analysis for each instance. I am not, yet, sure what to do about them.
Adding some experimental tracing to the image builder to find DirectByteBuffer instances for the sample program shows
Error: Detected a direct ByteBuffer in the image heap. This is not supported. The object was reached from a static initializer. All static class initialization is done during native image construction, thus a static initializer cannot contain code that captures state dependent on the build machine. Write your own initialization methods and call them explicitly from your main entry point.
Trace:
object io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledDirectByteBuf
object io.netty.buffer.UnreleasableByteBuf
at io.netty.handler.codec.http.HttpObjectEncoder.encodeChunkedContent(HttpObjectEncoder.java:195)
....
Trace:
object io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledDirectByteBuf
object io.netty.buffer.UnreleasableByteBuf
at io.netty.handler.codec.http.HttpObjectEncoder.encodeChunkedContent(HttpObjectEncoder.java:195)
....
Trace:
at parsing io.netty.buffer.CompositeByteBuf.internalNioBuffer(CompositeByteBuf.java:1445)
....
Trace:
at parsing io.netty.buffer.CompositeByteBuf.nioBuffer(CompositeByteBuf.java:1459)
....
Trace:
at parsing io.netty.buffer.CompositeByteBuf.nioBuffers(CompositeByteBuf.java:1482)
....
Trace:
at parsing io.netty.buffer.EmptyByteBuf.internalNioBuffer(EmptyByteBuf.java:934)
....
Trace:
at parsing io.netty.buffer.EmptyByteBuf.nioBuffer(EmptyByteBuf.java:912)
....
Trace:
at parsing io.netty.buffer.EmptyByteBuf.nioBuffers(EmptyByteBuf.java:923)
The "object" trace is the class hierarchy. I don't know if that is useful. The "parsing" trace shows the method whose bytecodes we were examining when the DirectByteBuffer was encountered. That is not a lot of help, but suggestions for improvements are welcome (since you are the users of these messages). By the time we are building the initial heap for the image, we do not know how the object ended up in the heap of the image builder, just that it is there because we ran a static initializer.
I think it would help a lot more if you could instrument the constructor of DirectByteBuffer to yell on static block initialisation during native-image. Or better yet, support static init at runtime :)
@FroMage Your workaround that imports from jdk.vm.ci.meta - what JDK version and options did you use to compile that? I'm keen to see if I can get my gRPC case working with this netty workaround!
I seem to get errors like this whatever I do using java9 or java10:
Compiling 7 Java source files...
ERROR: /Users/sundbp/dev/dotfiles/boot/cache/tmp/Users/sundbp/dev/native-image-grpc-test/jd9/idqqty/grpctest/api/GrpcTestGrpc.java, line 20: cannot find symbol
symbol: class Generated
location: package javax.annotation
ERROR: /Users/sundbp/dev/dotfiles/boot/cache/tmp/Users/sundbp/dev/native-image-grpc-test/jd9/idqqty/SVMSubstitutions.java, line 22: package jdk.vm.ci.meta is not visible
(package jdk.vm.ci.meta is declared in module jdk.internal.vm.ci, which is not in the module graph)
ERROR: /Users/sundbp/dev/dotfiles/boot/cache/tmp/Users/sundbp/dev/native-image-grpc-test/jd9/idqqty/SVMSubstitutions.java, line 23: package jdk.vm.ci.meta is not visible
(package jdk.vm.ci.meta is declared in module jdk.internal.vm.ci, which is not in the module graph)
I've tried adding the JVM flag --add-modules jdk.internal.vm.ci, but no cigar.
I think you have to run a JVM that is the jdk8 from Graal (with compiler extensions) or jdk11.
I'm trying to make a CLI tool for accessing a gRPC API, run using native-image. It's using the gRPC Java libraries, and hence grpc-netty. I've managed to make an executable, and via a reflection config file I dealt with netty's ReflectiveChannelFactory. I can see that I can actually create the gRPC client and see a few TCP packets between my client and server. So far so good.
When I actually try to make an API call I get an error relating to a failed write to the socket. I tried both on OSX and Linux, same exact outcome.
The stacktrace is:
I'm feeling close to getting it all to work, but am stumped at this point. I'm hoping that more experienced eyes may make something out of why the writev() fails here?
As I said, on connecting the gRPC client to the server I can see a few packets of negotiation being sent, so the "ability to write to the socket" seems to be working somewhere inside the image.
Very encouraged by the startup time after using native-image! It makes Clojure CLI tools nice to use.
Thanks!