netty / netty-incubator-transport-io_uring

Apache License 2.0

zlib and io_uring #40

Open AlfieC opened 3 years ago

AlfieC commented 3 years ago

hey,

we have a pretty large codebase so I'll try to pull out the most important parts

we essentially dropped in epoll support where we previously used nio, and had no issues there. we later deployed kernel 5 with the io_uring module and started to have issues. we compress the network stream data with zlib (mostly implemented natively to avoid copying the ByteBuf)

error flows from here: https://github.com/SpigotMC/BungeeCord/blob/master/native/src/main/c/NativeCompressImpl.cpp#L76

it always fails with return code -2. I would usually attribute this to a bug on our side, but the issue only surfaces when we put the "proxy" type server on io_uring; epoll and nio work without issue. not sure what kind of logs you need, but I can try to provide anything requested.

normanmaurer commented 3 years ago

Can you show me how you call the code and the full stack trace? -2 is a valid zlib error code (Z_STREAM_ERROR, stream error).

Bye Norman
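
For reference while reading the numeric codes in this thread, here is a minimal lookup of the return codes defined in zlib.h. The class name `ZlibReturnCodes` and the `describe` helper are made up for illustration and are not part of Netty or BungeeCord.

```java
import java.util.Map;

/** Symbolic names for the return codes defined in zlib.h. */
final class ZlibReturnCodes {
    private static final Map<Integer, String> NAMES = Map.of(
        0, "Z_OK",
        1, "Z_STREAM_END",
        2, "Z_NEED_DICT",
        -1, "Z_ERRNO",
        -2, "Z_STREAM_ERROR",
        -3, "Z_DATA_ERROR",
        -4, "Z_MEM_ERROR",
        -5, "Z_BUF_ERROR",
        -6, "Z_VERSION_ERROR");

    /** Returns the zlib.h constant name for a code, or a placeholder if unknown. */
    static String describe(int code) {
        return NAMES.getOrDefault(code, "unknown (" + code + ")");
    }

    public static void main(String[] args) {
        System.out.println(describe(-2)); // Z_STREAM_ERROR
        System.out.println(describe(-3)); // Z_DATA_ERROR
    }
}
```

Z_STREAM_ERROR usually points at an inconsistent stream state or invalid parameters, while Z_DATA_ERROR means the input bytes themselves did not look like valid deflate data.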

AlfieC commented 3 years ago

these errors only originate when we use this in the pipeline:

    private final BungeeZlib zlib = CompressFactory.zlib.newInstance();

    @Override
    public void handlerAdded(ChannelHandlerContext ctx) throws Exception {
        zlib.init(true, Deflater.DEFAULT_COMPRESSION);
    }

    @Override
    public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
        zlib.free();
    }

    @Override
    protected void encode(ChannelHandlerContext ctx, ByteBuf msg, ByteBuf out) throws Exception {
        int origSize = msg.readableBytes();
        if (origSize < 256) {
            writeVarInt(0, out);
            out.writeBytes(msg);
        } else {
            writeVarInt(origSize, out);
            zlib.process(msg, out);
        }
    }

    public static void writeVarInt(int val, ByteBuf out) {
        while ((val & -128) != 0) {
            out.writeByte(val & 127 | 128);
            val >>>= 7;
        }

        out.writeByte(val);
    }
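
To make the length prefix easier to reason about, here is a small round-trip sketch of the VarInt framing. `writeVarInt` uses the same bit logic as the method above but writes to a plain stream; `readVarInt` is a hypothetical counterpart written for this example and is not the actual `DefinedPacket.readVarInt` implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

final class VarIntSketch {
    /** Same bit logic as the encoder's writeVarInt: 7 payload bits per byte, high bit = "more follows". */
    static void writeVarInt(int val, ByteArrayOutputStream out) {
        while ((val & -128) != 0) {
            out.write(val & 127 | 128);
            val >>>= 7;
        }
        out.write(val);
    }

    /** Hypothetical reader: accumulates 7 bits per byte until a byte without the high bit. */
    static int readVarInt(ByteBuffer in) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            byte b = in.get();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("VarInt longer than 5 bytes");
    }

    /** Encodes a value and returns the raw bytes. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarInt(value, out);
        return out.toByteArray();
    }

    /** Encodes and then decodes a value. */
    static int roundTrip(int value) {
        return readVarInt(ByteBuffer.wrap(encode(value)));
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(300));     // 300
        System.out.println(encode(300).length); // 2
    }
}
```

If the reader starts even one byte early or late, the decoded size is garbage, which is one way a size check on the decoding side can fail before zlib is ever called.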

code from other side:

    private final int compressionThreshold;
    private final BungeeZlib zlib = CompressFactory.zlib.newInstance();

    @Override
    public void handlerAdded(ChannelHandlerContext ctx) throws Exception
    {
        zlib.init( false, 0 );
    }

    @Override
    public void handlerRemoved(ChannelHandlerContext ctx) throws Exception
    {
        zlib.free();
    }

    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) throws Exception
    {
        int size = DefinedPacket.readVarInt( in );
        if ( size == 0 )
        {
            out.add( in.slice().retain() );
            in.skipBytes( in.readableBytes() );
        } else
        {
            Preconditions.checkArgument( size >= compressionThreshold, "Decompressed size %s less than compression threshold %s", size, compressionThreshold);
            ByteBuf decompressed = ctx.alloc().directBuffer();

            try
            {
                zlib.process( in, decompressed );
                Preconditions.checkArgument( decompressed.readableBytes() == size, "Decompressed size %s is not equal to actual decompressed bytes %s", size, decompressed.readableBytes());

                out.add( decompressed );
                decompressed = null;
            } finally
            {
                if ( decompressed != null )
                {
                    decompressed.release();
                }
            }
        }
    }

apologies, error here is this one:

Preconditions.checkArgument( size >= compressionThreshold, "Decompressed size %s less than compression threshold %s", size, compressionThreshold);

I'd normally attribute this to an error of ours but it only occurs when we use io_uring - no issues using epoll or nio

AlfieC commented 3 years ago

when we remove that checkArgument, we get the -2 from zlib decompression instead

HookWoods commented 3 years ago

I did some testing a while ago, and it appears that the decoded buffer is made up of multiple NIO buffers. That's why you get the -2 from zlib: it can't decompress multiple buffers.

AlfieC commented 3 years ago

> I did some testing a while ago, and it appears that the decoded buffer is made up of multiple NIO buffers. That's why you get the -2 from zlib: it can't decompress multiple buffers.

so is this a limitation of zlib? or bug?
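
Some context on the question above: zlib's inflate API is a streaming one, so the library itself can consume a compressed stream that arrives in several pieces, as long as they are fed in order into the same stream state. The sketch below shows this with `java.util.zip.Inflater` (the JDK's zlib binding), not with BungeeCord's native wrapper; whether that wrapper accepts more than one contiguous memory region is not established in this thread. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class ChunkedInflateSketch {
    /** Compresses the input into a single zlib stream. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Feeds the chunks, in order, into one Inflater and returns the decompressed bytes. */
    static byte[] inflateChunks(byte[]... chunks) {
        Inflater inflater = new Inflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            for (byte[] chunk : chunks) {
                inflater.setInput(chunk);
                while (!inflater.needsInput() && !inflater.finished()) {
                    int n = inflater.inflate(buf);
                    if (n == 0 && inflater.needsDictionary()) {
                        throw new IllegalStateException("preset dictionary required");
                    }
                    out.write(buf, 0, n);
                }
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(data);
        int mid = compressed.length / 2;
        byte[] first = Arrays.copyOfRange(compressed, 0, mid);
        byte[] second = Arrays.copyOfRange(compressed, mid, compressed.length);
        System.out.println(Arrays.equals(data, inflateChunks(first, second))); // true
    }
}
```

So a multi-part buffer is not a zlib limitation as such; it only becomes a problem if the calling code hands zlib a single address and length that do not cover the whole stream.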

AlfieC commented 3 years ago

> I did some testing a while ago, and it appears that the decoded buffer is made up of multiple NIO buffers. That's why you get the -2 from zlib: it can't decompress multiple buffers.

thinking about this some more, I'm not sure why the issue only appears on io_uring, since epoll and nio have no issue.

HookWoods commented 3 years ago

Actually, I've done some more testing. I have a custom fork of BungeeCord (https://github.com/SpigotMC/BungeeCord) with io_uring enabled and a custom fork of PaperSpigot (https://github.com/PaperMC/Paper) with io_uring enabled. When I launch both the BungeeCord server and the Spigot server on io_uring, it doesn't work. I get this error in the BungeeCord logs:

    [21:35:32] [Netty io_uring Worker #0/INFO]: [HookWood_] disconnected with: Exception Connecting:DecoderException : net.md_5.bungee.jni.NativeCodeException: Unknown z_stream return code : -3 @ io.netty.handler.codec.MessageToMessageDecoder:98

When I launch the Spigot server on epoll, it works. I don't know why yet and I'm going to keep digging, but zlib compression doesn't work with io_uring enabled on both BungeeCord and Spigot.

normanmaurer commented 3 years ago

Can you provide a reproducer that I can run locally?

normanmaurer commented 3 years ago

@AlfieC @HookWoods ping

HookWoods commented 3 years ago

OK, I will set that up when I'm home (in 3-4 hours).

Janmm14 commented 3 years ago

> I did some testing a while ago, and it appears that the decoded buffer is made up of multiple NIO buffers. That's why you get the -2 from zlib: it can't decompress multiple buffers.

CompositeByteBuf does exist in Netty, but it only gives us a native memory address when it has exactly one component; otherwise it throws. So that cannot be the problem. It is not clear whether this is a Netty bug or a bug in BungeeCord's zlib usage. BungeeCord's native zlib code has been overhauled since the last comment here, so the issue creator should test again as well.

I'd suggest to close this issue.

chrisvest commented 3 years ago

It sounds like the only variable between working and non-working systems is the io_uring transport, and in this case the error shows up as a corrupted (I guess) zlib stream. It could be that the io_uring transport doesn't set correct read- or write-offsets on the buffers in some cases, and it just happens to get caught by zlib because it sanity checks the data it gets.
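
To illustrate that last point, here is a minimal sketch using the JDK's `java.util.zip` classes rather than Netty or BungeeCord's native code. It inflates the same compressed bytes through a correct window and through windows whose start offset or length is off. The class and method names are made up for this example, and it does not show that the io_uring transport actually produces such offsets.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class OffsetSanitySketch {
    /** Compresses the input into a single zlib stream. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns true only if buf[offset, offset + length) is one complete, valid zlib stream. */
    static boolean inflatesCleanly(byte[] buf, int offset, int length) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(buf, offset, length);
            byte[] out = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(out);
                if (n == 0 && !inflater.finished()) {
                    return false; // ran out of input before the end of the stream
                }
            }
            return true;
        } catch (DataFormatException e) {
            return false; // zlib rejected the bytes (Z_DATA_ERROR)
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] compressed = deflate("some packet payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(inflatesCleanly(compressed, 0, compressed.length));     // true
        System.out.println(inflatesCleanly(compressed, 1, compressed.length - 1)); // false
        System.out.println(inflatesCleanly(compressed, 0, compressed.length - 4)); // false
    }
}
```

A handler that copies bytes without interpreting them would pass a mis-windowed buffer along unnoticed, which is consistent with the problem only showing up in the zlib step.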

rafi67000 commented 1 year ago

any update on this?

PedroMPagani commented 2 months ago

This seems to still be an issue.