invertedtomato / packing

Library for encoding integers in the minimal number of bits. Includes VLQ and Elias Omega encoding.
MIT License

Alternative API with direct memory access. #2

Closed vpenades closed 5 years ago

vpenades commented 5 years ago

Hi, I've just started using your library, and as you said, Fibonacci is great! I've got a 30% improvement over my VarInt implementation.

My concern is about the codec API.

The IUnsignedCompressor and IUnsignedDecompressor are very clean and straightforward... but they rely on System.IO.Stream, which has a lot of internal overhead and is very slow for many operations.

For serialization and deserialization, in my experience, it is faster to just read or write the full binary blob of a file and do the encoding/decoding job in memory. Yes, there's MemoryStream, but in the end it's just an extra layer over a plain Byte[] array. Additionally, there are now a lot of new toys in C# like ArraySegment or even Span that allow for very fast array processing.

Lately, I've been changing my serialization APIs from this:

void Write(Stream s, int v);
int Read(Stream s);

to something like this:

void Write(IList<Byte>, int v);
ArraySegment<Byte> Read(ArraySegment<Byte> ptr, out int v); // returns the advanced ptr

When writing, I write to a List, which is much easier to manipulate than a Stream; for example, writing some bytes and then editing the header with the byte length. And then it's easy to write the whole list to an array.

For reading, I read all the bytes into a plain Byte[] array, and I use ArraySegment as a sort of pointer for reading.

But if that's too extreme, maybe something like this could do:

interface IStream
{
    void WriteByte(Byte value);
    Byte ReadByte();
}

as a replacement for System.IO.Stream, so that developers can roll their own reading/writing mechanisms.
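For illustration, a minimal sketch of what an implementation of this proposal could look like, backed by a List<Byte> (the ListStream name and read-position bookkeeping here are hypothetical, not part of the library):

```csharp
using System;
using System.Collections.Generic;

interface IStream
{
    void WriteByte(Byte value);
    Byte ReadByte();
}

// Hypothetical adapter: writes append to a growable List<Byte>,
// reads advance an internal cursor over the same buffer.
class ListStream : IStream
{
    private readonly List<Byte> _buffer = new List<Byte>();
    private int _readPosition;

    public void WriteByte(Byte value) => _buffer.Add(value);

    public Byte ReadByte() => _buffer[_readPosition++];

    // Flatten for output, e.g. to write the whole buffer to a file.
    public Byte[] ToArray() => _buffer.ToArray();
}
```

A Stream-backed adapter would implement the same two methods over any System.IO.Stream, which is what would make the interface a drop-in replacement.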

invertedtomato commented 5 years ago

Thanks for reaching out Vicente! I really appreciate your thoughts. I'm very interested in working towards a more performant solution (and this library is due for a refresh anyway!).

The problem I've had with void Write(IList<Byte>, int v) is the requirement for the consumer to pre-allocate a buffer of at least the maximum encoded length of the value. In trivial cases this is fine, though it gets painful in nontrivial cases: https://github.com/invertedtomato/lightweight-serialization.

Have you seen/produced any performance research on MemoryStream, particularly with an initial capacity set? I've operated under the brazen assumption that it should be particularly performant, but I'm glad to have that assumption challenged.

vpenades commented 5 years ago

When I use the pattern of void Write(IList<Byte>, int v) I don't need to preallocate the buffer... writes to the list are done with just list.Add(byte value), which grows the buffer internally.

Keep in mind that a MemoryStream in write mode operates in essentially the same way as a List; it has an internal byte buffer that keeps growing as you write to the stream.

But the difference is that List has a pretty simple API and much less internal overhead, no virtual methods, etc. On the other hand, MemoryStream subclasses Stream, so it does a lot of additional work internally.

But I think the ideal solution would be to use interface IStream { void WriteByte(Byte value); Byte ReadByte(); } since it can wrap whatever you want... a List, a Stream, etc.

I haven't done any research on MemoryStream, but I've seen the internal implementation, and it's much more complex than the implementation of List<Byte>.Add(Byte value):

MemoryStream.WriteByte source code: https://referencesource.microsoft.com/#mscorlib/system/io/memorystream.cs,634

List<T>.Add(T value) source code: https://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs,220

invertedtomato commented 5 years ago

I've been pondering this one over the last day and came to a rather pragmatic conclusion - just do some tests. I wrote the following:

static void Main(string[] args) {
    var bytes = 100000000;
    for (var itr = 0; itr < 3; itr++) {
        var stopwatch = Stopwatch.StartNew();

        // Populate list
        var list = new List<byte>();
        for (var i = 0; i < bytes; i++) {
            list.Add(0x00);
        }

        // Flatten for output
        var listoutput = list.ToArray();
        Console.WriteLine($"List   {stopwatch.ElapsedMilliseconds}ms");

        stopwatch = Stopwatch.StartNew();
        using (var stream = new MemoryStream()) {
            // Populate stream
            for (var i = 0; i < bytes; i++) {
                stream.WriteByte(0x00);
            }

            // Flatten for output
            var streamoutput = stream.ToArray();
        }
        Console.WriteLine($"Stream {stopwatch.ElapsedMilliseconds}ms");
    }
}

And on my machine I'm getting the following results:

List   505ms
Stream 873ms
List   476ms
Stream 801ms
List   401ms
Stream 768ms

Assuming my testing methodology is solid (correct me if you see a flaw in it), those results heartily back up your assertion! At least in terms of running efficiency. Perhaps MemoryStream is still more memory-efficient, but that's not a focus for this type of compression.

Okay, I'm with you so far. So on to the implementation! I've no problem with this one - it's clear and to-the-point. Maybe it could return "bytes written", but the scenario where I needed that disappeared in a rewrite, so I'm happy without.

void Encode(IList<Byte> buffer, ulong value);

This one doesn't seem great though, mostly because there would need to be external conversion between an encode and a decode.

ArraySegment<Byte> Decode(ArraySegment<Byte> buffer, out ulong value);

And I understand why you'd not rush to propose this one. There's a conversion process after reading.

ulong Decode(IList<Byte> buffer);

So I checked what that overhead would be:

static void Main(string[] args) {
    for (var itr = 0; itr < 3; itr++) {
        var bytes = 100000000;
        var array = new byte[bytes];

        var stopwatch = Stopwatch.StartNew();
        var list = new List<byte>(array);
        stopwatch.Stop();
        Console.WriteLine($"{stopwatch.ElapsedMilliseconds}ms");
    }
}

On my machine that's converting 100MB-ish in around 125ms. It's not much, but it's non-trivial.

175ms
125ms
142ms

And so I end up back with your original proposal. Tweaked a little like this:

void Encode(IByteReaderWriter buffer, ulong value);
ulong Decode(IByteReaderWriter buffer);

interface IByteReaderWriter
{
    void WriteByte(Byte value);
    Byte ReadByte();
}

Give me some time to double check my thoughts, but it's looking like that refresh is in order sooner rather than later!


vpenades commented 5 years ago

Yes, I think using the IByteReaderWriter interface is the best solution.

One of the biggest benefits of List is that it is very fast to "write" to it... but as you mentioned, it has the side effect of requiring copying the content to the actual stream afterwards.

That's why I'm using my own implementation of List<>, which exposes the internal allocation array, so I can stream it directly to a file.

I forgot to mention there's yet another implementation for reading/writing that's even more "abstract" than an interface, which is to use lambdas:

void Encode(Action<Byte> writeCallback, ulong value);
ulong Decode(Func<Byte> readCallback);

With such a low level interface, it is very easy to implement any kind of reader and writer, for example:

var buffer = new List<Byte>();
codec.Encode(v => buffer.Add(v), 17);

using (var m = new MemoryStream())
{
    codec.Decode(() => (Byte)m.ReadByte()); // Stream.ReadByte returns int, so cast
}

I usually don't use these methods directly, but whenever I have a complex byte-streaming method that I need to use both with MemoryStream and with direct memory access, I implement the low-level methods with lambdas, so they can be easily wrapped by any streaming container.
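As a concrete sketch of that pattern, here is a round trip through the callback surface, using a tiny VLQ-style codec as a stand-in (this LambdaCodec is illustrative only, not the library's Fibonacci implementation): the same Encode/Decode pair serves both a List<Byte> and a MemoryStream, purely through the lambdas.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LambdaCodec
{
    // Illustrative VLQ stand-in: 7 data bits per byte, high bit = "more follows".
    public static void Encode(Action<Byte> writeCallback, ulong value)
    {
        while (value >= 0x80)
        {
            writeCallback((Byte)(value | 0x80)); // low 7 bits + continuation flag
            value >>= 7;
        }
        writeCallback((Byte)value);
    }

    public static ulong Decode(Func<Byte> readCallback)
    {
        ulong value = 0;
        var shift = 0;
        Byte b;
        do
        {
            b = readCallback();
            value |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}

class Demo
{
    static void Main()
    {
        // Write via a List<Byte>...
        var buffer = new List<Byte>();
        LambdaCodec.Encode(b => buffer.Add(b), 300);

        // ...and read back via a MemoryStream, using the exact same codec.
        using (var m = new MemoryStream(buffer.ToArray()))
        {
            Console.WriteLine(LambdaCodec.Decode(() => (Byte)m.ReadByte())); // prints 300
        }
    }
}
```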

It just happens I've been exploring all sorts of alternatives to streams on binary serialization/deserialization :)

invertedtomato commented 5 years ago

Alrighty, check this out for size: https://github.com/invertedtomato/integer-compression/blob/master/Library/Compression/Integers/Wave3/FibonacciCodec.cs

I also popped in a wrapper for IList<Byte>: https://github.com/invertedtomato/integer-compression/blob/master/Library/Compression/Integers/Wave3/ListByteWrapper.cs

I haven't written one for ArraySegment as I'm running out of time. I'm planning to get a NuGet release up soon.


invertedtomato commented 5 years ago

To wrap this up, I've just finished porting all of the old-school Elias and Thompson algorithms over to this new interface. Thanks @vpenades

vpenades commented 5 years ago

Thanks!

As a use-case example, I'm using this byte reader:

class ArrayReader : InvertedTomato.Compression.Integers.Wave3.IByteReader
{
    public ArrayReader(Byte[] array)
    {
        Bytes = new ArraySegment<Byte>(array);
    }

    public ArrayReader(ArraySegment<Byte> array)
    {
        Bytes = array;
    }

    private ArraySegment<Byte> Bytes;

    public byte ReadByte()
    {
        var value = Bytes.Array[Bytes.Offset];

        // Advance the pointer
        Bytes = new ArraySegment<Byte>(Bytes.Array, Bytes.Offset + 1, Bytes.Count - 1);

        return value;
    }
}
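To round out that use case, a writer counterpart could mirror the same pointer-advancing pattern. This is a hypothetical sketch, not taken from the library: it assumes a matching writer surface with a single void WriteByte(Byte value) method, and writes into a caller-supplied, fixed-size buffer.

```csharp
using System;

// Hypothetical counterpart to the ArrayReader above: writes into a
// caller-supplied buffer and advances an ArraySegment "pointer".
class ArrayWriter
{
    private ArraySegment<Byte> Bytes;

    public ArrayWriter(Byte[] array)
    {
        Bytes = new ArraySegment<Byte>(array);
    }

    public int BytesWritten { get; private set; }

    public void WriteByte(Byte value)
    {
        if (Bytes.Count == 0)
            throw new InvalidOperationException("Buffer is full.");

        Bytes.Array[Bytes.Offset] = value;

        // Advance the pointer
        Bytes = new ArraySegment<Byte>(Bytes.Array, Bytes.Offset + 1, Bytes.Count - 1);
        BytesWritten++;
    }
}
```

Unlike the List-based wrapper, this variant requires the caller to size the buffer up front, which is the trade-off discussed earlier in the thread.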