Closed vpenades closed 5 years ago
Thanks for reaching out Vicente! I really appreciate your thoughts. I'm very interested in working towards a more performant solution (and this library is due for a refresh anyway!).
The problem I've had with void Write(IList
Have you seen/produced any performance research on MemoryStream, particularly with an initial capacity set? I've operated under the brazen assumption that it should be particularly performant, but I'm glad to have that assumption challenged.
On Wed, 20 Feb 2019 at 01:11, Vicente Penades notifications@github.com wrote:
Hi, I've just started using your library, as you said, Fibonacci is great!, I've got a 30% improvement over my VarInt implementation.
My concern is about the codec API.
The IUnsignedCompressor and IUnsignedDecompressor are very clean and very straightforward... but they rely on System.IO.Stream, which has a lot of internal overhead, and is very slow for many operations.
For serialization and deserialization, in my experince, it is faster to just read or write the full binary blob of a file, and do the encoding/decoding job in memory. Yes, there's MemoryStream, but in the end it's just an extra layer over a plain Byte[] array. Additionally, there's now a lot of new toys in c# like ArraySegment or even Span that allow for very fast array processing.
Lately, I've been changing my serialization APIs to look from this:
void Write(Stream s, int v);int Read(Stream s);
to something like this:
void Write(IList
, int v);ArraySegment Read(ArraySegment ptr, out int v); // returns the advanced ptr When writing, I write to a List which is much easier to manipulate than a Stream, like, for example writing some bytes and the editing the header with the bytelength. and then it's easy to write the whole list to an array.
For reading, I read all the bytes to a plan Byte[] array, and I use ArraySegment as a sorts of pointer reading.
But if that's too extreme, maybe something like this could do:
interface IStream { void WriteByte(Byte value); Byte ReadByte(); }
As a replacement of System.IO.Stream , so developers could roll their own reading/writing mechanisms
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/invertedtomato/integer-compression/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq6OS4KHCNrnrmBFVdMgnURO82Npt_Hks5vPBQTgaJpZM4bDLfk .
--
*Ben *Thompson
+61 4 1121 5410
When I use the pattern of void Write(IList<Byte>, int v)
I don't need to preallocate the buffer.... writes to the list are done with just a list.Add(byte value)
which grows the buffer internally.
Keep in mind that a MemoryStream in write mode operates in essentially the same way as a List
But the difference is that List
but I think the ideal solution would be to use interface IStream { void WriteByte(Byte value); Byte ReadByte(); }
since it can wrap whatever you want... a List
I haven't done any research on MemoryStream, but I've seen the internal implemantation, and it's much more complex than the implementation of List<Byte>.Add(Byte value);
I've been pondering this one over the last day and came to a rather pragmatic conclusion - just do some tests. I wrote the following:
static void Main(string[] args) { var bytes = 100000000; for (var itr = 0; itr < 3; itr++) { var stopwatch = Stopwatch.StartNew();
// Populate list
var list = new List<byte>();
for (var i = 0; i < bytes; i++) {
list.Add(0x00);
}
// Flatten for output
var listoutput = list.ToArray();
Console.WriteLine($"List {stopwatch.ElapsedMilliseconds}ms");
stopwatch = Stopwatch.StartNew();
using (var stream = new MemoryStream()) {
// Populate stream
for (var i = 0; i < bytes; i++) {
stream.WriteByte(0x00);
}
// Flatten for output
var streamoutput = stream.ToArray();
}
Console.WriteLine($"Stream {stopwatch.ElapsedMilliseconds}ms");
} }
And on my machine I'm getting the following results:
List 505ms Stream 873ms List 476ms Stream 801ms List 401ms Stream 768ms
Assuming my testing methodology is solid (correct me if you see a flaw in it), those results heartily back up your assertion! At least in terms of running efficiency. Perhaps still MemoryStream is more memory efficient, but that's not a focus for this type of compression.
Okay, I'm with you so far. So on to the implementation! I've no problem with this one - it's clear and to-the-point. Maybe it could return "bytes written", but the scenario where I needed that disappeared in a rewrite, so I'm happy without.
void Encode(IList
This one doesn't seem great though, mostly because there would need to be external converstion between an encode and decode.
ArraySegment
And I understand why you'd not rush to propose this one. There's a conversion process after reading.
ulong Decode(IList
So I checked what that overhead would be:
static void Main(string[] args) { for (var itr = 0; itr < 3; itr++) { var bytes = 100000000; var array = new byte[bytes];
var stopwatch = Stopwatch.StartNew();
var list = new List<byte>(array);
stopwatch.Stop();
Console.WriteLine($"{stopwatch.ElapsedMilliseconds}ms");
} }
On my machine that's converting 100MBish in around 125ms. It's not much but non-trivial.
175ms 125ms 142ms
And so I end up back with your original proposal. Tweaked a little like this:
void Encode(IByteReaderWriter buffer, ulong value); ulong Decode(IByteReaderWriter buffer);
interface IByteReaderWriter { void WriteByte(Byte value); Byte ReadByte(); }
Give me some time to double check my thoughts, but it's looking like that refresh is in order sooner rather than later!
On Wed, 20 Feb 2019 at 08:51, Vicente Penades notifications@github.com wrote:
When I use the pattern of void Write(IList
, int v) I don't need to preallocate the buffer.... writes to the list are done with just a list.Add(byte value) which grows the buffer internally. Keep in mind that a MemoryStream in write mode operates in essentially the same way as a List; it has an internal byte buffer that keeps growing as you write to the stream.
But the difference is that List has a pretty simple API, and much less internal overhead, no virtual methods, etc. in the other hand, MemoryStream overlads Stream, so it does a lot of additional work internally.
but I think the ideal solution would be to use interface IStream { void WriteByte(Byte value); Byte ReadByte(); } since it can wrap whatever you want... a List, a Stream, etc
I haven't done any research on MemoryStream, but I've seen the internal implemantation, and it's much more complex of the implementation of List
.Add(Byte value); MemoryStream.WriteByte source code https://referencesource.microsoft.com/#mscorlib/system/io/memorystream.cs,634
List < T > .Add(T value source code https://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs,220
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/invertedtomato/integer-compression/issues/2#issuecomment-465344005, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq6OdjhnSydLa5qL1MAr-b7jgnDafuyks5vPH_lgaJpZM4bDLfk .
--
*Ben *Thompson
+61 4 1121 5410
Yes, I think using IByteReaderWriter
interface is the best solution.
One of the biggest benefits of List
That's why I'm using my own implementation of List<> , that exposes the internal allocation array, so I can stream it directly to a file.
If forgot to mention there's yet another implementation for reading/writing that's even more "abstract" than an interface, which is to use Lambdas:
void Encode( Action<Byte> writeCallback, ulong value);
ulong void Decode( Func<Byte> readCallback);
With such a low level interface, it is very easy to implement any kind of reader and writer, for example:
var buffer = new List<Byte>();
codec.Encode(v => buffer.Add(v), 17);
using(var m = new MemoryStream())
{
codec.Decode( ()=> m.ReadByte() );
}
I usually don't use these methods directly, but whenever I have a complex byte streaming method, and I require to use it both with MemoryStream, and direct memory access, I implement low level methods with lambdas, so they can be easily wrapped with any streaming container.
It just happens I've been exploring all sorts of alternatives to streams on binary serialization/deserialization :)
Alrighty, check this out for size: https://github.com/invertedtomato/integer-compression/blob/master/Library/Compression/Integers/Wave3/FibonacciCodec.cs
I also popped in a wrapper for IList https://github.com/invertedtomato/integer-compression/blob/master/Library/Compression/Integers/Wave3/ListByteWrapper.cs
I haven't written one for ArraySegment as I'm running out of time. I'm planning to get a NuGet release up soon.
On Thu, 21 Feb 2019 at 18:52, Vicente Penades notifications@github.com wrote:
Yes, I think using IByteReaderWriter interface is the best solution.
One of the biggest benefits of List is that it is very fast to "write" to it... but as you mentioned, it has the side effect of requiring copying the content to the actual stream afterwards.
That's why I'm using my own implementation of List<> , that exposes the internal allocation array, so I can stream it directly to a file.
If forgot to mention there's yet another implementation for reading/writing that's even more "abstract" than an interface, which is to use Lambdas:
void Encode( Action
writeCallback, ulong value);ulong void Decode( Func readCallback); With such a low level interface, it is very easy to implement any kind of reader and writer, for example:
var buffer = new List
();codec.Encode(v => buffer.Add(v), 17); using(var m = new MemoryStream()) {codec.Decode( ()=> m.ReadByte() ); } I usually don't use these methods directly, but whenever I have a complex byte streaming method, and I require to use it both with MemoryStream, and direct memory access, I implement low level methods with lambdas, so they can be easily wrapped with any streaming container.
It just happens I've been exploring all sorts of alternatives to streams on binary serialization/deserialization :)
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/invertedtomato/integer-compression/issues/2#issuecomment-465914624, or mute the thread https://github.com/notifications/unsubscribe-auth/AAq6OWaKgh2ylGhSh_Qe5Tl5AJQzD_W-ks5vPl5UgaJpZM4bDLfk .
--
*Ben *Thompson
+61 4 1121 5410
To wrap this up, I've just finished porting all of the old-school Elias and Thompson algorithms over to this new interface. Thanks @vpenades
Thanks!
As an use case example, I'm using this byte reader:
class ArrayReader : InvertedTomato.Compression.Integers.Wave3.IByteReader
{
public ArrayReader(Byte[] array)
{
Bytes = new ArraySegment<Byte>(array);
}
public ArrayReader(ArraySegment<Byte> array)
{
Bytes = array;
}
private ArraySegment<Byte> Bytes;
public byte ReadByte()
{
var value = Bytes.Array[Bytes.Offset];
// advance pointer
Bytes = new ArraySegment<Byte>(Bytes.Array,Bytes.Offset+1,Bytes.Count-1);
return value;
}
}
Hi, I've just started using your library, as you said, Fibonacci is great!, I've got a 30% improvement over my VarInt implementation.
My concern is about the codec API.
The IUnsignedCompressor and IUnsignedDecompressor are very clean and very straightforward... but they rely on System.IO.Stream, which has a lot of internal overhead, and is very slow for many operations.
For serialization and deserialization, in my experince, it is faster to just read or write the full binary blob of a file, and do the encoding/decoding job in memory. Yes, there's MemoryStream, but in the end it's just an extra layer over a plain or even Span that allow for very fast array processing.
Byte[]
array. Additionally, there's now a lot of new toys in c# like ArraySegmentLately, I've been changing my serialization APIs to look from this:
to something like this:
When writing, I write to a List which is much easier to manipulate than a Stream, like, for example writing some bytes and the editing the header with the bytelength. and then it's easy to write the whole list to an array.
For reading, I read all the bytes to a plan Byte[] array, and I use ArraySegment as a sorts of pointer reading.
But if that's too extreme, maybe something like this could do:
As a replacement of System.IO.Stream , so developers could roll their own reading/writing mechanisms