mbdavid / LiteDB

LiteDB - A .NET NoSQL Document Store in a single data file
http://www.litedb.org
MIT License
8.36k stars 1.22k forks

[QUESTION] Best way to store hashes? #2443

Open dylanstreb opened 3 months ago

dylanstreb commented 3 months ago

I'm currently storing hashes in a LiteDB database for use as a file cache. I'm planning on expanding it, so I thought I'd revisit the storage. Right now I'm just using Base64-encoded strings; I figured that switching to raw bytes would be more efficient.

So I wrote a quick test program to find out. I generated 1,000,000 integers, hashed them, and used a simple stopwatch to compare strings against byte[]. I'm using MurmurHash for this; the output is 128 bits. The models are simple:

    public class StringModel
    {
        [BsonField]
        [BsonId]
        public int Id { get; set; }

        [BsonField]
        public string? Data { get; set; }
    }

    public class ByteModel
    {
        [BsonField]
        [BsonId]
        public int Id { get; set; }

        [BsonField]
        public byte[]? Data { get; set; }
    }
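For reference, the hashing step can be reproduced without MurmurHash. This sketch uses MD5 as a stand-in (not the hash from the original test, but also 128-bit output), just to show the shape of the data being stored:

```csharp
using System;
using System.Security.Cryptography;

static class HashGen
{
    // Produces a 16-byte (128-bit) hash for an integer key. MD5 is
    // used only as a stand-in here - MurmurHash is not in the .NET
    // BCL - but both yield 128-bit output, so the stored payload has
    // the same shape.
    public static byte[] HashInt(int i) => MD5.HashData(BitConverter.GetBytes(i));
}

class Demo
{
    static void Main()
    {
        byte[] hash = HashGen.HashInt(42);
        Console.WriteLine(hash.Length); // 16 - stored either as byte[] or as Base64
        Console.WriteLine(Convert.ToBase64String(hash));
    }
}
```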

The results were not what I expected:

Took: 18.4086132  byte hashes
Took: 16.8086536  base64 hashes
Bytes database filesize: 122028032
Strings database filesize: 133980160

This is just from doing an InsertBulk on the models. Inserting bytes took longer, which wasn't what I expected. The database is smaller, so I assume it isn't doing any kind of binary-to-hex conversion for storage, but I would naively have assumed that the smaller byte records would also insert faster.

Is there a better way to store small binary data in LiteDB? I'm more concerned about speed than file size, so should I leave it as a Base64 string? Or am I setting up the models incorrectly in some way?
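(As a side note on the size gap above: it is at least consistent with plain Base64 expansion - every 3 input bytes encode to 4 output characters, so each 16-byte hash becomes a 24-character string, 8 extra bytes per record before any per-field string overhead. A minimal check:)

```csharp
using System;

class Base64Overhead
{
    static void Main()
    {
        byte[] hash = new byte[16]; // a 128-bit hash

        // Base64 encodes every 3 input bytes as 4 output characters
        // (with padding), so 16 bytes become 24 characters.
        string encoded = Convert.ToBase64String(hash);
        Console.WriteLine(encoded.Length); // 24
    }
}
```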

azureskydiver commented 3 months ago

v4 had FileStorage available which is presumably better for storing binary blobs: https://github.com/mbdavid/LiteDB/wiki/FileStorage

dylanstreb commented 3 months ago

File storage appears to be aimed at large files. This is for numerous small blobs - and since they're of fixed, known length, it should in theory be possible to optimize for that. Doing it manually, i.e. by splitting a 128-bit hash into two 64-bit ints, doesn't help.

Storing the data as a string does seem to be the best option. I'm guessing there are just more optimizations out there (in C# and/or in LiteDB) for string processing than for byte[] processing, and that's what improves the performance.

azureskydiver commented 3 months ago

Good point. I read "128 KB", not "128 bits". Sorry.

azureskydiver commented 3 months ago

Perhaps I'm testing the wrong way, but I'm getting roughly the same file sizes and insertion times (except for the Base64 string approach) with the following:

using System.Diagnostics;
using LiteDB;

namespace TestLiteDb128
{
    interface IModel
    {
        [BsonIgnore]
        byte[]? Value { get; set; }
    }

    class Base64Model : IModel
    {
        public int Id { get; set; }
        public string? v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v == null ? new byte[16] : Convert.FromBase64String(v);

            set
            {
                Debug.Assert(value?.Length == 16);
                v = value == null ? Convert.ToBase64String(new byte[16])
                                  : Convert.ToBase64String(value);
            }
        }
    }

    class ByteModel : IModel
    {
        public int Id { get; set; }
        public byte[]? v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v;
            set
            {
                Debug.Assert(value?.Length == 16);
                v = new byte[16];
                if (value != null)
                    Array.Copy(value, v, value.Length);
            }
        }
    }

    class LongModel : IModel
    {
        public int Id { get; set; }
        public long lv { get; set; }
        public long hv { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get
            {
                var low = BitConverter.GetBytes(lv);
                var high = BitConverter.GetBytes(hv);
                var ret = new byte[16];
                Array.Copy(low, ret, 8);
                Array.Copy(high, 0, ret, 8, 8);
                return ret;
            }

            set
            {
                Debug.Assert(value?.Length == 16);
                if (value == null)
                {
                    lv = 0;
                    hv = 0;
                }
                else
                {
                    lv = BitConverter.ToInt64(value, 0);
                    hv = BitConverter.ToInt64(value, 8);
                }
            }
        }
    }

    class GuidModel : IModel
    {
        public int Id { get; set; }
        public Guid v { get; set; }

        [BsonIgnore]
        public byte[]? Value
        {
            get => v.ToByteArray();

            set
            {
                Debug.Assert(value?.Length == 16);
                v = value == null ? Guid.Empty : new Guid(value);
            }
        }
    }

    class Program
    {
        static IEnumerable<T> Generate<T>(int count) where T : IModel, new()
        {
            var value = new byte[16];

            for (long i = 0; i < count; i++)
            {
                var low = BitConverter.GetBytes(i);
                Array.Copy(low, value, low.Length);
                T model = new T();
                model.Value = value;
                yield return model;
            }
        }

        static void Test<T>(string filename) where T : IModel, new()
        {
            var stopwatch = new Stopwatch();

            Console.WriteLine($"Generating items for {filename}...");
            stopwatch.Start();
            var items = Generate<T>(1000).ToList();
            stopwatch.Stop();
            Console.WriteLine($"Generated items in {stopwatch.Elapsed}");

            if (File.Exists(filename))
                File.Delete(filename);

            Console.WriteLine($"Filling {filename} ...");
            stopwatch.Reset();
            stopwatch.Start();
            using (var db = new LiteDatabase(filename))
            {
                var col = db.GetCollection<T>();

                foreach(var item in items)
                    col.Insert(item);
            }
            stopwatch.Stop();
            Console.WriteLine($"Filled {filename} in {stopwatch.Elapsed}");
        }

        static void Main(string[] args)
        {
            Test<Base64Model>("Base64Model.db");
            Test<ByteModel>("ByteModel.db");
            Test<LongModel>("LongModel.db");
            Test<GuidModel>("GuidModel.db");
        }
    }
}

(Yes, I know using Benchmark would have been better, but I was just trying to get a rough feel for times and sizes.)

dylanstreb commented 3 months ago

For my test using two longs, I made a struct instead of putting the longs directly into the model. I'm guessing that's the difference.
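A sketch of what that struct layout might look like - the names (Hash128, StructModel) here are illustrative, not from the original test code:

```csharp
using System;

// Illustrative sketch of a two-longs-in-a-struct layout; not the
// original test code.
public struct Hash128
{
    public long Low { get; set; }
    public long High { get; set; }

    public static Hash128 FromBytes(byte[] value) => new Hash128
    {
        Low = BitConverter.ToInt64(value, 0),
        High = BitConverter.ToInt64(value, 8),
    };

    public byte[] ToBytes()
    {
        var ret = new byte[16];
        Array.Copy(BitConverter.GetBytes(Low), 0, ret, 0, 8);
        Array.Copy(BitConverter.GetBytes(High), 0, ret, 8, 8);
        return ret;
    }
}

public class StructModel
{
    public int Id { get; set; }

    // LiteDB's mapper serializes a nested object as an embedded
    // sub-document, unlike the two flat long fields in LongModel.
    public Hash128 Data { get; set; }
}
```

If that is the setup, the embedded sub-document carries extra per-record overhead compared with two flat long fields, which could plausibly account for the difference.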

GUID I hadn't even considered trying.