serialization options - Githubissues

Jacknq commented 2 years ago

Hi, I just brainstorming and I ve few questions.. im comparing your serialization options with other libraries. I ve data 117KB, indexed (WithObjectTokenization 3fields) and serialized it binary, writen to file 193KB. Question is could it get even smaller?

and could we serialize it to json? if try JsonSerializer.Serialize it fails with

System.ReadOnlyMemory1[System.Char] is invalid for serialization or deserialization because 
it is a pointer type, is a ref struct, or contains generic parameters that have not been replaced
by specific types.

and last fundamental question would be, if index is always bigger than data and loaded fully to memory or only some part of it (thinking here about future option paging the index and search results - iterating, to not load all of them)

regards

mikegoatly commented 2 years ago

Hi @Jacknq, would it be possible for you to share an example of you code, and some of the data that you're serializing?

To clear up some possible confusion - the serialized version of the index isn't just a dump of the memory structures to disk, it's written out by the BinarySerializer class (The layout format is here). Trying to serialize an index snapshot to JSON with a JsonSerializer will definitely not work (as you've found). There's no reason why you couldn't create an implementation of IIndexSerializer that works through the index data structures and writes out a JSON representation of it, but I would be very surprised if it was any more concise than the binary formatted version. You'd also need to implement the IIndexDeserializer counterpart that's able to parse the JSON back into the relevant structures.

I've just done a quick test on the Wikipedia sample data that's used in some of the unit tests; the raw data is ~6.9Mb of XML and the serialized index comes out at 3.3Mb.

There's probably some optimizations that could be made to the size of the serialized index. For example, there are a number of places where Int32s are written out, and more often than not only the first byte will take up any data. It would be theoretically possible to be smarter about writing this data out, but it does come at the cost of complexity.

Jacknq commented 2 years ago

Ok, in my case it was simple article data - title, content.

I not up to implement serializer, I thought that would be easy. Im generally testing similar libraries, how they solve those things, some do json serialize or get to smaller index size. Also interesting - efficient data iteration and loading minimal data to memory, when dealing with large data.

mikegoatly commented 2 years ago

@Jacknq I'd love to see what the results of your research for the serialized index size of other libraries, if you're willing to share them.

The current version of LIFTI is by design an in-memory index, so having data pages written to disk with only parts loaded into memory is explicitly called out as out of scope. That's not to say it couldn't evolve into that, but it adds a significant amount of complexity and increased potential for the backing store to get corrupted if the host process ends abnormally while its being written to.

Jacknq commented 2 years ago

I dont have any public research or benchmark. I do look around and learn, trying to undestand principles.. Performance is only one side of the hole coin. I love the idea coming from nosql world like litedb and other projects, jumping through the file while not loading all into memory. I like the idea of lunr core of serializing index into json fast (not marginal faster, api looks more general ). I look around other index implentations like bleeve writen in go, searching if there is such thing in .net. There re not many projects on this theme, some are doing quite well. Since Im new to index problematic, I would encourage you try and test them, share your views. And possibly improve. Thanks for your lib.

mikegoatly / lifti

serialization options #51