Open brendanashworth opened 4 years ago
Memory is allocated as though all fields will be filled for every object
This isn't always the case. With the current design all field names for a collection are put into a list in the order they are set. Then each object that uses at least one of the fields in the list will allocate only the space to hold all fields up to the last field in the list needed by the object.
For example, starting with an empty collection.
SET fleet truck1 FIELD speed 45 POINT ...
SET fleet truck2 FIELD direction 127 POINT ...
SET fleet truck3 FIELD age 36 POINT ...
SET fleet truck4 FIELD direction 95 FIELD speed 31 POINT ...
SET fleet truck5 POINT ...
This will ensure that the collection has a list of fields speed,direction,age
, but each object will not necessarily need space for all three fields.
The space used for each object are:
truck1
uses 8 bytes (1 field * sizeof(float64)). The only field is the first field defined in the collection.
speed
45
truck2
uses 16 bytes. It only uses one field, but it's the second one defined in the collection so it needs to allocate space for speed
to reach direction
speed direction
0 127
truck3
uses 24 bytes. Same rules as truck2
.
speed direction age
0 0 36
truck4
uses 16 bytes because it doesn't need anything past direction.
speed direction
31 95
truck5
uses zero bytes because it doesn't use any fields
So in some cases, an object will use less space (not fully saturated) than the total number of fields.
so effectively there's an immense memory waste in my use case.
I feel your pain. As you have noticed, the design isn't ideal when there are many sparse fields.
In your case, you want fields that are most densely used first and the least used last. This provides performance gains due to lower memory usage and better locality. One little trick to ensure that your collection fields are set up that way is to set a placeholder STRING object as your very first object in the collection. For example:
SET fleet field-layout FIELD direction 1 FIELD speed 1 FIELD age 1 STRING placeholder
This will create a placeholder string object which defines the order of the fields, with direction
being the first and age
being the last. But ymmv.
I have an implementation of a different field storage that's far more efficient in the sparse use case ... Would a pull request with these changes be supported?
Sure, I would consider an additional storage model for fields. If it's as good as you say then I think others could benefit too. I would require that it's transparent to the user, thus does not introduce any new features, and that existing "saturated" model stays put. Perhaps with a --sparse-fields
flag or such.
Hi, Thanks for the application, it's really great. For my use case of tile38, I end up with a large number of unique fields set in the same key, where many of those objects don't have the field set (say, 20% of unique fields are popular, whereas 80% of fields are sparse). The current tile38 design for field storage is optimized exclusively for the case where all fields are set. Memory is allocated as though all fields will be filled for every object, so effectively there's an immense memory waste in my use case. Additionally, the default field value of 0 is particularly intuitive for these kinds of queries. It makes querying the set of objects where the field may not be present a pain. I'd like to inquire about whether or not there's interest in supporting the sparse field use case from a performance perspective. I have an implementation of a different field storage that's far more efficient in the sparse use case, but I think there's an inherent trade off between having performance in the sparse case and the saturated case. You must add overhead for sparse fields that would reduce performance in the saturated case. From the benchmarks I've run, performance is order/s of magnitude better for sparse fields, and perhaps ~10% slower for saturated fields. Would a pull request with these changes be supported? Note: the benchmarks I ran were specifically for fields (I think
collection.Set
with many fields), and if I were to guess, the limiting performance factor is elsewhere on queries.