nostr-protocol / nips

Nostr Implementation Possibilities

nip45: add hyperloglog relay response #1561

Open fiatjaf opened 2 weeks ago

fiatjaf commented 2 weeks ago

Here's a nice colorful video explanation of HyperLogLog: https://www.youtube.com/watch?v=lJYufx0bfpw And here's a very interesting article with explanations, graphs and other stuff: http://antirez.com/news/75

If relays implement this we can finally get follower counts that do not suck and without having to use a single relay (aka relay.nostr.band) as the global source of truth for the entire network -- at the same time as we save the world by consuming an incomparably small fraction of the bandwidth.

Even if one were to download just 2 reaction events in order to display a silly reaction count number in a UI, that would already use more bytes than this HLL value does (in fact, considering deflate compression, the COUNT response with the HLL value is already smaller than a single reaction EVENT response).

This requires trusting relays to not lie about the counts and the HLL values, but this NIP always required that anyway, so no change there.


HyperLogLog can be implemented in multiple ways, with different parameters and whatnot. Luckily most of the customizations (for example, the differences between HyperLogLog++ and HyperLogLog) can be applied at the final step, so that is a client choice. This NIP only describes the part that is needed for interoperability, which is how relays should compute the values and then return them to clients.

Because implementations would have to agree on parameters such as the number of registers to use, this NIP also fixes that number at 256 for simplicity's sake (it makes implementations simpler since a register index fits in a single byte) and also because it is a reasonable amount.
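
Here is a minimal sketch of what the relay side could look like with these parameters; the SHA-256-based derivation and the hex serialization are only illustrations, since the NIP text is what has to pin down the exact rules so that sketches from different relays can be combined by clients:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"math/bits"
)

// hll is a fixed-size HyperLogLog sketch with the 256 registers fixed by this NIP.
type hll struct {
	registers [256]uint8
}

// add hashes an item (e.g. the pubkey of a reaction author) and updates the
// corresponding register. SHA-256 is used here only as an illustration; the
// NIP has to define the exact derivation for interoperability.
func (h *hll) add(item []byte) {
	sum := sha256.Sum256(item)
	reg := sum[0]                                // first byte picks one of the 256 registers
	rest := binary.BigEndian.Uint64(sum[1:9])    // next 8 bytes feed the rank
	rank := uint8(bits.LeadingZeros64(rest)) + 1 // position of the first 1-bit, 1-based
	if rank > h.registers[reg] {
		h.registers[reg] = rank
	}
}

// serialize returns the 256 register bytes as lowercase hex, which is one way
// a relay could ship the sketch inside a COUNT response.
func (h *hll) serialize() string {
	return hex.EncodeToString(h.registers[:])
}

func main() {
	var h hll
	h.add([]byte("pubkey-1"))
	h.add([]byte("pubkey-2"))
	fmt.Println(h.serialize()) // 512 hex characters = 256 bytes
}
```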


These are some random estimations I did, to showcase how efficient those 256 bytes can be:

| real count | estimation |
|-----------:|-----------:|
| 2 | 2 |
| 4 | 4 |
| 6 | 6 |
| 7 | 7 |
| 12 | 12 |
| 15 | 15 |
| 22 | 20 |
| 36 | 36 |
| 44 | 43 |
| 47 | 44 |
| 64 | 65 |
| 77 | 72 |
| 89 | 88 |
| 95 | 93 |
| 104 | 101 |
| 116 | 113 |
| 122 | 131 |
| 144 | 145 |
| 150 | 154 |
| 199 | 196 |
| 300 | 282 |
| 350 | 371 |
| 400 | 428 |
| 500 | 468 |
| 600 | 595 |
| 777 | 848 |
| 922 | 922 |
| 1000 | 978 |
| 1500 | 1599 |
| 2222 | 2361 |
| 9999 | 10650 |
| 13600 | 13528 |
| 80000 | 73439 |
| 133333 | 135973 |
| 200000 | 189470 |

As you can see, the estimates are almost perfect for small counts and still pretty good for very large counts.
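
For reference, this is roughly what the client-side estimation step looks like with the classic estimator and the usual small-range correction; refinements like the HyperLogLog++ bias correction operate on the same registers, which is why the NIP can leave this entirely to clients:

```go
package main

import (
	"fmt"
	"math"
)

// estimate computes the classic HyperLogLog cardinality estimate from 256
// registers, with the usual small-range (linear counting) correction.
func estimate(registers [256]byte) float64 {
	const m = 256.0
	alpha := 0.7213 / (1 + 1.079/m) // standard constant for m >= 128

	sum := 0.0
	zeros := 0
	for _, r := range registers {
		sum += math.Pow(2, -float64(r))
		if r == 0 {
			zeros++
		}
	}
	est := alpha * m * m / sum

	// With few distinct items most registers are still zero, and linear
	// counting is essentially exact -- hence the near-perfect small counts.
	if est <= 2.5*m && zeros > 0 {
		est = m * math.Log(m/float64(zeros))
	}
	return est
}

func main() {
	var regs [256]byte // an all-zero sketch estimates 0
	fmt.Println(estimate(regs))
}
```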

Semisol commented 7 hours ago

Storing reactions is not a big issue: dedicated indexes and encoding schemes can be made for high-volume optimizable events, which means you can store about 5M reactions in 1GB. This does not even include other methods you could use, such as public key lookup tables, that can cut the size even further.
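
To put rough numbers on that: 1GB over 5M reactions is about 200 bytes per reaction, which a fixed-width record plus a pubkey lookup table fits with room to spare. A purely hypothetical layout (nothing here is specified anywhere):

```go
package main

import (
	"fmt"
	"unsafe"
)

// reactionRecord is a hypothetical fixed-width record for a stored reaction,
// using a relay-local pubkey lookup table so the 32-byte author key shrinks
// to a 4-byte index. It only illustrates the order of magnitude.
type reactionRecord struct {
	TargetEventID [32]byte // the "e"-tagged event being reacted to
	AuthorIndex   uint32   // index into the relay's pubkey table
	CreatedAt     uint32   // unix timestamp, seconds
	Content       [4]byte  // "+", "-", or a short emoji
}

func main() {
	fmt.Println(unsafe.Sizeof(reactionRecord{})) // 44 bytes per record
	// 1GB / 5M reactions is ~214 bytes each, so even with indexes and the
	// pubkey table itself there is plenty of headroom.
}
```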

Relays may also implement sampling and adjust the HLL result accordingly (as it is likely that the relay will know a rough count of the number of events it may have to explore): you don't need to add every event to the HLL sketch, only some, and then add a correction factor to every register.
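
As an illustration of the sampling part only (this is not specified anywhere, and in this variant the correction is applied to the final estimate rather than per register): hash-threshold sampling admits a deterministic fraction of distinct items into the sketch, and the client scales the resulting estimate back up.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
)

// sampledAdd only admits items whose hash falls below the sampling threshold.
// Because the hash is uniform per distinct item, roughly a `rate` fraction of
// distinct items reach the sketch, so the final estimate is divided by `rate`.
func sampledAdd(addToSketch func([]byte), item []byte, rate float64) {
	sum := sha256.Sum256(item)
	v := binary.BigEndian.Uint64(sum[:8])
	if float64(v) < rate*math.MaxUint64 {
		addToSketch(sum[:]) // reuse the hash as the sketch input
	}
}

func main() {
	seen := 0
	for i := 0; i < 1000; i++ {
		item := []byte(fmt.Sprintf("pubkey-%d", i))
		sampledAdd(func([]byte) { seen++ }, item, 0.1)
	}
	fmt.Println(seen) // roughly 100 of the 1000 distinct items
}
```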

fiatjaf commented 4 hours ago

> dedicated indexes and encoding schemes can be made for high-volume optimizable events

HLL is exactly such a scheme already, and deciding what goes into it and what is ignored is the question I posed above: if you want limited functionality for specific use cases, then HLL caching can be very good; otherwise nothing is possible.

fiatjaf commented 4 hours ago

Anyway, I think one solution is to define HLL to be returned ONLY in the following queries (exact templates), at least for now:

* `{"#e":["<anchor>"],"kinds":[7]}`
* `{"#p":["<anchor>"],"kinds":[3]}`

All other queries should not return HLL responses.

And then whenever someone has another use case we add it to the list.

In the queries above it is declared that `<anchor>` will be used to determine how to produce the HLL value for each related event deterministically.
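
For concreteness, a COUNT round trip for the first template could then look like this, assuming the 256 registers are carried as a hex-encoded `hll` field next to the existing NIP-45 count (the exact field name and the count shown here are only illustrative):

```json
["COUNT", "sub1", {"#e": ["<anchor>"], "kinds": [7]}]
["COUNT", "sub1", {"count": 9999, "hll": "<256 registers as 512 hex characters>"}]
```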

Semisol commented 31 minutes ago

> Anyway, I think one solution is to define HLL to be returned ONLY in the following queries (exact templates), at least for now:
>
> * `{"#e":["<anchor>"],"kinds":[7]}`
> * `{"#p":["<anchor>"],"kinds":[3]}`
>
> All other queries should not return HLL responses.
>
> And then whenever someone has another use case we add it to the list.
>
> In the queries above it is declared that `<anchor>` will be used to determine how to produce the HLL value for each related event deterministically.

You have not solved the problem that this is open to manipulation.

Semisol commented 30 minutes ago

> otherwise nothing is possible.

A lot of things are possible.