nostr-protocol / nips

Nostr Implementation Possibilities

nip45: add hyperloglog relay response #1561

Open fiatjaf opened 2 weeks ago

fiatjaf commented 2 weeks ago

Here's a nice colorful video explanation of HyperLogLog: https://www.youtube.com/watch?v=lJYufx0bfpw And here's a very interesting article with explanations, graphs and other stuff: http://antirez.com/news/75

If relays implement this we can finally get follower counts that do not suck and without having to use a single relay (aka relay.nostr.band) as the global source of truth for the entire network -- at the same time as we save the world by consuming an incomparably small fraction of the bandwidth.

Even downloading just 2 reaction events in order to display a silly reaction count number in a UI would already use more bytes than this HLL value does (in fact, considering deflate compression, the COUNT response with the HLL value is already smaller than a single reaction EVENT response).

This requires trusting relays to not lie about the counts and the HLL values, but this NIP always required that anyway, so no change there.


HyperLogLog can be implemented in multiple ways, with different parameters and whatnot. Luckily most of the customizations (for example, the differences between HyperLogLog++ and HyperLogLog) can be applied at the final step, so they are a client choice. This NIP only describes the part that is needed for interoperability: how relays should compute the values and then return them to clients.

Because implementations would have to agree on parameters such as the number of registers to use, this NIP also fixes that number at 256 for simplicity's sake (it makes implementations simpler since 256 is the maximum value of one byte) and also because it is a reasonable amount.
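The per-event register update can be sketched roughly as follows. This is an illustrative Go version, not the NIP's exact wording: the fixed offset of 8 and the helper name are my assumptions; the shape of the computation (one hash byte selects one of 256 registers, the register keeps the highest "position of the first 1 bit" seen) is the part being standardized.

```go
package main

import "fmt"

// addToHLL updates a 256-register HyperLogLog sketch with one item hash.
// One byte at the given offset selects the register (hence exactly 256
// registers), and the register stores the maximum position of the first
// 1 bit observed in the bytes that follow.
func addToHLL(registers *[256]uint8, hash [32]byte, offset int) {
	idx := hash[offset] // one byte selects the register: 0-255
	// position of the first 1 bit = count of leading zero bits, plus one
	rho := uint8(1)
	for _, b := range hash[offset+1:] {
		if b == 0 {
			rho += 8
			continue
		}
		for mask := uint8(0x80); mask != 0 && b&mask == 0; mask >>= 1 {
			rho++
		}
		break
	}
	if rho > registers[idx] {
		registers[idx] = rho
	}
}

func main() {
	var regs [256]uint8
	var h [32]byte
	h[8] = 42         // byte at offset 8 picks register 42
	h[9] = 0b00010000 // three leading zero bits -> rho = 4
	addToHLL(&regs, h, 8)
	fmt.Println(regs[42]) // 4
}
```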


These are some random estimations I did, to showcase how efficient those 256 bytes can be:

real count: 2 estimation: 2
real count: 4 estimation: 4
real count: 6 estimation: 6
real count: 7 estimation: 7
real count: 12 estimation: 12
real count: 15 estimation: 15
real count: 22 estimation: 20
real count: 36 estimation: 36
real count: 44 estimation: 43
real count: 47 estimation: 44
real count: 64 estimation: 65
real count: 77 estimation: 72
real count: 89 estimation: 88
real count: 95 estimation: 93
real count: 104 estimation: 101
real count: 116 estimation: 113
real count: 122 estimation: 131
real count: 144 estimation: 145
real count: 150 estimation: 154
real count: 199 estimation: 196
real count: 300 estimation: 282
real count: 350 estimation: 371
real count: 400 estimation: 428
real count: 500 estimation: 468
real count: 600 estimation: 595
real count: 777 estimation: 848
real count: 922 estimation: 922
real count: 1000 estimation: 978
real count: 1500 estimation: 1599
real count: 2222 estimation: 2361
real count: 9999 estimation: 10650
real count: 13600 estimation: 13528
real count: 80000 estimation: 73439
real count: 133333 estimation: 135973
real count: 200000 estimation: 189470

As you can see they are almost perfect for small counts, but still pretty good for giant counts.

fiatjaf commented 2 weeks ago

Implementation here: https://github.com/nbd-wtf/go-nostr/blob/4532e79d8e52d105888b41d2563c21c8104ea510/nip45/hll.go

alexgleason commented 2 weeks ago

Well, that's pretty mind-blowing. :exploding_head:

:thinking:

It weirdly incentivizes PoW on the middle of the ID. The relay is responsible for preventing that. Existing measures (WoT) might be enough for some relays. But the default behavior is to trust not just the relay operator but also the users of that relay not to PoW the middle of the ID.

alexgleason commented 2 weeks ago

Why shouldn't the relay just generate a random number every time instead of inspecting the event itself? Then it's not idempotent but it prevents abuse.

fiatjaf commented 2 weeks ago

You can't generate a random number because each distinct item must yield the same number, therefore you need a hash of the item. The event id is already that hash, but yes, people can do PoW to make their reaction count more or something like that.

I considered letting each relay pick a different random offset of the id to count from, which would mitigate this, but in my tests that often overshoots the results by a lot when merging from multiple different sources that have used different offsets.

One thing we can do is make all relays use the same offset for each query by deriving the offset from a hash of the <subscription_id>, for example (and then clients have to ensure they use the same subscription id for all relays they query). That is sad because it negates the possibility of having relays that only store an HLL value for each query and discard all events.

Another idea, maybe better, is to use a fixed offset like proposed here, but instead of using the event id, use the author pubkey. Although this makes it so some pubkeys will always have more weight than others when liking posts or following people, which is like creating a funny caste system within Nostr.

fiatjaf commented 2 weeks ago

OK, I think I got the solution: make the offset dependent on the filter, not the subscription id. Something like

Just checking for the first "#e", #p", "#a" or "#q" tag (in this order) will cover 99% of use cases, and if there is no unambiguous way to determine the offset, use a predefined hardcoded value.
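This filter-derived offset could be sketched roughly like so. Everything here is a hedged illustration of the idea rather than the exact NIP rule: the hardcoded fallback of 16, the choice of hex character 32, and the function name are all mine; only the tag priority order and the "skip the PoW-prone prefix" motivation come from the discussion.

```go
package main

import "fmt"

// hllOffsetForFilter derives a deterministic offset from the filter so that
// every relay answering the same query reads from the same position. It takes
// the first "#e", "#p", "#a" or "#q" value (in that order), reads one hex
// character past the prefix commonly targeted by generic PoW, and falls back
// to a hardcoded value if there is no unambiguous way to decide.
func hllOffsetForFilter(tags map[string][]string) int {
	const defaultOffset = 16 // hypothetical hardcoded fallback
	for _, key := range []string{"#e", "#p", "#a", "#q"} {
		vals := tags[key]
		if len(vals) == 0 || len(vals[0]) < 33 {
			continue
		}
		c := vals[0][32] // one hex character, past the PoW-prone prefix
		switch {
		case c >= '0' && c <= '9':
			return 8 + int(c-'0')
		case c >= 'a' && c <= 'f':
			return 8 + int(c-'a') + 10
		}
	}
	return defaultOffset
}

func main() {
	id := "00000000000000000000000000000000b0000000000000000000000000000000"
	fmt.Println(hllOffsetForFilter(map[string][]string{"#e": {id}})) // 19
}
```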

v0l commented 2 weeks ago

wow, nice!

vitorpamplona commented 2 weeks ago

Very cool! But I do think this is highly gameable because Nostr is so open and the ID is not random at all. Even with the latest filter-based offsets, it's possible to run a PoW procedure on expected filters and significantly change the count for that filter.

The procedure must use the same offset for all relays for the merging math to work.

How about a sha256(id + subID) with the instruction that the subscription ID must be random and the same for all counting relays?

fiatjaf commented 2 weeks ago

If you're using the subid there is no need to hash anything, you can just assume it will be random enough. If it's not that's the client's choice and their problem.

But I would still like something deterministic to allow for the storageless counter relay. I think the latest procedure plus using the event pubkey instead of event id is sufficient. People can't change their pubkey after the fact just to game one specific like count. And if the relay is going to accept events from completely new and random pubkeys then the counts from that relay are already borked and meaningless anyway.

mikedilger commented 2 weeks ago

This is very cool.

I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think.

mikedilger commented 2 weeks ago

Maybe something like this instead:

byte offset in pubkey = filterbytes[16] AND 31   (16th byte of the filter data, mapped to a number from 0-31).

EDIT: oops, I guess you can't have an offset that far in, or there are no zeroes to count. Anyhow, you get the idea.

People can't change their pubkey after the fact just to game one specific like count

True. But if someone manages to get a pubkey with a hell of a lot of zeroes, many things that they react to will seem to be highly reacted to.

I think one more step fixes this. Instead of counting zeroes in the pubkey directly, you count zeroes in the XOR of the pubkey and some hash that is generated from the filter query.

EDIT: and in this case we no longer need an offset. The first 8 bits of the hash are the bucket, the next 248 bits could be a count of zeroes but we really probably shouldn't bother counting past the next 32 bits... or 64 if we are bold. The hash could be sha256() of the filter somehow made reproducible (e.g. not just JSON), maybe the '#e' tag contents, but maybe other parts of the filter matter? limit should probably be discarded.
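A toy sketch of this XOR variant, under the layout described above (first 8 bits pick the bucket, only the next 32 bits are worth counting). The function name and the test values are illustrative; how the filter hash is made reproducible is left open, as in the comment.

```go
package main

import (
	"fmt"
	"math/bits"
)

// xorBucketAndRho mixes the author pubkey with a filter-derived hash before
// reading the register index and zero count, so pre-mined zeroes in a pubkey
// stop being useful: the attacker would need to know the filter hash too.
func xorBucketAndRho(pubkey, filterHash [32]byte) (idx uint8, rho uint8) {
	var mixed [32]byte
	for i := range pubkey {
		mixed[i] = pubkey[i] ^ filterHash[i]
	}
	idx = mixed[0] // first 8 bits pick one of 256 registers
	rho = 1        // position of the first 1 bit, per the 2007 paper
	for _, b := range mixed[1:5] { // don't bother counting past 32 bits
		if b == 0 {
			rho += 8
			continue
		}
		rho += uint8(bits.LeadingZeros8(b))
		break
	}
	return idx, rho
}

func main() {
	var pk, fh [32]byte
	pk[0], pk[1] = 0xff, 0xff // "mined" pubkey bits mean nothing after XOR
	fh[0], fh[1] = 0xf0, 0xfe
	idx, rho := xorBucketAndRho(pk, fh)
	fmt.Println(idx, rho) // 15 8
}
```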

vitorpamplona commented 2 weeks ago

If you're using the subid there is no need to hash anything, you can just assume it will be random enough. If it's not that's the client's choice and their problem.

That's why I think the hash is needed (subid may not be random). It would get rid of any PoW made for the event ID and whatever comes in as subID creates enough of a variance that the whole ID changes.

alexgleason commented 2 weeks ago

@vitorpamplona If we take a hash they will just PoW the hash.

mikedilger commented 2 weeks ago

Take a look at HLL-RB (HyperLogLog with random buckets). I think relays may be able to use a random number... even for the same query twice.

EDIT: I think I have been fooled by A.I. I can't find any such thing except as the output of AI. ;-( I was thinking we could use randomness, but that the harmonic mean might no longer be the right way to combine.

fiatjaf commented 2 weeks ago

True. But if someone manages to get a pubkey with a hell of a lot of zeroes, many things that they react to will seem to be highly reacted to.

We can get the offset from something like a xor between the pubkey and some part of the filter (I'd rather not take the filter JSON verbatim because that would have been preprocessed already in most cases, it would complicate implementations), then read the stuff from the event id based on that offset.

alexgleason commented 2 weeks ago

I think it's impossible to create a deterministic identifier that can't be mined.

alexgleason commented 2 weeks ago

Anonymous Bitcoiners would offer Nostr SEO packages where you zap them 5000 sats and they do proof of work on hyperloglog reactions they publish so you always have thousands of likes and reposts on everything. I suppose they can do that a lot already by just generating thousands of events.

vitorpamplona commented 2 weeks ago

@vitorpamplona If we take a hash they will just PoW the hash.

Yep, that's why I suggested hash(event id + subid). Sub id just needs to be random enough.

mikedilger commented 2 weeks ago

I think it's impossible to create a deterministic identifier that can't be mined.

We are talking about mining fresh pubkeys that would like your post. Relays can use their spam prevention logic to reject such fresh pubkeys.

I think that adding randomness would not mess up the count (on a given relay) but it would make it impossible to combine counts between multiple relays (if you naively did, you would overshoot).

fiatjaf commented 2 weeks ago

I suppose they can do that a lot already by just generating thousands of events.

Exactly, so we are not adding any new weaknesses. Relying on reaction counts from relays that will accept anything is already a very bad idea.

We are talking about mining fresh pubkeys that would like your post. Relays can use their spam prevention logic to reject such fresh pubkeys.

Agreed.

fiatjaf commented 2 weeks ago

I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think.

@mikedilger I don't fully get this, but the mod 24 was in relation to the bytes, not to the hex characters. It was maximum 24 in order to leave room for 8 bytes at the end of the id/pubkey from which we would read from. I guess we should also skip the first 8 characters since they're so often used for generic PoW, that leaves us 16 possible values for the offset, which means we can read that from a single hex character. I know this is all made in a very ad-hoc stupid way, but I like it because it's simple.

mikedilger commented 2 weeks ago

I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think.

@mikedilger I don't fully get this, but the mod 24 was in relation to the bytes, not to the hex characters. It was maximum 24 in order to leave room for 8 bytes at the end of the id/pubkey from which we would read from. I guess we should also skip the first 8 characters since they're so often used for generic PoW, that leaves us 16 possible values for the offset, which means we can read that from a single hex character. I know this is all made in a very ad-hoc stupid way, but I like it because it's simple.

Ok then I think it is worded in a confusing way. It sounded like you were taking the hex characters as ascii, and doing a mod 24 on that.

mikedilger commented 2 weeks ago

I couldn't get this to work when trying to translate the go code to rust (probably my fault). So I tried reading and implementing the 2007 paper and I got it to work. However, these algorithms differ.

https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf (see the top of page 140).

The original 2007 algorithm uses ρ(s) as being the position of the first 1 bit, not the count of leading 0s, meaning the minimum value will be 1. This can't just be corrected by a factor of 2 because it then changes the count of slots that are still 0 (every slot actually used will be at least 1). (original LogLog set slots to -∞ to do this differentiation)

Relays could each use their own different estimation algorithms, but the 256 counts will have to be normalized to either be a count of zeroes, or the position of the first 1 (a count of zeroes +1) in order to be interoperable. And it seems to me that a count of zeroes loses data.

fiatjaf commented 2 weeks ago

There are different implementations out there, I tried to do the simplest possible way that would work and wrote it on the NIP.

We will have to agree on something, we can't have each relay implementing it in a different way.

The original 2007 algorithm uses ρ(s) as being the position of the first 1 bit, not the count of leading 0s, meaning the minimum value will be 1.

This is what my implementation does too, it counts zeroes and adds 1.

mikedilger commented 1 week ago

Because NIP-45 COUNT counts the number of events, but hll counts the number of distinct pubkeys, I think perhaps the key should be "hll_pubkeys" instead of just "hll" to at least give a hint that it isn't counting the events that match the filter, and to leave room for an "hll" that does.

fiatjaf commented 1 week ago

I think in all cases so far what we actually want is distinct pubkeys, so we can be pragmatic about it.

If we ever make a different thing for the number of distinct ids we can come up with a different name, like hlli, I don't know, these keys are all hardcoded anyway.

I would prefer a JSON array over an object with keys because then we wouldn't have these discussions.

arthurfranca commented 1 week ago

Can't it work for filters with i tags? Like instead of using its raw value, hash it to make it a fixed char length, then get its 32nd char?

arthurfranca commented 1 week ago

What is the expected general relay behavior?

Is the relay upon receiving a reaction supposed to cache one hll value linked to the filter { "kinds": [7], "#e": ["<specific id>"] } and as it receives more reactions with the same e tag it would cache up to 256 hll values for the same filter?

Then it would do the same for kind:1 replies, kind:1111 comments, reposts, zaps and follow lists?

Or should it calculate up to the 256 registers using stored events and start caching just after someone asks for a specific count?

Or should it recalculate every time and maybe cache just recent ones?

mikedilger commented 1 week ago

What is the expected general relay behavior?

The straightforward solution is:

  1. Find all the matching events.
  2. Process them through the hll algorithm.
  3. Send a count of them, along with the hll output (512 characters representing 256 small integers).
  4. Don't cache anything.

Caching is hard because:

  1. the 32-byte value in the filter is not the only part of the filter. You can't presume the next filter that passes the hll appropriateness test also has the same other filter fields. And you can't easily hash the entire filter (unless your relay has a deterministic way to represent filters);
  2. every time a new event comes in, for each of the cached hlls, you have to test the event against that filter (you need to remember the filters) and then update the hll data (itself being the easy and quick part).

IMHO counting should already be very fast since it uses indexes and isn't copying data or allocating memory, especially if the data being counted is mmapped.

fiatjaf commented 1 week ago

Caching is hard

Caching is harder, but I think it can be done, especially if you want to store counts for a huge number of events using almost no disk space.

Storing millions of reactions is incredibly wasteful.

arthurfranca commented 1 week ago

For a free public relay, we could assume event posting would simply be rate-limited by IP.

Enabling hll caching comes with the downside that a malicious user can game the count.

The conclusion is that with 256 registers, someone controlling 256 IPs could instantaneously skyrocket any event counter (the cacheable counters - those that aren't filtered by user follows).

With just one IP, in the timeframe within which a regular count could be gamed up by 256, an hll count could be gamed up by millions.

Hope I'm wrong.

Semisol commented 1 week ago

You aren't wrong.

fiatjaf commented 1 week ago

The usage of registers has nothing to do with IP addresses. Any user will be able to skyrocket the number if they can generate an infinite number of pubkeys and issue likes/follows/etc with all those pubkeys. If you can generate 3000 pubkeys you can make the follow count go up by ~3000. That is true today, will be true with this change, will always be true. We have no way around it except

  1. downloading a bazillion number of events from public relays and filtering locally using WoT or whatever;
  2. relying on relays that are known to do a good job in filtering stuff -- paid relays and WoT relays from reputable humans, for example -- then merging their HLL values locally.
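Merging the HLL values locally, as in point 2, is just a register-wise maximum; this is the operation that makes values from different relays combinable, and it is why every relay must use the same register layout and offset:

```go
package main

import "fmt"

// mergeHLL combines two 256-register sketches by taking the maximum of each
// register. Arrays are passed by value in Go, so a is a local copy and the
// caller's sketches are untouched.
func mergeHLL(a, b [256]uint8) [256]uint8 {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
	return a
}

func main() {
	var x, y [256]uint8
	x[3], y[3], y[200] = 5, 2, 9
	m := mergeHLL(x, y)
	fmt.Println(m[3], m[200]) // 5 9
}
```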

If HLL values are cached on the relay side or counted at request time is irrelevant for the client and the results will be exactly the same.

arthurfranca commented 1 week ago

Edit: got triggered by a cat avatar and wrote too much. Don't want to flood the PR discussion.

Edit: Lol I thought semisol said I was wrong

mikedilger commented 1 week ago

In order to game this system, you have to produce close to 256 pubkeys each with effective PoW of about 24 (to reach 8 million likes, for example) and these newly mined pubkeys would have to be accepted by the target relay and not seen as spam-pubkeys. IMHO this is not too hard to game. It just takes a little effort and ingenuity.

But how severe is that? So what if it looks like millions of people upvoted you. Reactions are stupid anyways. It's not like they are forging your events or reading your DMs. And if clients look at the first results from the COUNT commands they can quite easily detect the HLL abuse.

Still, this is quite easy to repair, as I've said, by having some kind of randomness (perhaps supplied by the client with the COUNT, which is best for making it consistent across relays without being known to the attacker gaming things). But again the downside is that relays can't cache things anymore (or they can, but probably won't get cache hits anymore) and they have to keep the original reaction events. But at least clients don't need those events anymore.

I don't have a strong opinion either way, but I lean towards having the client-supplied random thing, because I have a penchant for reliability and trustability more so than efficiency. I think I've said all I can say about this so I'll stop commenting on it until or unless we start debating particulars after this choice is made.

vitorpamplona commented 1 week ago

Why can't the relay count and ALSO run HLL to make sure values match?

Relays MUST guarantee that the HLL is under x% of the actual number and not return an HLL if it goes over. Relays can do whatever they want to do to fix the HLL estimation.

This seems like a better protection than these shenanigans to try to block gamification.

Semisol commented 1 week ago

But having some sort of post-processing of the ID with a client supplied value already fixes this. They would need to also target the attack towards 1 client nonce only which they don't know.

What you propose basically makes it possible to force clients to fall back to inefficient methods, denying service to the COUNT HLL functionality.

vitorpamplona commented 1 week ago

Yep, but the client cannot fix the HLL if it's wrong. The relay can automatically find which event IDs are creating the distortion and fix it (ignore them) before sending it back.

Semisol commented 1 week ago

Yep, but the client cannot fix the HLL if it's wrong.

With the client nonce system, it does not need to be fixed at all.

The relay can automatically find which event IDs are creating the distortion and fix it (ignore them) before sending it back.

For this reason HLL uses a harmonic mean. For more than a few registers to get an abnormal value would be very hard, unless there were a lot of events (and then the count would be accurate).

Combining the event ID or the pubkey of the author with a client nonce means it is not possible to intentionally bias pubkeys or event IDs.
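The harmonic mean mentioned above is the standard final step of HyperLogLog, and it is why a single distorted register barely moves the result. A sketch of that estimation step, using the usual 2007 formula for m = 256 registers with the small-range linear-counting correction (the alpha constant is the standard approximation for m >= 128):

```go
package main

import (
	"fmt"
	"math"
)

// estimate turns 256 rho registers (0 meaning never touched) into a
// cardinality estimate: a harmonic mean of 2^-register scaled by alpha_m,
// falling back to linear counting when the estimate is small and some
// registers are still empty.
func estimate(regs [256]uint8) float64 {
	const m = 256.0
	alpha := 0.7213 / (1 + 1.079/m) // alpha_m approximation for m >= 128
	sum, zeros := 0.0, 0.0
	for _, r := range regs {
		sum += math.Pow(2, -float64(r))
		if r == 0 {
			zeros++
		}
	}
	est := alpha * m * m / sum
	if est <= 2.5*m && zeros > 0 {
		est = m * math.Log(m/zeros) // small-range correction
	}
	return est
}

func main() {
	var regs [256]uint8
	regs[0] = 3 // a single touched register estimates roughly one item
	fmt.Printf("%.0f\n", estimate(regs)) // 1
}
```

Because every register contributes one term to the harmonic sum, one attacker-controlled register with an absurdly high value changes its term from near 2^0 to near 0, shifting the total sum by at most 1/256th of its mass.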

vitorpamplona commented 6 days ago

Combining the event ID or the pubkey of the author with a client nonce means it is not possible to intentionally bias pubkeys or event IDs

But I don't see any real nonce in this spec. Or are we back to using the subscription ID as a random nonce?

Semisol commented 6 days ago

But I don't see any real nonce in this spec. Or are we back to using the subscription ID as a random nonce?

My proposal is to add one.

vitorpamplona commented 6 days ago

I am in favor of the client-based nonce. It's the only way to keep the counter usable.

arthurfranca commented 3 days ago

Came to the conclusion that both approaches should co-exist.

A relay that drops events: ["COUNT", <sub_id>, { "count": 123, "hll_fixed": "<fiatjaf_way>" }]

A relay that doesn't drop events (or drops with lower frequency): ["COUNT", <sub_id>, { "count": 456, "hll_fixed": "<fiatjaf_way>", "hll_dyn": "<vitorpamplona_way>" }]

Client decides which key to use when merging counts depending on what it receives from the visited relays, e.g. if client talked to 3 relays and just one of them filled the hll_dyn field, use the hll_fixed field instead.

fiatjaf commented 3 days ago

Why not 3 ways? 4 ways? 17 ways? Why have a standard at all? Let people do what they want!

vitorpamplona commented 3 days ago

Yeah, I don't know what the double approach offers. If there are outliers, the whole thing is moot. Clients can't do anything about it. It's ok if relays just fix it. We are already trusting them on the count anyway...

arthurfranca commented 1 day ago

I'm aware of the principle that there should preferably be only one way of doing things. What does having 2 ways (not 17 ways heheh) offer?

My reasoning was that some relays can freely discard events and still provide a less reliable (can be gamed, can't subtract items), though cacheable, hll count. While other relays can dynamically generate a much more reliable hll count, e.g. especially in the first days after a nostr event's creation, when it most likely won't have discarded reactions etc. yet. Then after some days, given it may have discarded old reactions etc., those relays can choose to only provide the less reliable hll.

fiatjaf commented 1 day ago

OK, I tried to build the cache thing and it's too complicated. Nostr's query system is too flexible. It will never work for open-ended queries; it would end up using more space than storing the actual events.

It's tempting to just declare caching a failed idea and standardize hyperloglog without that possibility, but it would be so good to not have to store millions of reactions and that stuff!

Now I wonder if we should standardize just some of the "basic" queries to support cached hyperloglog, like number of reactions, number of quotes, number of comments, number of followers -- so these and only these could be cached, and for all the rest (and for all the subqueries of these or other complexities) relays have to either count at runtime or not fulfill at all?

What are the "basic" count queries, besides the ones I listed above?

Or am I crazy to insist in caching?

vitorpamplona commented 1 day ago

am I crazy to insist in caching?

This seems very similar to negentropy. Caching is just not viable.

mikedilger commented 21 hours ago

I think the idea of trying to avoid storing of many low-value events on relays is laudable. I think we should spend more time trying to figure that out.

If we still need to return a normal COUNT though, we will at least need to store the ID prefixes to avoid counting duplicates. And we will need a normal COUNT result in order to detect gaming of the system.

In order for relays to keep these HLL counts without keeping the events, they need to start counting them for well-specified filters before anybody even asks for one. So I agree that we need to specify well-known filters representing things to count (e.g. number of reactions, number of followers). And in that case we cannot hash the event value (pubkey or id) with a user-supplied nonce. And since all parties need to use the same method of generating HLLs for them to combine properly, the system will be well-known and thus gameable. But outliers can be detected and rejected, preventing most gaming.

I'm almost in favor of doing this both ways... but no more. Not 17 ways. I could be persuaded to do it 2.718 ways.

mikedilger commented 21 hours ago

Or instead of keeping all the ID prefixes, you keep a bloom filter and when a "definitely not in set" happens you add 1 but when a "possibly in set" happens you add the false positive rate instead.
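A toy sketch of this bloom-filter counting idea. The two-hash filter and the fixed false-positive rate below are illustrative stand-ins, not a tuned design; the point is the accounting rule described above ("definitely new" adds 1, "possibly seen" adds only the false-positive rate).

```go
package main

import "fmt"

// bloomCounter approximates a distinct-event count without storing IDs:
// a bit array stands in for the set, and duplicates only contribute the
// filter's estimated false-positive rate instead of a full increment.
type bloomCounter struct {
	bits  []bool
	count float64
	fpr   float64 // assumed false-positive rate of the filter
}

func (bc *bloomCounter) add(id [32]byte) {
	// two toy hash positions derived from the ID bytes
	h1 := (int(id[0])<<8 | int(id[1])) % len(bc.bits)
	h2 := (int(id[2])<<8 | int(id[3])) % len(bc.bits)
	if bc.bits[h1] && bc.bits[h2] {
		bc.count += bc.fpr // possibly in set: add the fp rate instead
		return
	}
	bc.bits[h1] = true
	bc.bits[h2] = true
	bc.count++ // definitely not in set: a genuinely new event
}

func main() {
	bc := &bloomCounter{bits: make([]bool, 1<<16), fpr: 0.01}
	var a, b [32]byte
	a[1], b[1] = 1, 2
	bc.add(a)
	bc.add(a) // duplicate: contributes only the 0.01 fp rate
	bc.add(b)
	fmt.Printf("%.2f\n", bc.count) // 2.01
}
```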

Semisol commented 12 hours ago

Or am I crazy to insist in caching?

Kind of. Caching should be done at the filter level (possibly breaking filters down into smaller ones, or caching more general filters), not at the final-result level.