nostr-protocol / nips

Nostr Implementation Possibilities

NIP-XX: `TOP` verb #643

Open CodyTseng opened 1 year ago

CodyTseng commented 1 year ago

Recently I wanted to implement NIP-50 on my relay, so I read and reread NIP-50 many times. I now think it's not a good idea to add search capability to the REQ message. The main reasons are as follows:

• It would be better to sort the search results by relevance, but REQ is sorted by created_at. If the results are sorted by relevance, we need to introduce offset.

• Search usually doesn't care about future events. It is also difficult for the relay to directly judge whether a new event matches key words.

• To achieve a good search effect, it is usually necessary to use a search engine. If the search capability is added to REQ, the relay needs to obtain data from two different types of databases, which undoubtedly increases the complexity and burden of the relay.

For the above reasons, I think it would be better to add a new verb SEARCH. It would avoid increasing the complexity of REQ, and would also allow the protocol for the search feature to be formulated more cleanly.

Not sure if my understanding is correct, I'd like to hear your opinions.

staab commented 1 year ago

Yes, I agree, see nip 45 in which COUNT is defined using a separate verb. Switching would be a breaking change, but search is not well supported so it could be feasible.

CodyTseng commented 1 year ago

Yes, I agree, see nip 45 in which COUNT is defined using a separate verb. Switching would be a breaking change, but search is not well supported so it could be feasible.

I think it would be easier to implement after separating out the search capability, and it would also address the situation where most relays don't support search today.

fiatjaf commented 1 year ago

@darashi @brugeman

darashi commented 1 year ago

I think it is a good idea to have a separate SEARCH verb to implement relevance-ordered searches. It would be difficult to implement while keeping consistency with the REQ specification.

• It would be better to sort the search results by relevance, but REQ is sorted by created_at. If the results are sorted by relevance, we need to introduce offset.

Considering NIP-23, it would be nice to be able to sort by relevance. I’m not sure if we can handle offset well. Maybe something like a cursor would be better, but it would be hard to implement. Especially if the client connects to multiple relays with independent ranking algorithms.

• Search usually doesn't care about future events. It is also difficult for the relay to directly judge whether the new event matches key words.

In my opinion, there are use cases for time-ordered searches, and it is useful for future events to be streamed. I built https://nos.today/. It is automatically updated as new notes matching queries come in. The current NIP-50 specification is perfect for this use case.

nos.today uses my NIP-50 implementation https://github.com/darashi/searchnos as one of its backends, which is powered by Elasticsearch. After EOSE, it polls Elasticsearch and streams the results. In principle, it should also be possible to search against a stream (in which case there is no need for an index like Elasticsearch, which could be rather friendly to compute resources).

• To achieve a good search effect, it is usually necessary to use a search engine. If the search capability is added to REQ, the relay needs to obtain data from two different types of databases, which undoubtedly increases the complexity and burden of the relay.

Searchnos intentionally uses only one type of database (Elasticsearch). Internally, it has the ability to handle all of the filters for REQs with its Elasticsearch index (if my implementation is correct). However, as a relay, it is quite a bit slower than other implementations, so it ignores all filters that do not include search. Usually, users use multiple relays, so this is not a problem. It's not a very pretty solution, though. It is definitely a challenge to implement a single relay that does both.
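(For illustration: a minimal sketch of how a NIP-01 filter could be translated into an Elasticsearch-style bool query, as a relay like the one described above might do internally. This is not Searchnos's actual code; the field names `kind`, `created_at`, and `content` are assumed index mappings.)

```python
def filter_to_es_query(nostr_filter: dict) -> dict:
    """Build an Elasticsearch-style query dict from a NIP-01 filter."""
    must, es_filter = [], []
    if "search" in nostr_filter:
        # Full-text match against the indexed event content.
        must.append({"match": {"content": nostr_filter["search"]}})
    if "kinds" in nostr_filter:
        es_filter.append({"terms": {"kind": nostr_filter["kinds"]}})
    created_at = {}
    if "since" in nostr_filter:
        created_at["gte"] = nostr_filter["since"]
    if "until" in nostr_filter:
        created_at["lte"] = nostr_filter["until"]
    if created_at:
        es_filter.append({"range": {"created_at": created_at}})
    return {"query": {"bool": {"must": must, "filter": es_filter}}}

query = filter_to_es_query({"kinds": [30023], "search": "nostr", "since": 0})
```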

CodyTseng commented 1 year ago

@darashi Thank you very much for the detailed explanation of your views; I have learned a lot from them.

Considering NIP-23, it would be nice to be able to sort by relevance. I’m not sure if we can handle offset well. Maybe something like a cursor would be better, but it would be hard to implement. Especially if the client connects to multiple relays with independent ranking algorithms.

I don't think offset is a good idea either, but it's also hard to do with a cursor. The only way I can think of using a cursor is to query a large number of events at once, cache the IDs of all the events, and wait for the next query; the client would then use an ID as a cursor to get the next page. This would put a lot of pressure on the relay. In the end, I think filtering by since and until is still a good idea. The client can use since and until to search for the events with the highest relevance over different time periods, and the size of the time interval can be adjusted to retrieve more or fewer lower-relevance events. Of course, this approach will lose some events with lower relevance, but I think that is an acceptable cost.
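(For illustration: a sketch of the windowed since/until approach described above, stepping backwards through time in fixed-size slices and issuing one relevance-ranked query per slice. `search_window` is a hypothetical callback standing in for a relay query.)

```python
def windowed_search(search_window, newest: int, oldest: int, step: int):
    """Yield results slice by slice, from the newest window back to the oldest."""
    until = newest
    while until > oldest:
        since = max(until - step, oldest)
        # Each window returns its own top results, sorted by relevance.
        yield from search_window(since=since, until=until)
        until = since

# Example: pretend each window returns its bounds as a single "result".
hits = list(windowed_search(lambda since, until: [(since, until)],
                            newest=3000, oldest=0, step=1000))
# hits == [(2000, 3000), (1000, 2000), (0, 1000)]
```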

In my opinion, there are use cases for time-ordered searches, and it is useful for future events to be streamed. I built nos.today. It is automatically updated as new notes matching queries come in. The current NIP-50 specification is perfect for this use case.

nos.today uses my NIP-50 implementation darashi/searchnos as one of its backends, which is powered by Elasticsearch. After EOSE, it polls Elasticsearch and streams the results. In principle, it should also be possible to search against a stream (in which case there is no need for an index like Elasticsearch, which could be rather friendly to compute resources).

nos.today is a great project, which has made me realize some needs that I hadn't thought of before. I will continue to follow the progress of this project and hope to have the opportunity to participate in it in the future. Now I think we may also need an option for the client to choose the sorting method for search results, either by created_at or by relevance.

Searchnos intentionally uses only one type of database (Elasticsearch). Internally, it has the ability to handle all of the filters for REQs with its Elasticsearch index (if my implementation is correct). However, as a relay, it is quite a bit slower than other implementations, so it ignores all filters that do not include search. Usually, users use multiple relays, so this is not a problem. It's not a very pretty solution, though. It is definitely a challenge to implement a single relay that does both.

If the search capability is separated out, adding search capability to a relay will not make the logic of REQ complex. I agree with your point that it is definitely a challenge for a single relay to do both.

brugeman commented 1 year ago

I don't agree that a separate SEARCH verb is necessary.

It would be better to sort the search results by relevance, but REQ is sorted by created_at. If the results are sorted by relevance, we need to introduce offset.

I thought so before, and now I think that's a bad idea. I think that a scalable search with a sophisticated ranking function would only allow you to retrieve the top N results, because letting clients scan the full matched document set in ranked order would have astronomical costs. So instead of a separate SEARCH verb with custom cursors for pagination, I'm now leaning toward a TOP verb that returns the N most relevant event refs (say, the 1000 top-ranked event ids), and clients would load the matching events using an "ids" REQ filter in small batches. TOP could work with all kinds of REQ filters, not just NIP-50 ones.
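(For illustration: a sketch of the client side of this idea, splitting the ranked event ids a relay returns into small "ids" REQ filters. The subscription id and batch size here are arbitrary choices, not part of any spec.)

```python
import json

def ids_to_req_batches(sub_id: str, event_ids: list, batch_size: int = 100):
    """Split ranked event ids into REQ messages, each with an "ids" filter."""
    for start in range(0, len(event_ids), batch_size):
        batch = event_ids[start:start + batch_size]
        yield json.dumps(["REQ", sub_id, {"ids": batch}])

# 250 ranked ids in batches of 100 -> 3 REQ messages.
reqs = list(ids_to_req_batches("top-1", [f"id{n}" for n in range(250)], 100))
```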

Search usually doesn't care about future events. It is also difficult for the relay to directly judge whether the new event matches key words.

I don't think this is the case. A NIP-50 chronological feed doesn't differ much in principle from any other feed; why would clients be less interested in receiving future events?

To achieve a good search effect, it is usually necessary to use a search engine. If the search capability is added to REQ, the relay needs to obtain data from two different types of databases, which undoubtedly increases the complexity and burden of the relay.

If a SEARCH verb did exist and supported all the filters that REQ supports, then wouldn't you have to have two relay database implementations, one with search and one without it? I don't see how it differs much from a "search" filter field. At nostr.band we just have a single custom relay database that serves all REQs.

If SEARCH didn't support all REQ filters then that's a completely different story, please tell us more if that's on your mind.

CodyTseng commented 1 year ago

Hey @brugeman, I agree that letting clients fully scan all matching documents in ranked order is not feasible. This is exactly the issue I want to discuss. In my response to darashi, I mentioned a TOP-like solution, but my initial idea was to store the matching IDs in the relay, which is obviously a foolish idea. TOP returns the N most relevant event refs to clients, and the clients then load the matching events using an "ids" REQ filter in small batches, which sounds like a good solution. My original goal was to provide the ability to search for the events with the highest relevance. I think TOP can solve this problem, and it would be even better if TOP supported some simple filters (such as kind and created_at).

If a SEARCH verb did exist and supported all the filters that REQ supports, then wouldn't you have to have two relay database implementations, one with search and one without it? I don't see how it differs much from a "search" filter field. At nostr.band we just have a single custom relay database that serves all REQs.

If SEARCH existed, it would indeed require two different database implementations. However, since SEARCH would be a new thing, it could skip implementing some REQ filters. Most search engines have weak filtering capabilities, but their advantage is that they can quickly perform full-text searches. For example, most search engines do not support filtering by prefixes. But if I could customize a database like the nostr.band team did, then none of these would be problems 🤙

Finally, I think TOP is more in line with the way search engines are used, and we can definitely continue to move in this direction.

darashi commented 1 year ago

At nostr.band we just have a single custom relay database that serves all REQs.

Having tried to implement and run a relay myself, I find that to be an amazing accomplishment!

CodyTseng commented 1 year ago

At nostr.band we just have a single custom relay database that serves all REQs.

Having tried to implement and run a relay myself, I find that to be an amazing accomplishment!

How did you do it? Can you give me some pointers?

CodyTseng commented 1 year ago

I have implemented the TOP verb on my relay according to @brugeman's idea. I have synchronized the 30023 events from relay.damus.io and nos.lol to my relay for testing. Anyone interested can come and try it (wss://nostr-relay.app). Suggestions are welcome.

The TOP verb accepts a query string and a filter as specified in NIP-01 for the REQ verb.

["TOP",<query string>,<filter JSON>]

The relay returns the top N event IDs with the highest relevance, sorted by relevance.

["TOP",<query string>,<event id array>]

Example:

["TOP","nostr bitcoin",{"kinds":[30023],"limit":10}]

["TOP","nostr bitcoin",["2359f4bdfe0bd2353aa7702dc1af23279197694823b8b4916b904a9940334192","622a875c9f9a4696eb4050fa5b0bba3a9b0531ec4a27398245af7369e6d40da8","d8989c65d26511b2e3ea42b0ebfcaf0ea885cb958419df4ddb334cb72556f950","ffcb0c9e0ace0b5d3928f30395bc9832763f8b583f2b1beb696f7c199f9f94d2","287147867bd00299553fa91e110d40206eea19a9142a4283832ee67e1407e6f2","ffaea8bc3b08db32af97f1ff595e68eee8a2f7b0a4a66dc2eff330f450855f6c","cddbc6cd4a0589d4a593e99a3a94426c85c6867b47d7eb751ce419c27f079b76","f2291ac6d206e898965b9e4ba6bbe5bb10118e6a74bd9f9f13597813979a254b","a101a2a44938dbb0a611bc00bd7ed4cb44d682fea4c14618bd1148567cd6fcc3","21990a723b491b6c594438a2ecf5d5e4898212635f59e82f1c736d994a86e907"]]
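(For illustration: a minimal sketch of this wire format, building a `["TOP",<query string>,<filter JSON>]` request and reading the `["TOP",<query string>,<event id array>]` response. It mirrors the message layout shown above, not any official client library.)

```python
import json

def build_top_request(query: str, flt: dict) -> str:
    """Serialize a TOP request: ["TOP", <query string>, <filter JSON>]."""
    return json.dumps(["TOP", query, flt])

def parse_top_response(raw: str):
    """Parse a TOP response: ["TOP", <query string>, <event id array>]."""
    verb, query, ids = json.loads(raw)
    if verb != "TOP" or not isinstance(ids, list):
        raise ValueError("not a TOP response")
    return query, ids

msg = build_top_request("nostr bitcoin", {"kinds": [30023], "limit": 10})
query, ids = parse_top_response('["TOP","nostr bitcoin",["abc123"]]')
```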
nostrband commented 1 year ago

Why put the query where a sub_id belongs? Why not put it into the filters using NIP-50 search field?

["TOP",<sub_id>,<filter JSON>]
["TOP","s123",{"kinds":[30023],"search":"nostr bitcoin","limit":10}]
["TOP","s123",["2359f4bdfe0bd2353aa7702dc1af23279197694823b8b4916b904a9940334192","622a875c9f9a4696eb4050fa5b0bba3a9b0531ec4a27398245af7369e6d40da8","d8989c65d26511b2e3ea42b0ebfcaf0ea885cb958419df4ddb334cb72556f950","ffcb0c9e0ace0b5d3928f30395bc9832763f8b583f2b1beb696f7c199f9f94d2","287147867bd00299553fa91e110d40206eea19a9142a4283832ee67e1407e6f2","ffaea8bc3b08db32af97f1ff595e68eee8a2f7b0a4a66dc2eff330f450855f6c","cddbc6cd4a0589d4a593e99a3a94426c85c6867b47d7eb751ce419c27f079b76","f2291ac6d206e898965b9e4ba6bbe5bb10118e6a74bd9f9f13597813979a254b","a101a2a44938dbb0a611bc00bd7ed4cb44d682fea4c14618bd1148567cd6fcc3","21990a723b491b6c594438a2ecf5d5e4898212635f59e82f1c736d994a86e907"]]
nostrband commented 1 year ago

Also, why not an array of filters like in REQ? ["TOP","s123",[{"kinds":[30023],"search":"nostr bitcoin","limit":10}]]

CodyTseng commented 1 year ago

Thank you for your valuable suggestions.

Also, why not an array of filters like in REQ? ["TOP","s123",[{"kinds":[30023],"search":"nostr bitcoin","limit":10}]]

Because I haven't figured out what format the result should be if multiple filters are supported: a single merged ["TOP","s123",[ids]], or one array per filter, ["TOP","s123",[ids],[ids],[ids]]?

Why put the query where a sub_id belongs? Why not put it into the filters using NIP-50 search field?

If multiple filters are not supported, the query string can serve as the sub_id. Of course, I think it could also be put into the filter using the search field; in that case, the search field in the filter would take precedence.

["TOP",<sub_id or search>,<filter JSON>]

If there is a need to support multiple filters, then the sub_id position should indeed not be repurposed.

nostrband commented 1 year ago

Because I haven't figured out what format the result should be if multiple filters are supported: a single merged ["TOP","s123",[ids]], or one array per filter, ["TOP","s123",[ids],[ids],[ids]]?

I'm guessing the result format doesn't change - it could work like REQ does, all results from all filters are merged, sorted, and returned as a single TOP list of event ids. REQ does sorting by created_at, TOP simply uses a different sort algo (in the abstract - the actual implementation is very different, of course).
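(For illustration: a sketch of the server-side merge described above, where each filter yields (event_id, score) pairs and the relay merges, dedupes, and returns one id list sorted by score. The scores here are made-up examples; real relays would use their own ranking function.)

```python
def merge_top_results(result_sets, limit=10):
    """Merge per-filter (id, score) lists into one ranked, deduped id list."""
    best = {}
    for results in result_sets:
        for event_id, score in results:
            # Keep the highest score seen for each event id.
            if score > best.get(event_id, float("-inf")):
                best[event_id] = score
    ranked = sorted(best, key=lambda eid: best[eid], reverse=True)
    return ranked[:limit]

merged = merge_top_results([[("a", 0.9), ("b", 0.5)], [("b", 0.7), ("c", 0.8)]])
# merged == ["a", "c", "b"]
```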

If multiple filters are not supported, the query string can be equivalent to sub_id.

I don't think so - query-string would then have to include kinds and other filters. Otherwise I won't be able to send TOP w/ kind:1 and TOP w/ kind:30023 w/ same search field in parallel. Or rather - I wouldn't be able to distinguish the results:

["TOP","nostr bitcoin",{"kinds":[30023],"limit":10}]
["TOP","nostr bitcoin",{"kinds":[1],"limit":10}]
["TOP","nostr bitcoin",["2359f4bdfe0bd2353aa7702dc1af23279197694823b8b4916b904a9940334192"]]
["TOP","nostr bitcoin",["21990a723b491b6c594438a2ecf5d5e4898212635f59e82f1c736d994a86e907"]]

Besides, TOP has nothing special to do w/ searching, it could work with any filters, in which case again - placing special meaning to sub_id doesn't seem right.

CodyTseng commented 1 year ago

TOP simply uses a different sort algo (in the abstract - the actual implementation is very different, of course).

I don't quite understand this approach. There may be no correlation between different search results, right? I'll learn more about this.

I don't think so - query-string would then have to include kinds and other filters. Otherwise I won't be able to send TOP w/ kind:1 and TOP w/ kind:30023 w/ same search field in parallel. Or rather - I wouldn't be able to distinguish the results:

You are right! I will modify my code.

nostrband commented 1 year ago

I don't quite understand this approach. There may be no correlation between different search results, right? I'll learn more about this.

The first thing that comes to mind is searching through kind:1, kind:1311, and kind:42 - these are all similar things (short messages), so I might wish not only to search through them all, but to sort them properly - we can't do that on the client without sending the ranking scores to the client. So sending several filters in TOP and merging the results on the server like REQ does makes sense, doesn't it?

CodyTseng commented 1 year ago

The first thing that comes to mind is searching through kind:1, kind:1311, and kind:42 - these are all similar things (short messages), so I might wish not only to search through them all, but to sort them properly - we can't do that on the client without sending the ranking scores to the client. So sending several filters in TOP and merging the results on the server like REQ does makes sense, doesn't it?

If the search keywords are the same, it is indeed possible to merge the search results like REQ does and sort them by relevance. However, if the search keywords are different, merging may not be very meaningful.

How do you think about this:

["TOP",<sub_id>,<query_string>,<filters JSON>...]
nostrband commented 1 year ago

If the search keywords are the same, it is indeed possible to merge the search results like REQ does and sort them by relevance.

You're thinking TOP relates to search; I'm thinking TOP just means better stuff, with the meaning of "better" being relevance (keyword-specific) or trust rank (non-keyword-specific) or whatever. Each relay would have its own meaning and could advertise it, just like they would with their search relevance function.

However, if the search keywords are different, the meaning of merging may not be significant.

If someone sends a meaningless set of filters in TOP and gets meaningless results, it's their fault. I don't think you can prevent that, and it's not specific to keywords.

How do you think about this: ["TOP",<sub_id>,<query_string>,<filters JSON>...]

I don't think it's necessary to tie TOP w/ keywords.

CodyTseng commented 1 year ago

You're thinking TOP relates to search; I'm thinking TOP just means better stuff, with the meaning of "better" being relevance (keyword-specific) or trust rank (non-keyword-specific) or whatever. Each relay would have its own meaning and could advertise it, just like they would with their search relevance function.

I understand your idea now, it's cool and I shouldn't have limited TOP to just searching. Different relays can have their own scoring mechanisms. I learned a lot today, thank you.