lbryio / lbry-sdk

The LBRY SDK for building decentralized, censorship resistant, monetized, digital content apps.
https://lbry.com
MIT License

`claim_search` Plugin API #2855

Open eukreign opened 4 years ago

eukreign commented 4 years ago

In order to improve content discovery and search on the LBRY network, the SDK needs a way to use third-party search services.

Version 1 of the proposed API will use JSON-RPC to aid development and debugging. Once the design has proven itself in practice and the exchanged data structures have solidified, we may consider switching to a binary protocol such as protobuf.

To be compatible with the SDK, a search service must expose the two endpoints described below:

search_features

This endpoint is called only once, at SDK startup, to discover the features offered by the search service. In the expected response below, bracketed string values are placeholders to be filled in:

Request: plain GET with no arguments.

Response:

{
  "id": "[short name: 'lighthouse']",
  "version": "[version: '1.0']",
  "name": "[friendly label: 'LBRY Lighthouse Search']",
  "configuration": {  # any configs affecting search results that users should know about
    "[config field]": "[config value]"
  },
  "filter": {  # the filter arguments which can be passed to the search endpoint
    "[field name]": {
      "type": "[field data type: string, integer, date, etc]",
      "constraints": ["comparison", "fts", "range", "etc"]
    }
  },
  "order_by": {  # the order_by arguments which can be passed to the search endpoint
    "[field name]": {
      "type": "[field data type: string, integer, etc]"
    }
  },
  "metadata": {  # extra metadata returned in search results for every single claim
    "[field name]": {
      "type": "[field data type: string, integer, etc]"
    }
  }
}
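For illustration, a filled-in response from a hypothetical lighthouse deployment might look like the following (the specific field names, such as channel_id and relevance, are examples rather than part of the spec):

```json
{
  "id": "lighthouse",
  "version": "1.0",
  "name": "LBRY Lighthouse Search",
  "configuration": {
    "language": "en"
  },
  "filter": {
    "channel_id": {"type": "string", "constraints": ["comparison"]},
    "release_time": {"type": "date", "constraints": ["comparison", "range"]}
  },
  "order_by": {
    "release_time": {"type": "date"}
  },
  "metadata": {
    "relevance": {"type": "float"}
  }
}
```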

search

This endpoint performs the actual search. It accepts a request with filter, order_by, and pagination parameters (limit and offset) and responds with claim_ids plus any extra metadata.

Request:

{
  "filter": {
    "[field name]": "[value]"
  },
  "order_by": [["[field1]", "desc"], ["[field2]", "asc"]],
  "offset": 0,
  "limit": 20
}

Response:

[  # claim_id is the only required value to be in the result
  {"claim_id": "[claim_id]", "[metadata field 1]": "[metadata value 1]", ...}
]
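Before forwarding a request like the one above, the SDK is expected to check the requested filter and order_by fields against the features the service advertised (workflow step 3 below). A minimal sketch of that check, using hypothetical field names and assuming the search_features response shape described earlier:

```python
# Hedged sketch (not SDK code): validate an incoming claim_search request
# against a cached search_features response. FEATURES below is a
# hypothetical example of what a search service might advertise.

FEATURES = {
    "filter": {
        "channel_id": {"type": "string", "constraints": ["comparison"]},
        "release_time": {"type": "date", "constraints": ["comparison", "range"]},
    },
    "order_by": {
        "release_time": {"type": "date"},
    },
}

def validate_request(request, features):
    """Return a list of problems; an empty list means the request is acceptable."""
    problems = []
    for field in request.get("filter", {}):
        if field not in features["filter"]:
            problems.append(f"unsupported filter field: {field}")
    for field, direction in request.get("order_by", []):
        if field not in features["order_by"]:
            problems.append(f"unsupported order_by field: {field}")
        if direction not in ("asc", "desc"):
            problems.append(f"bad sort direction: {direction}")
    if request.get("limit", 0) <= 0:
        problems.append("limit must be a positive integer")
    return problems
```

Rejecting unsupported fields up front lets the SDK return a clear error to the client instead of forwarding a request the search service cannot satisfy.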

Workflow

  1. On startup, the SDK will call search_features and cache the result for the duration of the running process.
  2. As clients connect to the SDK, it will report the available search services and their features and configs as reported by search_features.
  3. As the SDK receives claim_search requests from clients, it will validate that the filter and order_by fields are accepted by the search service.
  4. The SDK forwards a client's search request to the search service, passing the appropriate limit/offset.
  5. The SDK will check each claim returned by the search service against the block/filter lists. If any claims are blocked, it will increase the offset and send another search request, continuing to query the search service and verify results against the block/filter lists until either the search service returns no more claims or the page_size requested by the client has been filled.
  6. While verifying results against the block/filter lists and preparing the response, the SDK will look up each claim_id in its own local database to get the latest txid:nout for the claim, along with all other metadata needed to return a consistent response regardless of which search service was used.
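The refill loop in step 5 can be sketched as follows. This is a hedged illustration, not SDK code: search_fn and is_blocked are hypothetical stand-ins for the search-service call and the block/filter-list check.

```python
# Sketch of workflow step 5: keep querying the search service, skipping
# blocked claims, until the client's page is full or the service runs out
# of results. `search_fn` and `is_blocked` are hypothetical stand-ins.

def fill_page(search_fn, is_blocked, filters, order_by, page_size):
    results, offset = [], 0
    while len(results) < page_size:
        batch = search_fn(filter=filters, order_by=order_by,
                          offset=offset, limit=page_size)
        if not batch:
            break  # search service returned no more claims
        offset += len(batch)  # advance past everything already seen
        for claim in batch:
            if not is_blocked(claim["claim_id"]):
                results.append(claim)
                if len(results) == page_size:
                    break
    return results
```

Advancing the offset by the full batch size (rather than by the number of claims kept) ensures the next request starts after everything already examined, so blocked claims are never re-fetched.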
kauffj commented 4 years ago

Looks roughly good to me. Some comments/thoughts:

  1. Is metadata intended only for extra data? For example, a search score or something like that? I think it is meant this way, but confirming. It may make sense to label it xxx_metadata rather than just metadata, since unlabeled metadata is frequently assumed to be claim metadata.
  2. Can methods be called just claim_search and claim_search_features rather than new top-level naming? And with some sort of design that supports somewhat seamlessly using the plugin when possible, otherwise falling back? Ideally, if lbry-desktop is updated to use claim_search always, it should work regardless of whether connected to a wallet server with a search plugin or not. It's okay if search works less well, but we'd like to find a design where this doesn't break.
eukreign commented 4 years ago
  1. I updated the write-up with an explanation of the feature fields. But here is further explanation: metadata is specific to the search service. For example, if we move trending out of the SDK and into a search service, then the search service may want to provide extra information about the trending of a particular claim; or if full-text search was used, there may be a relevancy decimal for each result row, etc. The SDK will simply pass this down to the client, so the search service can put whatever it wants in there. This field is entirely optional, and a search service may opt to return no extra metadata (in which case its results will just be claim_ids).
  2. The APIs and even the protocols are completely different. Currently the SDK wallet servers expose claim_search as a binary protobuf protocol which returns txid:nouts (there aren't even claim_ids in that response). I think it's already confusing that the SDK has two completely different, incompatible claim_search APIs (one local, used by the app, and one on the wallet server, used by the client SDK); I'm not sure a third API endpoint with the same name will help things. That said, I don't feel too strongly about the name; if there is broader consensus to name the RPC functions in lighthouse claim_search and claim_search_features, I'd be happy to update this issue.