jzonthemtn commented 9 months ago

User Behavior Insights (UBI)

This RFC has been revised to describe an approach more integrated with OpenSearch. We now call this functionality "User Behavior Insights" (UBI).

Summary

This RFC is an evolution of 4619 to capture user behaviors and track queries through all steps of querying and website usage.

This RFC proposes functionality in OpenSearch to store application user behavior and corresponding queries in OpenSearch indexes. It also includes an analytics dashboard integrated with OpenSearch Dashboards for analyzing and visualizing the collected information.

UBI will link client-side actions with backend search actions, such as linking queries submitted by users with customer clients, scroll depth, and search result detail pages viewed.

What users have asked for this feature?

This functionality has been discussed at the OpenSearch Search Relevance Meetup and through individual conversations with users of OpenSearch and with the larger community.

What problems are you trying to solve?

The key problem is that OpenSearch users are missing a holistic view of client-side, browser, and app events to enable a deeper understanding of search user behavior for the purposes of improving search relevance and user experience.

With this tooling, users of OpenSearch will be able to collect client-side events and link them with queries from their data stores. This will allow users to create a comprehensive view of users’ search journeys to improve the user experience.

What is the developer experience going to be?

Pre-Existing Work

The work described here has been successfully implemented as an OpenSearch plugin. Due to several factors such as maintaining a plugin, promoting adoption, and ease of use within OpenSearch, it has been determined that a plugin is not the optimal approach. This RFC has been updated to reflect this new direction.

For a description of the plugin's implementation, please see previous revisions of this issue or the plugin's repository.

Proposed Work

Core Contributions

All functionality will be directly implemented in the OpenSearch GitHub project. Queries performed against OpenSearch along with the list of query results will be persisted to an OpenSearch index.
Two indexes (described below) will be created to facilitate the persistence of client-side events.
Clients will be responsible for indexing client-side events in OpenSearch; this project will not add any endpoints to facilitate the indexing of events. This allows clients to use whatever method they prefer to index client-side events, whether it be directly indexing, using a custom pipeline, DataPrepper, OpenTelemetry, or other method of their choice.

Persistence of Queries and Client-Side Events

Queries, including their results, and client-side events will be indexed to two OpenSearch indices. One index will contain the queries, and the other will contain the client-side events.

These indices are .ubi_queries and .ubi_events. They will be created automatically and will store queries and events for all OpenSearch indexes. (In the plugin implementation there was the concept of a "store", with a one-to-one correspondence between a store and an OpenSearch index. This is no longer necessary, as the same result can be accomplished with only these two indexes.)

Schema of Queries Index

The queries index will contain all queries that were received by OpenSearch which include a top-level ubi block. The timestamp, query_id, and other information about the query will be indexed.

{
  "dynamic": false,
  "properties": {
    "timestamp": {
      "type": "date"
    },
    "index": { "type": "keyword", "ignore_above": 100 },
    "query_id": { "type": "keyword", "ignore_above": 100 },
    "query": {
      "type": "text"
    },
    "query_response_id": { "type": "keyword", "ignore_above": 100 },
    "query_response_hit_ids": { "type": "keyword" },
    "user_id": { "type": "keyword", "ignore_above": 100 },
    "session_id": { "type": "keyword", "ignore_above": 100 }
  }
}
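As a rough illustration of what a document in the queries index might look like, the sketch below assembles one record using only the fields declared in the mapping above. In the proposal OpenSearch itself writes these documents, so `build_query_record` is a hypothetical helper, and generating `query_response_id` as a random UUID is an assumption rather than something the RFC specifies.

```python
import json
import time
import uuid

# Fields declared in the queries mapping above. With "dynamic": false,
# any other field would be kept in _source but not indexed.
QUERY_FIELDS = {
    "timestamp", "index", "query_id", "query", "query_response_id",
    "query_response_hit_ids", "user_id", "session_id",
}


def build_query_record(index, query, hit_ids, user_id=None, session_id=None,
                       query_id=None):
    """Hypothetical helper: assemble one document for the queries index."""
    return {
        "timestamp": int(time.time() * 1000),        # epoch milliseconds
        "index": index,
        "query_id": query_id or str(uuid.uuid4()),   # generated if not supplied
        "query": json.dumps(query),                  # mapped as text
        "query_response_id": str(uuid.uuid4()),      # assumption: one UUID per response
        "query_response_hit_ids": list(hit_ids),
        "user_id": user_id,
        "session_id": session_id,
    }


record = build_query_record(
    index="students",
    query={"query_string": {"query": "the wind AND (rises OR rising)"}},
    hit_ids=["1"],
    session_id="session-abc",
)
print(json.dumps(record, indent=2))
```

Note that keyword values longer than the `ignore_above` limit (100 characters here) would still be stored but would not be indexed for search.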

Schema of Client-Side Events Index

The events index will contain the client-side events indexed into OpenSearch by the client. Some fields are standardized; most are optional. Others can be customized as needed.

{
  "properties": {
    "query_id": {
      "type": "keyword",
      "ignore_above": 100
    },
    "action_name": {
      "type": "keyword",
      "ignore_above": 100
    },
    "user_id": {
      "type": "keyword",
      "ignore_above": 100
    },
    "session_id": {
      "type": "keyword",
      "ignore_above": 100
    },
    "page_id": {
      "type": "keyword",
      "ignore_above": 256
    },
    "message": {
      "type": "keyword",
      "ignore_above": 256
    },
    "message_type": {
      "type": "keyword",
      "ignore_above": 100
    },
    "timestamp": {
      "type": "date",
      "doc_values": true
    },
    "event_attributes": {
      "properties": {
        "user_name": {
          "type": "keyword",
          "ignore_above": 256
        },
        "user_id": {
          "type": "keyword",
          "ignore_above": 100
        },
        "email": {
          "type": "keyword"
        },
        "price": {
          "type": "float"
        },
        "ip": {
          "type": "ip",
          "ignore_malformed": true
        },
        "browser": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "position": {
          "properties": {
            "ordinal": {
              "type": "integer"
            },
            "x": {
              "type": "integer"
            },
            "y": {
              "type": "integer"
            },
            "page_depth": {
              "type": "integer"
            },
            "scroll_depth": {
              "type": "integer"
            },
            "trail": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "object": {
          "properties": {
            "key_value": {
              "type": "keyword"
            },
            "object_id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "object_type": {
              "type": "keyword",
              "ignore_above": 100
            },
            "transaction_id": {
              "type": "keyword",
              "ignore_above": 100
            },
            "name": {
              "type": "keyword",
              "ignore_above": 256
            },
            "description": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "to_user_id": {
              "type": "keyword",
              "ignore_above": 100
            },
            "object_detail": {
              "type": "object"
            }
          }
        }
      }
    }
  }
}
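Since clients are responsible for indexing their own events, one possible approach is to batch events and send them through the standard `_bulk` API. The sketch below only builds the newline-delimited request body and leaves the HTTP call to whatever client is in use. The helper names and the `"click"` action name are illustrative assumptions, not part of the proposal.

```python
import json
import time


def build_event(action_name, query_id, session_id, object_id=None, ordinal=None):
    """Hypothetical helper: build one client-side event using mapped fields."""
    event = {
        "action_name": action_name,
        "query_id": query_id,        # links the event back to the search
        "session_id": session_id,
        "timestamp": int(time.time() * 1000),
        "event_attributes": {},
    }
    if object_id is not None:
        event["event_attributes"]["object"] = {"object_id": object_id}
    if ordinal is not None:
        event["event_attributes"]["position"] = {"ordinal": ordinal}
    return event


def to_bulk_body(events, index=".ubi_events"):
    """Serialize events as an NDJSON body for the _bulk API."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event))                         # document line
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


click = build_event(
    "click",  # assumed action name, for illustration only
    query_id="300d16cb-b6f1-4012-93eb-cc49cac90426",
    session_id="session-abc",
    object_id="1",
    ordinal=3,
)
body = to_bulk_body([click])
print(body)
```

The resulting body would be sent as a POST to `_bulk` with the `application/x-ndjson` content type.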

Query Requests and Query Responses

Assumption: the user is on a search-enabled website powered by OpenSearch containing the functionality described above.

When the user performs a search on the website, the query is sent to OpenSearch with a ubi block in the request. This ubi block provides information about the search and the presence of the block tells OpenSearch to persist this query and the query's results. An example ubi block is:

GET _search
{
  "ubi": {
    "query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426",
    "options": {
      "robot": false,
      "mobile": true,
      "experiment_id": "exp00456"
    }
  },
  "query": {
    "query_string": {
      "query": "the wind AND (rises OR rising)"
    }
  }
}

The fields and their names in the ubi block may change, but the important part is the query_id value which uniquely identifies this search. This value is used to link client-side events with searches, and vice-versa. If the query_id value is not provided, OpenSearch will generate a random query_id and return its value in the search response.

The presence of the ubi block in the search request causes OpenSearch to index the query and the query results.
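A client that manages its own query IDs could attach the ubi block along these lines. This is a minimal sketch: `with_ubi` is a hypothetical helper, and the field names inside the block follow the example above, which the RFC notes may still change.

```python
import json
import uuid


def with_ubi(search_body, query_id=None, **options):
    """Hypothetical helper: return a copy of search_body with a ubi block."""
    body = dict(search_body)  # shallow copy so the caller's dict is untouched
    ubi = {"query_id": query_id or str(uuid.uuid4())}
    if options:
        ubi["options"] = options
    body["ubi"] = ubi
    return body


request = with_ubi(
    {"query": {"query_string": {"query": "the wind AND (rises OR rising)"}}},
    robot=False,
    mobile=True,
    experiment_id="exp00456",
)
print(json.dumps(request, indent=2))
```

Generating the ID on the client means it is known before the response arrives, so events can be tagged with it immediately.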

Every search result has a unique ID. That result ID can be carried through the whole reporting system so that all actions are correlated with the result they came from. In many applications, there is additionally a unique item ID that identifies the underlying object referred to by the result ID. There is an N-to-1 relationship between result IDs and item IDs: the same object may have been returned as result 2 of search 1234 and as result 7 of search 3456.
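The relationship between result IDs and item IDs can be pictured with a small in-memory example. Here a (search ID, position) pair stands in for a result ID. How result IDs are actually formed is an assumption made for illustration, and the item IDs are invented.

```python
from collections import defaultdict

# Assumption: a (search_id, position) pair stands in for a result ID.
# Each result ID points at exactly one underlying item.
item_for_result = {
    ("1234", 2): "item-42",
    ("3456", 7): "item-42",  # same object, returned by a different search
    ("3456", 1): "item-07",
}

# Invert the mapping: one item may be reached through many result IDs.
results_for_item = defaultdict(list)
for result_id, item_id in item_for_result.items():
    results_for_item[item_id].append(result_id)

for item_id, result_ids in sorted(results_for_item.items()):
    print(item_id, sorted(result_ids))
```

Keeping both IDs on each event lets an analysis ask either "what happened to this result in this search?" or "how does this item perform across all searches?".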

Similarly, the search response will be modified to also include a ubi block:

{
  "took": 13,
  "timed_out": false,
  "ubi": {
    "query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426"
  },
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.9808291,
    "hits": [
      {
        "_index": "students",
        "_id": "1",
        "_score": 0.9808291,
        "_source": {
          "name": "John Doe",
          "gpa": 3.89,
          "grad_year": 2022
        }
      }
    ]
  }
}

In the example above, the search response has been modified to include a ubi block which contains the query_id. If a query_id was provided in the query request, this will be the same value. If a query_id was not provided in the query request, the query_id in the response will be a random UUID. It is recommended that clients manage their own query IDs but OpenSearch will generate a random query ID when necessary to avoid any breaking behavior or undesired effects.
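On the client, the query ID to attach to later events could be resolved from the response roughly as follows. This is a sketch under the behavior described above. `resolve_query_id` is a hypothetical helper, and treating a mismatch as an error is a design choice made here, not something the RFC mandates.

```python
def resolve_query_id(response, sent_query_id=None):
    """Hypothetical helper: pick the query_id to attach to later events."""
    returned = response.get("ubi", {}).get("query_id")
    if returned is None:
        # No ubi block in the response; fall back to what the client sent.
        return sent_query_id
    if sent_query_id is not None and returned != sent_query_id:
        # The response is described as echoing a supplied ID, so a mismatch
        # is unexpected.
        raise ValueError("query_id in response differs from the one sent")
    return returned


response = {
    "took": 13,
    "timed_out": False,
    "ubi": {"query_id": "300d16cb-b6f1-4012-93eb-cc49cac90426"},
    "hits": {"total": {"value": 1, "relation": "eq"}, "hits": []},
}
query_id = resolve_query_id(response)
print(query_id)
```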

Client-side JavaScript Reference Implementation

A reference implementation of the JavaScript client-side code to capture common events and index those events in OpenSearch will be provided. The code is not intended to be comprehensive or complete, but rather a starting point for users to modify to meet their unique needs.

Code Drops

The Code Drops described below were chosen to be atomic pieces of work suitable for pull requests and review/commit by OpenSearch maintainers. They were similarly selected to avoid any breaking changes. All Code Drops include the appropriate documentation and tests.

Code Drop 1 - Passing a ubi block in a query request and receiving back a search response with a ubi block containing the received query_id or a generated query_id if none was sent by the client.
Code Drop 2 - Automatically creating the indexes to store the queries and client-side events.
Code Drop 3 - Queries received containing a ubi block are persisted to the queries index.
Code Drop 4 - Addition of any user-configurable options for customizing operation.

Open Source and Best Practices

Research of currently available open source libraries under acceptable licenses will be conducted to discover which can be either utilized directly or customized to meet our needs.

We will “program to the interface” to permit future extensibility. For instance, while event data will be stored in OpenSearch, there will be no restrictions on creating the ability to use a relational database as the backend instead.

The development plan will evolve over time. Whenever possible, so as to not reinvent the wheel, priority will be given to the use of existing open source code as well as the application of existing standards.

Are there any security considerations?

Event data will be sent to OpenSearch for indexing and all communication needs to be over secure channels.
The client-side event capturing code must behave ethically and only track user activity when permitted.
Strict security parameters and constraints must be in place to connect the client code to the backend OpenSearch logging engine.

The community’s input around these items will be vital during development.

Are there any breaking changes to the API?

No breaking changes to the API are expected.

What is the user experience going to be?

The user will be able to analyze the collected events via a dashboard that is integrated with OpenSearch Dashboards. This functionality will likely be implemented as its own OpenSearch Dashboards plugin or integrated into the OpenSearch dashboards-search-relevance plugin.

The data will be queryable using SQL and/or DSL, and be exportable to an external data store for additional analysis or training machine learning models.

Are there breaking changes to the User Experience?

No breaking changes to the user experience are expected.

How is this different from other click-tracking applications?

- It is focused on the highly granular data collection and analysis necessary for search relevance tuning, not on reporting on aggregates.
- It gives developers full control of and access to their data.
- It takes advantage of OpenSearch’s ability to log and analyze data, while leaving the client developer free to choose whichever JavaScript library they want, such as Snowplow.
- This will provide a stronger “out of the box” search analytics focus than more general tools.
- Stretch goal: near real time event tracking, with an eye to being able to provide data for personalization as the user is engaging with the search experience. Simply put, the ability to learn about an individual's preferences, not just focus on aggregated user preferences.

peternied commented 9 months ago

[Triage - attendees 1 2 3 4 5 6 7 8] @jzonthemtn Thanks for writing this up, looking forward to seeing more details as this gets worked on.

epugh commented 9 months ago

We have a call for names for this tooling, join the thread at https://opensearch.slack.com/archives/C051JEH8MNU/p1706895236557209

shikeli commented 9 months ago

Thanks for proposing this. I would like to treat events as metadata. In our use case, we carry metadata in the REST API request headers, such as client_id and data object type. We really need the raw OpenSearch request and response bodies logged to AWS CloudWatch or S3 for query pattern analysis. We explored writing a plugin and hit a hard blocker in the OpenSearch security plugin: it already overrides getRestHandlerWrapper, so we cannot override it in our own plugin to do our own logging, and as far as I know the restHandlerWrapper is the only way to do this. I hope this RFC can be prioritized for the community. Alternatively, if anyone can point me to another way to generate these logs, it would be appreciated.

Gaganjuneja commented 9 months ago

Hi @jzonthemtn, thank you for initiating this. Indeed, this feature holds significant potential. I firmly believe that it can be implemented using the Request Tracing and Metrics framework, which encompasses both traces and metrics. That framework was already launched as an experimental feature in the OpenSearch 2.11 release.

We currently leverage OpenTelemetry, an open-source and widely embraced telemetry solution, which provides a solid foundation for this endeavor. Moreover, we can use OpenSearch Dashboards and other observability tools like Prometheus and Grafana to construct a comprehensive dashboard for monitoring and analysis purposes.

cc: @reta

jzonthemtn commented 9 months ago

Thanks @shikeli, and thanks for the pointer on the getRestHandlerWrapper override. You are looking to capture the raw queries and their results? Would being able to export the captured metadata to a file format like Parquet work for your purposes?

jzonthemtn commented 9 months ago

Hi @Gaganjuneja, thanks for the links to the Tracing and Metrics RFCs. I am not super familiar with OpenTelemetry, so please excuse my ignorance on the subject; I appreciate your recommendation of it. The data we want to capture will include events generated client-side (clicks, scroll depth, etc.) tied to backend events (search queries, results for the queries, etc.). When I hear "telemetry" I think of metrics/traces/etc. to support instrumentation of a distributed application to have visibility into the application itself. How do you see our types of events fitting into OpenTelemetry's paradigm of metrics/traces/etc.? Also, the end users of our event reporting will likely be data scientists, search relevance engineers, and business analysts. Do you think Prometheus and Grafana would be suitable backends to allow those types of users to get the insights they need? Last question: we want the system to be extensible. If you think OpenTelemetry is a good choice, how would you feel about it being an option? For instance, event data could, by default, be stored in an OpenSearch index and viewed by an OpenSearch Dashboards plugin, but the user could have the option to switch to using an OpenTelemetry/Grafana/Prometheus backend. Your input is much appreciated.

shikeli commented 9 months ago

Thanks for the quick response. What do you mean by metadata? Is it the raw request and raw response? If you can export the raw request and response to a file, that should solve our problem.

jzonthemtn commented 9 months ago

@shikeli Yes, the export would be the search requests/responses along with the events generated client-side. Our goal is to capture the raw requests/responses, though I'm not yet entirely sure what technical impediments we might encounter (like your getRestHandlerWrapper problem).

reta commented 9 months ago

When I hear "telemetry" I think of metrics/traces/etc. to support instrumentation of a distributed application to have visibility into the application itself. How do you see our types of events fitting into OpenTelemetry's paradigm of metrics/traces/etc.?

Thanks @Gaganjuneja, I would agree with @epugh here: we should be thinking about telemetry as operational instrumentation, and user behaviour sits a few levels above that. To your point though, there could be cases where user behaviour is derived from user-focused metrics, if plugin / extension authors see the need to do it this way. It could be a good complementary channel.

deshsidd commented 9 months ago

Thanks @jzonthemtn for the proposal. This is very similar and has overlap with the query insights proposal and ongoing work.

Reference RFCs: https://github.com/opensearch-project/OpenSearch/issues/11008 https://github.com/opensearch-project/OpenSearch/issues/11186

Reference PRs and issues: Query Insights Plugin: https://github.com/opensearch-project/OpenSearch/pull/11903 TopN Queries: https://github.com/opensearch-project/OpenSearch/pull/11904 Search Query Categorization Issue: https://github.com/opensearch-project/OpenSearch/issues/11596

Please see the Query Insights section on the sprint board: https://github.com/orgs/opensearch-project/projects/153/views/8

We also aim to improve users' search experience and search performance. We have similar plans to those mentioned above: add instrumentation on the search path, create an analytics dashboard to visualize the metrics, connect users to the queries executed, etc.

Could we try to leverage the insights plugin for the above?

ansjcy commented 9 months ago

Are we mostly focusing on client-side logging in this RFC? We can also investigate how to combine client-side and server-side insights (the query insights initiatives that deshsidd mentioned above) and correlate the information to get more insights, which would be super cool.

smacrakis commented 9 months ago

This proposal is about tracking user behavior whether it results in a call to the OpenSearch back end or not. It is about understanding search quality (relevance), not about understanding the performance characteristics of the search server.

Even when it does result in a call to the OpenSearch back end, it may or may not be the same query. For example, when the user searches for [red dress], the application may rewrite that (query understanding) as [red dress] + 0.9*taxonomy:dress + 0.9*color:red before sending it to OpenSearch for processing (or it might do that in the Search Pipeline).

But most user actions that help us evaluate search quality do not include a call to the OpenSearch back end. For example, clicking on result 3 does not call the back end. Putting result 5 in the shopping basket does not call the back end. etc.

This client-side behavior often needs to be correlated (joined) with the server-side behavior, for example to capture any processing done by the application, or for performance analysis. But the two are different. For example, the server-side Query Insights is interested in the Top-n slowest queries because they may reflect a performance issue, whereas the client-side User Behavior Logging is interested in the Top-n most common queries, because they help us understand what users are doing. In that particular case, it is possible to collect the data on either the client or the server side (modulo query rewriting), but other cases, such as the Top-n queries where the user selects none of the results, require client-side information.

Which mechanism is right for capturing user behavior is another question. Should User Behavior Logging use OpenTelemetry? That is certainly one possibility.

jzonthemtn commented 9 months ago

Hi @deshsidd, thanks for those links. We're definitely in favor of using existing things where possible so we will take a look and see what overlap exists there.

ashwin-pc commented 8 months ago

@jzonthemtn Thanks for creating this. OpenSearch Dashboards does have a built-in usageCollector, which we disabled during the fork, that does exactly this. It has a lot of the tooling and features you are discussing here and might solve this problem immediately. OpenSearch could also build in something similar that it and its plugins can use.

smacrakis commented 8 months ago

@ashwin-pc Tell us more about usageCollector! Where can we find documentation on it? What is the schema of data it collects? Does it have client-side (Javascript etc.) components to collect search results and actions on them?

ashwin-pc commented 8 months ago

@smacrakis So I've just started looking into this, since OSD is looking to solve the same problem. Essentially, we have 5 core plugins that do various things related to telemetry and usage collection in OSD. You can find the existing documentation for each of these here:

They each have a README outlining their purpose, but I have yet to deep dive into what they do and how they work. I do know that we didn't remove any of this tooling after the fork and only commented out the section that reports this information to a telemetry endpoint.

ansjcy commented 8 months ago

Interesting proposal! I just went through it and have several questions and comments:

We lean on OpenSearch’s ability to log and analyze data

When you say "analyze" the client-side data, does that mean we want to build user behavior analysis capabilities within OpenSearch? In other words, will we be building any analysis algorithms, or is that the responsibility of the end users (as you mentioned, "data scientists, search relevance engineers, and business analysts")?

store the behavior metrics in OpenSearch indices

Have we evaluated other alternatives? I'm a little worried about the potential storage impact. I think this also depends on the answer to the previous question: do we need to somehow use this user behavior data within OpenSearch? If not, we can provide options to export to different sinks (and an OpenSearch index would be one of them).

But from the perspective of "providing overall better performance insights", I would really love to see this data available within OpenSearch. As I mentioned before, we can invest in generating insights and recommendations by combining user behavior data and server-side query insights data (if OpenSearch is also used as the search backend). One use case (my wild thought!): knowing what "type" of user someone is, we could optimize search performance by rewriting the search queries based on different user types.

link client-side actions with backend search actions

This might be an implementation question: how do we link the client-side and server-side actions? I'm not sure it would be an easy task, as @smacrakis mentioned in his comment:

Even when it does result in a call to the OpenSearch back end, it may or may not be the same query. For example, when the user searches for [red dress], the application may rewrite that (query understanding) as [red dress] + 0.9*taxonomy:dress + 0.9*color:red before sending it to OpenSearch for processing (or it might do that in the Search Pipeline).

smacrakis commented 8 months ago

@ansjcy Thanks for your interest and for your questions.

Although the initial implementation uses OpenSearch as its back end,

We plan to “program to the interface” to permit future extensibility. For instance, we plan to store event data in OpenSearch, but do not want to restrict someone from creating the ability to use a relational database as the backend instead.

In particular, there is no requirement that the same index be used to store the behavioral logs as is used to provide search results, so that the analytics workload won't affect search latency.

As for analysis, our plan is to provide analytics tools in OpenSearch Dashboards. We also expect that the community will supply its own tools running on Dashboards or perhaps elsewhere.

Closing the feedback loop to search results is certainly an important goal. We expect that we'll be able to provide near-real-time access to the results so that search results can be adjusted in-session. As always, the devil is in the details....

ansjcy commented 8 months ago

Thanks for your response!

our plan is to provide analytics tools in OpenSearch Dashboards.

I would still argue that the query insights plugin would be a good place to hold those analysis tools! We have built the top n queries feature in this plugin and will start on the dashboard component (https://github.com/opensearch-project/OpenSearch-Dashboards/issues/5571) to expose this information. If we have the user behavior data stored in an index, it would be straightforward to implement processors for analytics within the query insights plugin and build the analytics UI in a similar way.

In this way we can easily combine the client side and server side insights to achieve more, for both performance and analytics purposes.

ashwin-pc commented 8 months ago

Is the goal of this to log all user behaviors, or just those specific to OpenSearch calls such as search? For example, seeing whether a user on my website has visited a particular page or used a particular feature. If yes, then can OpenSearch Dashboards itself use this framework to track its users for similar behaviours?

epugh commented 8 months ago

You are touching on one of the key points of discussion which is how opinionated (structured?) should we be about what is recorded. The more structured the format of the events/actions/data we capture, then the easier it is to provide valuable out of the box insights via the dashboards, but the more limiting the use cases. If we open up the format to being able to accept a VERY broad set of attributes, then that lets the builder do more amazing things, but at the cost of less structure in our data, harder onboarding process, and fewer "out of the box insights" that can be provided.

dylan-tong-aws commented 8 months ago

@ashwin-pc Tell us more about usageCollector! Where can we find documentation on it? What is the schema of data it collects? Does it have client-side (Javascript etc.) components to collect search results and actions on them?

+1. I would like to know exactly what data we plan to capture in the first release, to validate that we have what is necessary for tuning ML models.

reta commented 6 months ago

I think this feature should not be part of core but 100% a plugin and/or extension (this is opt-in functionality and not a core one). Plugins / extensions already have the mechanism to enrich the search request / response (with the ext section), and with extensions there is an option to go off-process / off-node.

smacrakis commented 6 months ago

@reta After discussion, the implementation team has come to the same conclusion, and we are removing most functionality from core. The only part remaining is logging queries and responses, which of course will be under user control. As for the ext section, I suppose we could put the client query ID (whatever we call it) in an ext section in the query (although currently only the response has an ext section).

lukas-vlcek commented 6 months ago

Based on the update that @epugh presented today on the community call, I would like to point out that it should be possible to store all the UBI data outside the "production" cluster. Actually, storing this data in the same cluster should be possible only for an easy "try-out" scenario and should not be considered for any real use case, IMO.

Not only will managing indices for UBI take resources (and that might be hard to control), but legislation may also require storing, backing up, and treating this data in a very specific way. (I get that the data is anonymous, but it can still contain very sensitive information.)

reta commented 6 months ago

in an ext section in the query (although currently only the response has an ext section).

@smacrakis Not only responses; search requests have an ext section as well.

smacrakis commented 6 months ago

@reta Interesting -- the 2.13 doc for _search only says "plugin authors can add an ext object to the search response", but the doc for the Rerank processor includes an ext on search. Looks like a documentation bug.
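As a rough illustration of the idea discussed above, the sketch below builds a search request body that carries a client query ID in an `ext` section. The `ubi` key, the `query_id` name, and the `title` field are assumptions made for this example, not a confirmed API; a cluster would likely only accept such a key if a plugin registers it.

```python
import json
import uuid
from typing import Any, Dict, Optional


def build_search_request(user_query: str,
                         client_query_id: Optional[str] = None) -> Dict[str, Any]:
    """Build a search body with a client query ID in a hypothetical ext section."""
    # Generate an ID if the client did not supply one, so that later
    # client-side events can be linked back to this query.
    query_id = client_query_id or str(uuid.uuid4())
    return {
        "query": {"match": {"title": user_query}},  # "title" is a placeholder field
        "ext": {"ubi": {"query_id": query_id}},     # hypothetical ext key and name
    }


request_body = build_search_request("laptop", client_query_id="q-123")
print(json.dumps(request_body, indent=2))
```

The client would then attach the same ID to any events it records (clicks, scrolls, detail views), which is what allows queries and client-side behavior to be joined later.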

jzonthemtn commented 5 months ago

Regarding the plugin/non-plugin conversation, UBI development will proceed as an external plugin in its own repository and not as a module/plugin inside the opensearch-project/OpenSearch repository. Thanks to everyone involved in that conversation.

With that direction now known, I would like to see about closing this RFC. Everyone involved with UBI is still very much open to the community's thoughts (and contributions :), but the upcoming UBI plugin repository might be a better location for those conversations. I'm not familiar with the process to close an RFC so please let me know if there are any objections to doing so.

jzonthemtn commented 2 months ago

Closing this RFC because the initial implementation is now available at https://github.com/opensearch-project/user-behavior-insights and new issues can be created there.

epugh commented 2 months ago

Interested in this topic? Learn more at https://github.com/opensearch-project/user-behavior-insights and https://opensearch.org/docs/latest/search-plugins/ubi/index/.

opensearch-project / OpenSearch

[RFC] User Behavior Insights #12084
