feat: Add an API to allow for querying events under a subscription with additional filtering

hyperledger / firefly

Hyperledger FireFly is the first open source Supernode: a complete stack for enterprises to build and scale secure Web3 applications. The FireFly API for digital assets, data flows, and blockchain transactions makes it radically faster to build production-ready apps on popular chains and protocols.

https://hyperledger.github.io/firefly

Apache License 2.0

508 stars 209 forks source link

feat: Add an API to allow for querying events under a subscription with additional filtering #1452

Closed SamMayWork closed 9 months ago

SamMayWork commented 10 months ago

ref: https://github.com/hyperledger/firefly/issues/1443

Adds a new fancy API: subscriptions/{subid}/events

Internally iterates through all of the events, finding those that match a subscription
Allows for normal query filtering against the filtered list of subscriptions

My current concern is that this query is an expensive operation because by necessity, it needs to be O(n) as we have to go through all of the events to find those that match the subscription type. Additionally, skip provided by the user does not help us here as we don't know how many of the subscribed events actually match our filter. If the provided filter does not match any events, we will iterate through all of the events in the DB.

Local benchmark for processing ~93,000 events in FireFly was ~30 seconds on this API with a subscription that matches no events (I.e. going through all of the events and running checks on them). An internal limit of <100 takes roughly 30-45 seconds to complete, higher figures tend more towards ~30 seconds. Increasing the internal limit past 200 does not substantially decrease the time required for the query.

codecov-commenter commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (fd542c0) 99.99% compared to head (45fc854) 99.99%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1452 +/- ## ======================================= Coverage 99.99% 99.99% ======================================= Files 321 322 +1 Lines 23175 23270 +95 ======================================= + Hits 23173 23268 +95 Misses 1 1 Partials 1 1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

SamMayWork commented 10 months ago

WRT O(n) performance:

Since there is no existing store of events that match subscriptions, when we get a request, we must index all of the events to find which of those match the subscription that has been provided
A user-provided skip/limit does not significantly reduce the amount of processing that we internally have to do since we don't know how many records would actually match the subscription filter in order to be able to do a skip
The most expensive possible query is a filter that matches no records, as this would require us indexing the entire DB

To my knowledge the only way around these problems would be:

Make a change to keep a record of events that match subscriptions, and then use this a source of truth for this API, however we then get into challenges around storing this information like ensuring that sequencing is done properly
Implement Regex DB semantics at a lower level (however we'd still need to index a lot of records)

SamMayWork commented 10 months ago

Spotted a problem during testing that might be a bit tricky to solve. Given there are likely to be X number of events on the chain where X is a high number, being restricted to skipping 1000 records (as is the default API behaviour) is super restrictive. There's an entirely valid use-case for skip 90,000 records and then return me 200 records matching my filter but this would not work as of right now.

I've pushed changes in this commit https://github.com/hyperledger/firefly/pull/1452/commits/52c191e9981473dfa73248386a1e033fd23d6399 which allows for a custom (and :sparkles: configurable :sparkles:) number of events to be skipped, buuuuuut this doesn't play nicely with the swagger generation, so the generated API documentation isn't updated to reflect the new value. This means there's a disconnect right now between what our backend supports and what we say it supports in the swagger.

Needs more thinking for a workaround...

EDIT: Commit https://github.com/hyperledger/firefly/pull/1452/commits/80ddde2d4e0528cda9c4566eba72f6723cdb5dd5 has a nifty but imperfect workaround...

Nifty: After the swagger is generated by the core libraries, we grab the params and overwrite the description on the skip field with our own description with the actually correct value for the maximum supported sip

Imperfect: If another API has the same issue at this API, they'll need to duplicate the code I've added to change the values on their API. What would be great would be if the values for skip/limit were configurable per route and then that got read in through to the Swagger get, but this is likely a larger item of work.

SamMayWork commented 9 months ago

In discussion w/ @nguyer, we've landed on some further changes for this PR given the fact that we're concerned about the usefulness of this API if we don't solve the N+1 events problem. We've agreed on the following changes:

a starting/end sequence ID can be provided which determines where we start/stop indexing from
a limit of matching records can be provided which will takes precedence over the end sequence ID

This means that assuming that you hit the limit of matching events before you hit the end of the sequence range, you can get the next event in the sequence by calling the API again and providing the starting sequence ID to be the sequence ID of the last record. The hope is that the DB should be way faster indexing by sequence ID so performance should be come significantly less of an issue.

These changes semantically change how we're using some of the values so in the new world this is what we have:

startsequence: Start ID to index from
endsequence: End ID to index from
skip: Unused (we'd expect changes to be made using startsequence)
limit Limit for total count of FILTERED records

EDIT: Will also need to make the new configuration options make sense with the changes listed above, probably going to remove both the new options and have a configurable limit on the difference between the start and end sequence IDs.

Changes on the way :rocket:

SamMayWork commented 9 months ago

Comment above about the SQL performance of this query syntax, but I think we're in a good place for a re-review now :rocket:

nguyer commented 9 months ago

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (fd542c0) 99.99% compared to head (a08f80b) 99.94%.

Files Patch % Lines internal/events/event_manager.go 50.00% 10 Missing ⚠️ Additional details and impacted files ☔ View full report in Codecov by Sentry. 📢 Have feedback on the report? Share it here.

Looks like codecov is just having trouble talking to GitHub. I confirmed by checking out this branch myself that it does not lower test coverage.

SamMayWork commented 9 months ago

I get how to use it once we have a sequence number as a reference and we can go higher or lower from there. I think most of the time users are going to be interested in the most recent n events though and don't know where to start from.

Some initial thoughts on this here:

The default behaviour could be that if no values are provided, we go to the most recent event and show the last X records
We could return another value in the response object which contains the most recent event ID

I think option 2 here is just bad since we're going to need to make another query to the DB to find out the current highest index and then tag this onto a request that the user might not want to make. I.e. If I was a user, and wanted the most recent event I'd need to make a request I don't care about first to get the index so I can make a request I care about - that doesn't make sense.

Option 1 is the way forward, and it's consistent with the /events API which this API is more or less the same as. With that in mind, I think this becomes the behaviour of the API with respect to defaults for start/end sequence IDs.

Start Sequence ID provided (SS)	End Sequence ID provided (ES)	Calc(startsequence)	Calc(endsequence)
Y	Y	SS	ES
N	Y	ES-1000 (*1)	ES
Y	N	SS	SS+1000 (*1)
N	N	Most recent record sequence ID - 1000	Most recent record sequence ID

So one little wrinkle though is making sure that the Swagger reflects the behaviour and that the user is informed that if they don't provide a value, they'll get the most recent events. The /events API doesn't have this problem since start/end sequence aren't explicitly defined.

(*1) - Default max range is 1000 records, through configuration this value could be changed

EDIT: I think this should be addressed now in https://github.com/hyperledger/firefly/pull/1452/commits/0395957853f0dd939c797a22042d542c6de99cba

SamMayWork commented 9 months ago

Right, so it looks like there's a flaky e2e at the moment which I've seen pass and fail on different commits on this branch, I'll dig into it here for a little bit, but I'd be surprised if the changes I have made would be causing this test to fail given the test is failing during multi-party set up, initial look suggests that one of the nodes is failing to register its org.

EDIT: Spent some time looking at this, and it looks like the tests are a bit flaky on main too, I think for now we should re-run the failing E2E on this PR, and it should work. Probably worth a separate item to look into the E2E failures.

Note: Could not get the E2E's running on my M1 mac, looking in the Dockerfile there's a lot of AMD64 stuff, could spend some cycles looking into this if required but not sure it's worth the time given we know this is a flaky test.