kaspanet / kasparov

Kasparov is an API server for Kaspa written in Go (golang).

ISC License

5 stars 4 forks source link

Time-range-driven block access API #48

Open aspect opened 4 years ago

aspect commented 4 years ago

As per our recent discussions with @ey51 I would like to file a feature request for a Kasparov API call that would offer /blocks-like API accepting timestamp ranges as input. Timestamps can be indexed in the database, making these lookups very fast. I realize timestamps are not 100% reliable, but right now I don't see any better alternative available.

Timestamp-range query API could use similar API parameters as /blocks where it could use skip and limit, measured in seconds.

Current /blocks API returns a maximum of 100 blocks. Unsure why this restriction is there, but if this is done to prevent Kasparov from stalling on large queries, then the time-driven API can return a JSON object containing an array of blocks as well as the timestamp up to which the data has been delivered. We can then use this timestamp as the starting point of the next query.

Purpose:

The purpose of this API is to be able to get a segment of the chain, primarily for visualization purposes. We would like to be able to see what the chain looks like within a specific time segment.

Challenges:

Due to the way branches are organized, using /blocks does not really allow us to get a reliable snapshot due to a number of issues (there is no real relation between sequential block numbers and timestamps). Time-driven API does address most of the issues, but one. Consider a scenario where the time range intersects with the branch, but blocks on this branch are present outside of this time range as follows:

In this condition, for the time range t0 to t1, it is possible to identify blocks e,g,h as they are within the time range, but it is not possible to identify blocks b and c if the query is done in a simple linear t >= t0 && t < t1 manner. In order to obtain the full picture, the query should produce not only blocks within the range, but also branches that intersect this time range.

I am not quite sure of the best way to handle this in terms of database queries as doing this type of processing could be very costly. Having an additional auxiliary field such as acceptedBlockTimestamp would simplify the process and allow to calculate time intersections much easier since a single record (row) will contain both current and next block timestamps making it possible to filter such cases by block.timestamp < t0 && block.acceptedBlockTimestamp >= t1.

Time intersection processing would allow us to broaden the search scope, allowing us to collect blokcks d,f and i as well. (inclusion of blocks outside of the time range that are connected to blocks inside of the time range could be enabled via an additional API argument).

ey51 commented 4 years ago

Proposed new REST methods

Following our conversation, we'll have a discussion on these methods to see which ones make the most sense to do. If you have any inputs, let me know.

GET /blocks/{hashes}

Input: a list of block hashes.
Output: a list of blocks with the given input hashes.
Purpose: to get several blocks at once, instead of sequential single block get calls. For instance, when needing all parents of a known block, or for getting a bunch of unknown blocks that are parents of known blocks.

GET /blocks/common_ancestor/{hashes}

Input: a list of block hashes
Output: a list of blocks that are ancestors of the given input blocks, until they have one common ancestor.
Purpose: let’s say an explorer client received a subsection of the blockDAG, that forms two or more disconnected blockDAGs (blocks 3..9 within t_1 → t_2). The purpose of this call would be to give as input the hashes of blocks 3,4,5,6 and to traverse the blockDAG through their parents, until they are all connected through one or more common ancestor blocks. In this case, it is blocks 1 and 2 and the call would return blocks 1,2.

Kaspa Glossary

GET /blocks/common_descendant/{hashes}

Input: a list of block hashes
Output: a list of blocks that are descendants of the given input blocks, until they have one common descendant.
Purpose: similarly to the common_ancestors method above, let’s say an explorer client received a subsection of the blockDAG, that forms two or more disconnected blockDAGs (blocks 3..9 within t_1 → t_2). The purpose of this call would be to give as input the hashes of blocks 7,8,9 and to traverse the blockDAG through their children, until 7,8 and 9 have a common descendant block. In this case, it is block B and the call would return B.

GET /blocks/{timestamp_range}

Input: a time range, e.g. a start time and/or an end time.
Output: a list of blocks that have a timestamp >= start time and/or <= end time.
Purpose: to get a subsection of the blockDAG with blocks reported to have been created between two timestamps.
- Disclaimer: It is understood that not all the blocks returned by such a method must be necessarily connected as a single graph (as in the illustration above). The missing blocks can be obtained by getting them by their common_ancestor.
- Moreover, there might be some missing blocks referenced by other blocks within the graph, e.g.: block 8 although structurally within the graph t_1 → t_2, might have a timestamp outside of it. The missing blocks can be obtained by getting them by their hashes.

GET /blocks/{database_id_range}

Input: a database block id range, e.g. a start block id and/or an end block id.
Output: a list of blocks that have a timestamp >= start time and/or <= end time.
Purpose: to get a subsection of the blockDAG with blocks ids within a given database block id.
- Disclaimer: It is understood that not all the blocks returned by such a method must be necessarily connected as a single graph (as in the illustration above). The missing blocks can be obtained by getting them by their common_ancestor.
- Moreover, there might be some missing blocks referenced by other blocks within the graph, e.g.: block 8 although structurally within the graph t_1 → t_2, might have a timestamp outside of it. The missing blocks can be obtained by getting them by their hashes.

ey51 commented 4 years ago

Post discussion with the other devs, I have the following feedback:

Getting blocks by common ancestor and descendant

There are a few reasons not to go for these:

It is heavy on the server side because it requires some sort of recursive traverse.
It could potentially lead to returning a very large subset of the blockDAG and could be an easy DoS attack vector.
It might give a misleading representation of the blockDAG, because of not displaying blocks in anti-cones of the returned common ancestor subDAG.
For connecting disjoint subDAGs, simply getting the previous/next bunch of blocks (complete cross section of the DAG) is easier and covers the basic use case of panning the visualization on the time axis.

Getting blocks by database id

This should be avoided for abstraction purposes. If the underlying database is changed, it shouldn't matter for the clients using the API.
Getting by time range covers roughly the same uses.

Getting blocks by hashes

We could add this, however there are a couple of points to mind:

There is max request length, that needs to be minded, and it varies between browsers and servers. Doing a quick lookup I found that to it's suggested to stay under 2000 characters to be on the safe side. @aspect Could you examine this further and advise?
The URI also counts in. The number of blocks that might be needed at the same time is roughly dependent on PHANTOM's k parameter. We are talking about k=10 now. To be on the safe side, we could say that getting max 20 blocks at once (20 blocks * 64 chars per block hash = 1,280 leaving much room for all the overhead).
The trade-off with making multiple single block hash requests is not far. You could make several simultaneous one block requests. Could you advise how badly you would need a call that gets up to 20 blocks by hash at once, instead of making roughly 10 single block by hash calls simultaneously? @aspect

Getting blocks by time range

We will add this to the backlog. How close are you to needing this?

ey51 commented 4 years ago

Challenges:

Due to the way branches are organized, using /blocks does not really allow us to get a reliable snapshot due to a number of issues (there is no real relation between sequential block numbers and timestamps). Time-driven API does address most of the issues, but one. Consider a scenario where the time range intersects with the branch, but blocks on this branch are present outside of this time range as follows:

In this condition, for the time range t0 to t1, it is possible to identify blocks e,g,h as they are within the time range, but it is not possible to identify blocks b and c if the query is done in a simple linear t >= t0 && t < t1 manner. In order to obtain the full picture, the query should produce not only blocks within the range, but also branches that intersect this time range.

I am not quite sure of the best way to handle this in terms of database queries as doing this type of processing could be very costly. Having an additional auxiliary field such as acceptedBlockTimestamp would simplify the process and allow to calculate time intersections much easier since a single record (row) will contain both current and next block timestamps making it possible to filter such cases by block.timestamp < t0 && block.acceptedBlockTimestamp >= t1.

Time intersection processing would allow us to broaden the search scope, allowing us to collect blokcks d,f and i as well. (inclusion of blocks outside of the time range that are connected to blocks inside of the time range could be enabled via an additional API argument).

As you mentioned, it is difficult to get those blocks outside the time range without some complex querying or without adding more fields to blocks in the DB.

My opinion is that it's not really necessary at this point, and in some cases it is even not wanted because it can give off a misleading representation of the topology of the DAG.

To get block d you can get it by sequential get block calls with parentBlockHash(es) of e.

To get f and i we can later add a filter to get blocks query by parentBlockHashcontains=.

In my opinion, after discussing it a few times, it is unwanted for the reason mentioned above (misleading representation of the DAG) because you only show extensions on the slice of the DAG you have in some of it's breadth, while not necessarily through all of its breadth.

A better way to show what's beyond the cross section of the DAG currently displayed will be to get the next adjacent cross section (on either end).

ey51 commented 4 years ago

I want to add on what I wrote earlier. What you'de first want to do is sign up for mqtt notifications on new blocks and on dag/selected-tip. Then you fetch a corss section of the dag using get blocks, e.g. latest X blocks or using some time filters. You might want to do sequential requests due to max page size.

The database inserts blocks by sequential id and it also orders them in the get blocks query by sequential id.

If you are getting blocks between t1 and t2 and say there are 500 blocks that answer that criteria, you would be able to get 100 blocks per request (assuming the page limit is 100). If you order them ascending, you will get the first 100, then you could either do another exact same call with skip 100, to get the next page, until you get to 500, or you could make the next call use the timestamp of the latest block you got in the current call as the startDate.

In each call, you will get a total that counts all the blocks that answer the filter criteria (startdate, enddate [and other criteria we may add according to request], but not skip).

If you get a new bigger total with the same time range, you know that there are new blocks in the time range. The new block will be at the end when you get by ascending order, because of how the database sorts them by db id sequentially.

If you get the blocks in descending order, and you use paging with skip, newly added blocks will be in the beginning, and you are going to skip them and then get an overlap of blocks in the next call - just so you are aware of this.

If you are getting mqtt notification of new blocks though, you will have gotten the new blocks that way.

ey51 commented 4 years ago

@aspect consulting with you on this.

Goal

Add timestamp based fetching of blocks.

Current State

GET /blocks does not receive filtering parameters .
The total in the response returns the number of blocks since Genesis, allowing to request blocks with pagination.

Proposal

GET /blocks request

Add two new optional parameters to GET /blocks request:

StartDate
EndDate

They accept Unix time.

Response

The response will include an array of blocks that answer the given criteria:

block.timestamp >= startDate
block.timestamp < endDate

The field total, included in the the response, will now return all blocks that answer the filter, so that total = number of blocks that have:

block.timestamp >= startDate
block.timestamp < endDate

Jira Issue

NOD-803

aspect commented 4 years ago

This API needs to provide the ability to fetch the same range in multiple requests if the range selection offers more than 100 records/blocks. This proposal lacks this ability in that it provides the total but does not have a way to use this total obtained to query the remainder of the data.

i.e. if it supplies a total in the response the query arguments should support skip.

When integrating DAViz I have tried but was unable to use Kasparov API directly as DAGViz needs to have a variety of auxiliary data, such as a delivery of parent and child blocks at the same time. (So DAGViz internal data structures differ as it pre-processes the data as it fetches it from Kasparov;).

The API I ended up designing as a DAGViz backend is reminiscent of the above proposal, however, I have also added a unit argument, that allows me to specify timestamp or blueScore or lseq (which is a special thing in DAGViz referring to a linear database sequence i.e. database id).

Subsequently, the DAGViz API has from and to (as this can apply to different units).

ey51 commented 4 years ago

The time range fields are new fields to the existing GET /blocks request, that already has the usual order, skip and limit optional parameters. So you could GET /blocks?startDate=123&endDate=456&order=asc&skip=100&limit=100. If there are 500 blocks between startDate=123&endDate=456 then you would get total=500, and then could make the same call with skip 100 to get the next page. Alternatively, you could use the latest timestamp of the blocks in the response, and use that as the startDate of your next call.

We could add childBlockHashes to blocks in Kasparov, if you need this.

ey51 commented 4 years ago

When integrating DAViz I have tried but was unable to use Kasparov API directly as DAGViz needs to have a variety of auxiliary data, such as a delivery of parent and child blocks at the same time. (So DAGViz internal data structures differ as it pre-processes the data as it fetches it from Kasparov;).

@aspect can you elaborate where do you need the child blocks of a block. We want to know how necessary it is to add this to Kasparov, because this requires also updating all child blocks with each new block coming from mqtt /blocks

Perhaps you need a havingParentBlockHashes=list of block hashes filter on the GET /blocks instead, or a separate method similar to GET /blocks to GET childBlocks that does what the filter above does?

Also, still waiting for your response on the GET /blocks with the time filters. Is the proposal and clarification satisfying?

aspect commented 4 years ago

I will write up a detailed description later today depicting issues I am seeing.

The reason I need child blocks is strictly DAGViz-specific and I don't think this would be needed anywhere else: When you pan left and a block appears, you need to rebuild links to parents. Luckily you have parentBlockHashes and you can do that; However, if you pan right, now you have a parent that appears, but you don't know who his children are. It is computationally expensive, for each new block (as a result of scroll/pan/navigation), to re-scan the entire in-memory dataset to identify which blocks are children, so I need to pre-structure this.

Not that it is super computationally expensive, but due to the nature of browser rendering and single-threaded javascript model, these types of calculations can introduce stuttering in the UX. There are other ways around this, but pre-computing child hashes is the most effective.

In other words - this is the information we already have, but we need it pre-computed before it arrives in the browser.

ey51 commented 4 years ago

What about pre-fetching adjacent blocks' children, so to build a small buffer before the user pans in the direction of the future?

In light of your answer, is there still a need for the time-based GET /blocks method?

On Sun, Mar 8, 2020 at 7:33 PM aspect notifications@github.com wrote:

I will write up a detailed description later today depicting issues I am seeing.

The reason I need child blocks is strictly DAGViz-specific and I don't think this would be needed anywhere else: When you pan left and a block appears, you need to rebuild links to parents. Luckily you have parentBlockHashes and you can do that; However, if you pan right, now you have a parent that appears, but you don't know who his children are. It is computationally expensive, for each new block (as a result of scroll/pan/navigation), to re-scan the entire in-memory dataset to identify which blocks are children, so I need to pre-structure this.

Not that it is super computationally expensive, but due to the nature of browser rendering and single-threaded javascript model, these types of calculations can introduce stuttering in the UX. There are other ways around this, but pre-computing child hashes is the most effective.

In other words - this is the information we already have, but we need it pre-computed before it arrives in the browser.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kaspanet/kasparov/issues/48?email_source=notifications&email_token=AJMBM33XV5DHTQPONRNDWWTRGPQHZA5CNFSM4KXHIM42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOE4CJA#issuecomment-596230436, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJMBM3YGTBAHU4JQ5GY35C3RGPQHZANCNFSM4KXHIM4Q .

-- "Not everything that counts can be counted, and not everything that can be counted counts." (Albert Einstein)