status-im / status-protocol-go

Status Protocol implementation in Go
Mozilla Public License 2.0
0 stars 1 forks source link

Move messages from status-react #42

Closed adambabik closed 5 years ago

adambabik commented 5 years ago

Moving messages from status-react to status-protocol-go.

There is a new struct in the root dir called Message which encapsulates all fields defined for a message in status-react.

Todo

Closes #5

cammellos commented 5 years ago

TLDR; I want to avoid creating a cursor column

This is a well known problem, there's quite a few things you might want to have a look at https://jsonapi.org/profiles/ethanresnick/cursor-pagination/ or https://www.sitepoint.com/paginating-real-time-data-cursor-based-pagination/ https://slack.engineering/evolving-api-pagination-at-slack-1c1f644f8e12 , as this is a well established pattern, and I think there's value in re-using established patterns.

We can avoid using a cursor (we don't in status-react for example, as realm does not allows < or > on strings), but it comes with either drawbacks or complications (in status-react we keep track of the message-ids and EXCLUDE them in the next query, for example).

Basically what we want is:

1) Pagination should always return a bounded number of results (i.e we can't use clock value alone, otherwise, we might either return too many results, if you include the clock value used as a paginator you will get more, if you exclude it you might skip results). Both of these can be potentially exploited by an attacker to selectively hide messages or dos the app.

2) We should strive to minimize empty pages, for example offset/limit (which is also inefficient as offset/limit increases), will lead to many empty pages (all duplicates) if you are on a chat for some time, as newer messages coming in will be already inserted on the client and the page will shift. (cursor has similar problems in our case as messages can be inserted as well in the history, but it's less prominent). Keeping count of the new messages in the client side is what the old status-react did (adjusting offset/limit as new messages were coming in), but that complicates client side logic and is very fiddly (you need to also understand on which side of the page it will land etc).

There's also to consider how we fetch newer messages, currently we don't go through the database in status-react, so that's easy, if that changes at some point, we might need to adjust the pagination technique, and having an insertion timestamp/rowid etc might be useful.

adambabik commented 5 years ago

@cammellos please check out a solution using rowid. This not solves the last problem you mentioned that is:

There's also to consider how we fetch newer messages, currently we don't go through the database in status-react, so that's easy, if that changes at some point, we might need to adjust the pagination technique, and having an insertion timestamp/rowid etc might be useful.

In such a case, I believe we would need a separate method which would use creation_at column and rowid similarly to MessageByChatID method.

cammellos commented 5 years ago

@adambabik looking at it, I have the feeling that this does not work, for example, say pagination count is 2:

a) rowID 4 ClockValue 10 b) rowID 3 ClockValue 9 c) rowiD 2 ClockValue 8 d) rowID 1 ClockValue 7

First page: will return a, b and pagination info: ClockValue: 8, rowID: 2

Second page will correctly return c,d

So at first seems to work (correct me if I misunderstood how it works).

What happens now if we fetch messages from the mailserver, we have:

a) rowID 4 ClockValue 10 b) rowID 3 ClockValue 9 c) rowiD 2 ClockValue 8 d) rowID 1 ClockValue 7 NEWMESSAEG) rowID 10, ClockValue 5

Which will be skipped as ROWID is greater

Also this is a completely ok scenario and should be handled as well:

a) rowID 20 ClockValue 10 b) rowID 50 ClockValue 9 c) rowiD 70 ClockValue 8 d) rowID 80 ClockValue 7 (fetched from mailserver/out-of-order)

Using the method you described, pagination will be:

RowID: 70 ClockValue: 8

and next query will be:

                clock_value <= 8 (this works) AND
                rowid <= 70 // this does not 

seems like the method you proposed assumes insertion order and clock value are somehow coupled (it would work if we discarded any out-of-order message, which we don't as well because of mailservers), or I am missing something?

adambabik commented 5 years ago

@cammellos yeah, that was not the best idea. Check out the new one with a query-time column which is similar to ${fixed-length-clock-value}-${message-id} but instead of message-id, row-id is used. There will be a single column to order by.

Drawback of this solution is that if something gets in between subsequent calls, it will be returned. Not sure if it's a deal-breaker.

cammellos commented 5 years ago

it works but honestly I don't see any benefit of not having as an extra column, much more established pattern.

Basically it's just a cursor created a query time, we can't index it so we will be sorting over a non-indexed field, which will have an impact on performance (not sure whether the query planner is clever enough to optimize this), and adds query time complexity, for avoiding to store an extra immutable column (likely negligible in terms of space). What is exactly your concern here?

Also I think you might not be filtering in the code (I can see that you only sort but don't use the provided cursor to get the next page).

cammellos commented 5 years ago

Ah didn't see cursorWhere, forget the last point

cammellos commented 5 years ago

there's this that we could use for the index https://www.sqlite.org/expridx.html I guess, I still think we should keep it simple though and just go for a cursor column

EDIT: does not seem to work for rowid , no such column: rowid, looks like you can't explicitly index it, it is always implicitly indexed when you create an index as it's the natural order. Also ROWID needs to be as well 0 padded, otherwise 2 and 12 will be out of order, as we sort lexiconographically (scrap that, actually not true). MessageID (fixed length) is probably a better choice, even if you don't want to store it on disk.

adambabik commented 5 years ago

Basically it's just a cursor created a query time, we can't index it so we will be sorting over a non-indexed field, which will have an impact on performance (not sure whether the query planner is clever enough to optimize this), and adds query time complexity, for avoiding to store an extra immutable column (likely negligible in terms of space). What is exactly your concern here?

This is just more clear solution because cursor handling is encapsulated in one method and does not leak anywhere else. However, if to your knowledge, adding a cursor column is the safest option here, let's do that.

I think it needs to scan the same amount of rows and building it should be fast, however, if we had a dedicated cursor column and create an index that sorting should happen on the index, right? And then it might be a huge difference.

EDIT: does not seem to work for rowid , no such column: rowid, looks like you can't explicitly index it Yeah, I tried creating an index on rowid and it's not possible.

I don't think rowid needs to be padded because it is always increasing and we sort descending so longer strings should be at the top anyway.


I will proceed with a cursor column.

cammellos commented 5 years ago

I think it needs to scan the same amount of rows and building it should be fast, however, if we had a dedicated cursor column and create an index that sorting should happen on the index, right? And then it might be a huge difference.

I have run some benchmarks, either solution have the same performance, (12/15 ms to fetch a 10 rows with 1 million entries in the db), as long as there's an index covering them (in this case we would use messageID as rowid can't be indexed it seems), without an index single column is faster, as it would have to compute that for each entry, but it's not the case, as we can create an index:

create index blah on test(substr('0000000000000000000000000000000000000000000000000000000000000000' || clock_value, -64, 64) || message_id, chat_id);

there seem to be no performance difference, so if you feel strongly about having it a query time is fine by me, as in both cases it's just hitting the index.

adambabik commented 5 years ago

@cammellos the last few things I am not sure:

Other than that, I think it's the whole PR is ready to be reviewed.

cammellos commented 5 years ago

@adambabik yes unseen messages retrieves all the messages, not sure exactly why, but we can keep it as it is for now, similarly the from method