matrix-org / matrix-python-sdk

Matrix Client-Server SDK for Python 2 and 3
Apache License 2.0
255 stars 119 forks source link

Fix missing update of start_token while backfill prev messages #284

Open mattthias opened 6 years ago

mattthias commented 6 years ago

Hello,

while writing a script that should parse the whole channel history i wondered why i always got the same messages when i called backfill_previous_messages(). Please see the example code which i run with a room that just contains 30 messages (message 1 ... message 30).

>>> from matrix_client.client import MatrixClient
>>> client = MatrixClient("https://synapse-server")
>>> token = client.login_with_password(username="mattthias", password="guesswhat")
>>> room = client.join_room("#matrixtest:synapse-server")
>>> print(room.events)
[]
>>> room.backfill_previous_messages()
>>> for event in room.events:
...   print(event['content'])
... 
{'body': 'Message 21', 'msgtype': 'm.text'}
{'body': 'Message 22', 'msgtype': 'm.text'}
{'body': 'Message 25', 'msgtype': 'm.text'}
{'body': 'Message 26', 'msgtype': 'm.text'}
{'body': 'Message 23', 'msgtype': 'm.text'}
{'body': 'Message 27', 'msgtype': 'm.text'}
{'body': 'Message 24', 'msgtype': 'm.text'}
{'body': 'Message 28', 'msgtype': 'm.text'}
{'body': 'Message 29', 'msgtype': 'm.text'}
{'body': 'Message 30', 'msgtype': 'm.text'}
>>> room.backfill_previous_messages()
>>> len(room.events)
20
>>> for event in room.events:
...     print(event['content'])
... 
{'body': 'Message 21', 'msgtype': 'm.text'}
{'body': 'Message 22', 'msgtype': 'm.text'}
{'body': 'Message 25', 'msgtype': 'm.text'}
{'body': 'Message 26', 'msgtype': 'm.text'}
{'body': 'Message 23', 'msgtype': 'm.text'}
{'body': 'Message 27', 'msgtype': 'm.text'}
{'body': 'Message 24', 'msgtype': 'm.text'}
{'body': 'Message 28', 'msgtype': 'm.text'}
{'body': 'Message 29', 'msgtype': 'm.text'}
{'body': 'Message 30', 'msgtype': 'm.text'}
{'body': 'Message 21', 'msgtype': 'm.text'}
{'body': 'Message 22', 'msgtype': 'm.text'}
{'body': 'Message 25', 'msgtype': 'm.text'}
{'body': 'Message 26', 'msgtype': 'm.text'}
{'body': 'Message 23', 'msgtype': 'm.text'}
{'body': 'Message 27', 'msgtype': 'm.text'}
{'body': 'Message 24', 'msgtype': 'm.text'}
{'body': 'Message 28', 'msgtype': 'm.text'}
{'body': 'Message 29', 'msgtype': 'm.text'}
{'body': 'Message 30', 'msgtype': 'm.text'}
>>> 

The function backfill_previous_messages() updates the room's event list (room.events) using the api function "client.api.get_room_messages". The latter function takes a start token as parameter to know from where in the history of events to start.

backfill_previous_messages() handes over self.prev_batch as start token but never updates it. So repeated calls to backfill_previous_messages() always return the same chunk of events from the room's event history.

This commit fixes this wrong behavior by updating self.prev_batch on every function call.

Signed-off-by: Matthias Schmitz matthias@sigxcpu.org

Zil0 commented 5 years ago

Thanks for looking at this! It already came up that this behavior needs fixing.

This is trickier than it seems though, since to me it looks like the update of prev_batch that is done here is not consistent with the one done in client.py. Before, prev_batch was the point before the last messages we received in the last sync. With this, prev_batch can be either that, or any previous token depending on how many times backfilling was done.

There is a very basic case were an issue arises in an obvious way:

I'm not 100% sure what is the proper fix. My intuition is that the prev_batch returned when backfilling shouldn't end up in the room state, but handed back to the user, and that a way to provide it to backfill_previous_messages should exist in order to retrieve even earlier messages. Possibly via an optional argument, which would default to the token obtained via sync if not present.

Feel free to try different approaches, I'm really interested in seeing this part of the SDK improved :)

remram44 commented 5 years ago

I agree that backfilling should be completely independent. It shouldn't affect the sync state, but it probably shouldn't be calling the listeners either, since otherwise the listeners get events out-of-order. Worse, there is no way to distinguish a new message from the ones you get from the backfill request.

Currently, I avoid using backfill_previous_messages entirely, and call get_room_messages manually so I don't mess anything up.