Prevent multiple loading of items

bastei commented 10 years ago

The current behavior of syncing via API is the following:

1. API sync: load new item A, remember "lastModified"-timestamp
cahnge read or starred status of item A (on the same device or elsewhere)
1. API sync: the whole item A loads again, as it was modified after the remembered timestamp

The result is that every item is loaded everytime its read/starred status changes. As every item is set to "read" at anytime, minimum loading is twice. That's not good for mobile internet connections.

Here is my suggestion: Enable the client to say: "Hey server, give me your updates since yesterday, 3 pm. Oh and please don't send me the items older than the one with id 42 again, I already have them. Only give me the status-information of them."

The /items/updated route of the API gets another parameter "lastKnownId", which is the last item-id the client knows of. For every item that has a lower-equal id the server only sends the "unread", "starred" and "lastModified" parameters, since all the other parameter will never change in the life of an item.

Here is the example:

Items: id: 1, lastModified: 10000 id: 2, lastModified: 30000 id: 3, lastModified: 15000 id: 4, lastModified: 20000

API Request: items/updated?format=json&type=3&id=0&lastModified=12&lastKnownId=3

API Response:

{
    "items": [

    // first all the items with id <= lastKnownId and item.lastModified > request.lastModified
    // these items are already on the device, so only send fields that cpuld have changed
    {
        "id": 2,
        "unread": false,
        "starred": true,
        "lastModified": 30000
    }, {
        "id": 3,
        "unread": true,
        "starred": false,
        "lastModified": 15000
    },

    // let's send the new items with id > lastKnownId and item.lastModified > request.lastModified
    {
        "id": 4,
        "guid": "http://www.unique.url",
        "guidHash": "88b150",
        "url": "http://www.item.url",
        "title": "Item title",
        "author": "Item author",
        "pubDate": 19,
        "body": "This is the items text. It is so huge, you don't want to load it multiple times! :)",
        "enclosureMime": null,
        "enclosureLink": null,
        "feedId": 1,
        "unread": true,
        "starred": false,
        "lastModified": 20000
    }]
}

BernhardPosselt commented 10 years ago

For every item that has a lower-equal id the server only sends the "unread", "starred" and "lastModified" parameters, since all the other parameter will never change in the life of an item.

Thats true for now but does not have to be in the future.

And its true that there will always be a result if you make the request, because a timestamp in seconds is too vague to truly distinct between changed items. A solution would be to also save miliseconds, and use that as timestamp, then we can modify the db query to be > rather than >= timestamp. Clients should also handle this well and migration in general should not be a problem. see http://at2.php.net/microtime

The other more correct and harder solution would be to log an update for each change and use the id of the update to tell the client what to refetch. Chances are this is will be way slower though (2 requests)

Thats my suggestion for the problem :)

bastei commented 10 years ago

For every item that has a lower-equal id the server only sends the "unread", "starred" and "lastModified" parameters, since all the other parameter will never change in the life of an item.

Thats true for now but does not have to be in the future.

Ok, that's right. However I would separate the information of an item:

information that is retrieved from a feed or website
information that represents the interaction of the user

As the latter is the way smaller part but can change often it is OK to transfer it multiple times. The further doesn't change often (I can only imagine an update of the feed). I would give it a new id if the core information of the item changes.

I think in the end this change can be done with few effort but has a big outcome.

A solution would be to also save miliseconds, and use that as timestamp, then we can modify the db query to be > rather than >= timestamp.

Either I don't get your point or you didn't get mine. ;) I'll try to refine the examples.

log an update for each change

Again, I don't think that would be a solution for my problem. If you want to distinguish which client knows which items to only send the essential data, you would have to save one log for every client.

BernhardPosselt commented 10 years ago

Oh i see so your point is essentially to prevent redownloading because client updates also change the timestamp. I think that the last known id would add more to confusion for API users.

Common patterns for this is to let the user exclude fields like ?exclude=body but then again theres no way to tell if the body was changed.

All in all I'd say we'd need to ship diffs to get the size down to a minimum. Much like Git, the API user would get an ID of a change when he requests all items. Then subsequently he calls /items/updated/diff/3 for instance and all the changes from 3 to the current id will be returned as minimal JSON and also provide the diff id that was used to produce that diff.

But again this would be quite a bit of work to implement, you cant clean it without problems and the db will explode in size quite quickly (you cant delete diffs, so cleanup job would be impossible)

cc @fossxplorer

BernhardPosselt commented 10 years ago

cc @David-Development @phedlund @ikacikac

BernhardPosselt commented 10 years ago

Another idea i just got: What if we keep the diffs for a week (or configurable timespan). If the client requests a diff that doesnt exist, it gets a 404 and can call the /items/updated url like it used before (and also gets the current diff id).

Then for all clients that updated once a week the update size would be reduced to a pure minimum.

Update from a fresh install would be like this

GET /items?type=3 -> 200

{
  "items": [
    {
      "id": 3443,
      "guid": "http://grulja.wordpress.com/?p=76",
      "guidHash": "3059047a572cd9cd5d0bf645faffd077",
      "url": "http://grulja.wordpress.com/2013/04/29/plasma-nm-after-the-solid-sprint/",
      "title": "Plasma-nm after the solid sprint",
      "author": "Jan Grulich (grulja)",
      "pubDate": 1367270544,
      "body": "<p>At first I have to say...</p>",
      "enclosureMime": null,
      "enclosureLink": null,
      "feedId": 67,
      "unread": true,
      "starred": false,
      "lastModified": 1367273003
    }, // etc
  ],
  "diffId": 40
}

Update case 1:

GET /items/diff/40 -> 200


{
  "items": [
    {
      "id": 3443,
      "guid": "http://grulja.wordpress.com/?p=76",
      "guidHash": "3059047a572cd9cd5d0bf645faffd077",
      "url": "http://grulja.wordpress.com/2013/04/29/plasma-nm-after-the-solid-sprint/",
      "title": "Plasma-nm after the solid sprint",
      "author": "Jan Grulich (grulja)",
      "pubDate": 1367270544,
      "body": "<p>At first I have to say...</p>",
      "enclosureMime": null,
      "enclosureLink": null,
      "feedId": 67,
      "unread": true,
      "starred": false,
      "lastModified": 1367273003
    },
    {
      "id": 3443,
      "unread": true,
      "starred": false,
      "lastModified": 1367273003
    }, 
  ],
  "diffId": 45
}

Update case 2 (diff has been deleted): GET /items/diff/40 -> 404

{}

GET /items/updated?lastModified=1367273003

{
  "items": [
    {
      "id": 3443,
      "guid": "http://grulja.wordpress.com/?p=76",
      "guidHash": "3059047a572cd9cd5d0bf645faffd077",
      "url": "http://grulja.wordpress.com/2013/04/29/plasma-nm-after-the-solid-sprint/",
      "title": "Plasma-nm after the solid sprint",
      "author": "Jan Grulich (grulja)",
      "pubDate": 1367270544,
      "body": "<p>At first I have to say...</p>",
      "enclosureMime": null,
      "enclosureLink": null,
      "feedId": 67,
      "unread": true,
      "starred": false,
      "lastModified": 1367273003
    }, // etc
  ],
  "diffId": 45
}

This would also be backwards compatible btw

David-Development commented 10 years ago

I think it would be difficult to implement this function, even for the clients. Furthermore a good sync interval for the client would be more helpful. E.g. my ownCloud News Reader App for Android is able to trigger a sync every 24 hours (yes it's not perfect but I'm going to improve this feature soon). So for example my News App is downloading all my news including the images when I'm at home (at night). So on my way to the university it's possible to read the feeds without internet connection.

bastei commented 10 years ago

a good sync interval for the client would be more helpful

Sorry, that is maybe helpful for personally you. We are speaking about a major problem of data synchronization. Sending the same data multiple times is simply bad, especially since we have article enhancers that empower the user to get full sized articles.

you cant clean it without problems and the db will explode in size

@Raydiation What kind of diffs do you like to save? I think it is enough to save "field x of item y was changed (at time z)". The changed data itself will be saved in the item table anyway.? If you delete an item there is no problem to delete the related log records then!?

BernhardPosselt commented 10 years ago

news_items_diffs

| id | user | timestamp |

news_items_diffs_fields

| diff_id | field | item_id |

BernhardPosselt commented 8 years ago

Moving this to the API 2.0 ticket: https://github.com/owncloud/news/issues/862

For identifying similar posts we now compute a hash over all relevant fields (content, url, title, enclosures). The sync client will also receive the fingerprints and send them to the server when syncing. If the fingerprint of the item that should be marked read/starred matches the one in the database, we can detect that we only need to send the actual status change.

This is a more robust, faster and easier way to fix this issue :)

owncloud-archive / news

Prevent multiple loading of items #458