remp2020 / remp

REMP - Reader's engagement and monetization platform. Set of open-source tools for publishers to engage with their audience. Repository is public mirror of our internal private repo.
https://remp2020.com
MIT License

BEAM client too aggressive #16

Closed j-norwood-young closed 5 years ago

j-norwood-young commented 5 years ago

The BEAM client fires every 5 seconds, using about 1.3kb per request. If a page is open for an hour, that would use up about 8mb of data. If someone leaves a page open for 24 hours in South Africa, it would cost a day's wages.
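As a rough check of the figures above, the sketch below multiplies out requests per hour by bytes per request. The function name and its parameters are made up for illustration, 1 kB is taken as 1000 bytes, and the interval is assumed to stay fixed at 5 seconds.

```javascript
// Rough traffic estimate for a fixed reporting interval.
// Illustrative helper only, not part of the BEAM client.
function estimateTraffic(openSeconds, intervalSeconds, bytesPerRequest) {
  const requests = Math.floor(openSeconds / intervalSeconds);
  const bytes = requests * bytesPerRequest;
  return {
    requests,
    bytes,
    megabytes: bytes / 1e6,
    megabits: (bytes * 8) / 1e6,
  };
}

// One hour, one request every 5 seconds, 1300 bytes per request.
const perHour = estimateTraffic(3600, 5, 1300);
console.log(perHour);

// Same hour with a 10-second interval, for comparison.
const perHourAt10s = estimateTraffic(3600, 10, 1300);
console.log(perHourAt10s);
```

Under these assumptions an hour comes to 720 requests and 936,000 bytes, which is roughly 0.94 megabytes or 7.5 megabits. Doubling the interval halves both numbers.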

Possible solutions:

rootpd commented 5 years ago

Those requests are weirdly big. Considering the request that's currently being generated:

{"action":"timespent","timespent":{"seconds":30,"unload":false},"system":{"property_token":"1a8feb16-3e30-4f9b-bf74-20037ea8505a","time":"2019-07-23T11:55:38.399Z"},"user":{"id":"92363","browser_id":"a1a0cb38-4c7a-49f5-b1d6-3246c5f4ae73","subscriber":true,"url":"https://dennikn.sk/","referer":"","adblock":false,"window_height":1050,"window_width":1920,"cookies":true,"websockets":true,"source":{},"remp_session_id":"72c73e09-ed47-4259-b35f-4599d400ba41","remp_pageview_id":"dfa61927-d5c7-4084-bbf2-7a367fe30cf0"}}

It's 516 bytes uncompressed and developer tools report this as 171B sent over network (probably gzip compression). Would you share your requests so we can check why they're so big?

About why it's this way.

Designing things this way was a simplicity tradeoff: we didn't want Tracker to contain logic or maintain information about the data/pageviews being tracked. Both would be necessary if we wanted to use websockets or calculate time spent server-side. Tracker is supposed to be a dumb validator which just checks whether the data looks OK and passes it to Kafka. Any restart or load balancing would also cause issues in those scenarios.

Because of all of the above, the only possible solution here is to make the interval configurable.

Btw. internally it uses a logarithmic function which prolongs the interval the longer your page stays open. After an hour, the update is sent only once every 90 seconds. https://github.com/remp2020/remp/blob/master/Beam/resources/assets/js/remplib.js#L736

The implemented configuration will therefore change the initial interval, and the log function will stay in place to keep the interval growing over time.
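To illustrate the idea of a configurable initial interval combined with logarithmic growth, here is a minimal sketch. It is not the function from remplib.js: `nextInterval`, `countRequests`, and the `growth` constant are hypothetical and were picked only to show the shape of the behaviour, so the numbers will not match what the library actually sends.

```javascript
// Hypothetical log-based backoff: the interval starts at `initialInterval`
// seconds and grows with the logarithm of the time the page has been open.
// The growth constant is an arbitrary assumption, not taken from remplib.js.
function nextInterval(elapsedSeconds, initialInterval, growth = 13) {
  return Math.round(initialInterval + growth * Math.log(1 + elapsedSeconds / 60));
}

// Counts how many updates would be sent while the page stays open.
// Assumes initialInterval is a positive number, so the loop always advances.
function countRequests(openSeconds, initialInterval) {
  let elapsed = 0;
  let requests = 0;
  while (true) {
    elapsed += nextInterval(elapsed, initialInterval);
    if (elapsed > openSeconds) break;
    requests++;
  }
  return requests;
}

const fixedAt5s = Math.floor(3600 / 5); // no backoff at all
const backoffFrom5s = countRequests(3600, 5);
const backoffFrom20s = countRequests(3600, 20);
console.log({ fixedAt5s, backoffFrom5s, backoffFrom20s });
```

With any growth of this kind, an hour-long pageview sends far fewer updates than a fixed 5-second timer, and raising the initial interval reduces the count further, mostly by thinning out the first few minutes.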

j-norwood-young commented 5 years ago

Here's an example request payload:

{"article":{"id":"368362","author_id":"Marianne Merten","tags":[],"variants":{}},"action":"load","system":{"property_token":"5478a41d-bac1-4679-8a53-4201bb4294f8","time":"2019-07-23T12:15:08.298Z"},"user":{"id":"18510","browser_id":"c8ff2969-ce3f-447c-a579-9bb5e08cd720","url":"https://www.dailymaverick.co.za/article/2019-07-23-the-never-ending-story-of-eskom-bailouts-mboweni-introduces-special-bill-of-billions-more/","referer":"https://www.dailymaverick.co.za/","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0","adblock":false,"window_height":777,"window_width":1280,"cookies":true,"websockets":true,"source":{},"remp_session_id":"3e5db5a9-6bf6-4be2-8fa9-999be6067d8c","remp_pageview_id":"7bdf244c-012f-4ecf-ab38-63380064eef0"}}

That's 782 chars, according to wc.

Response header is 239B, so 1021B per request and response.

Firefox consistently reports around 1.3kb of what it calls "Traffic", which leaves about 400B mysteriously unaccounted for.

Chrome reports around 340B per request.

CURL says: "upload completely sent off: 781 out of 781 bytes" (weirdly a byte short from character count - but could be a changed second digit or something.)

Of course the ISP doesn't care about payload size - it just cares about total traffic, which includes DNS lookups, frame headers, checksum etc.

The technicalities don't really matter. The issue really is that some markets are much more sensitive to bandwidth usage than others, due to income/bandwidth cost inequalities. (In SA, a full day's wages for a domestic worker will not even buy you 200MB out-of-bundle data.)

A configurable interval will help alleviate this issue. I'd be happy if I could start at 10s interval instead of 5, giving up on granularity in favour of less impact on our users. In Europe, it's much less of an issue.

Glad to hear about the logarithmic function!

rootpd commented 5 years ago

My bad here, I was counting only payload and completely forgot the headers :). Anyway, I understand the point about the traffic limitations, we'll make the configuration happen.

rootpd commented 5 years ago

One more reason why timespent needs to be sent by the frontend (for anyone reading this in the future): the timespent timer is paused once the user switches to a different tab and resumed when she gets back. This behavior is only observable if the frontend JS library handles it; a server-side calculation wouldn't be able to account for it.
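The pause/resume behaviour described above can be sketched with the browser's Page Visibility API. This is a minimal illustration under assumptions, not the remplib.js implementation: the `VisibleTimeCounter` class and its method names are hypothetical, and the clock is injectable only so the logic can be exercised outside a browser.

```javascript
// Accumulates elapsed time only while the page is visible.
// Hypothetical helper; `now` returns a timestamp in milliseconds.
class VisibleTimeCounter {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.accumulatedMs = 0;
    this.startedAt = null; // null while paused
  }

  // Start (or restart) counting; does nothing if already running.
  resume() {
    if (this.startedAt === null) this.startedAt = this.now();
  }

  // Stop counting and bank the time since the last resume.
  pause() {
    if (this.startedAt !== null) {
      this.accumulatedMs += this.now() - this.startedAt;
      this.startedAt = null;
    }
  }

  // Whole seconds spent with the page visible so far.
  seconds() {
    const running = this.startedAt === null ? 0 : this.now() - this.startedAt;
    return Math.floor((this.accumulatedMs + running) / 1000);
  }
}

// Browser wiring; skipped when no `document` exists (e.g. in Node).
if (typeof document !== 'undefined') {
  const counter = new VisibleTimeCounter();
  if (document.visibilityState === 'visible') counter.resume();
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'visible') counter.resume();
    else counter.pause();
  });
}
```

A server that only sees request timestamps has no way to tell a hidden tab from a visible one, which is why the counting has to live in the page.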

rootpd commented 5 years ago

Hey. It's in master and will be in a tagged version soon. The JS snippet was changed from:

timeSpentEnabled: true // defaults to false

to:

timeSpent: {
    enabled: true, // defaults to false
    interval: 20 // defaults to 5
}

rootpd commented 5 years ago

https://github.com/remp2020/remp/commit/11137ff02d6a707406d98068a5a781fab5b155a1