publiclab / plots2

a collaborative knowledge-exchange platform in Rails; we welcome first-time contributors! :balloon:
https://publiclab.org
GNU General Public License v3.0

Stats Download And Site Overload #5524

Open skilfullycurled opened 5 years ago

skilfullycurled commented 5 years ago

Hi, last night, after we had sorted out the date issue in #5490, I went to download the rest of the data. While I was downloading 2/26/13 - 07/01/16, all of the downloads went smoothly except the users download, which took a very long time (relative to other user downloads I've done) and broke the site. (For what it's worth for diagnosis, before that I had tried to download 1/10/2010 - 4/24/13.)

Anyway, after the site came back up, I decided to try a smaller date range, 4/26/13 - 1/1/14 (7 mo). Everything worked as expected, but I noticed something about the download. The file I downloaded a while back for 2.5 years (7/1/16 - ~4/2019) is 71MB, but the file for just those 6 mo. was 91MB! The JSON file might have been even larger. I didn't download it, but in my experience JSON files tend to be larger just by their nature.

Aside from figuring out the downloads issue, this has led me to wonder whether it would be better to simply have zipped files prepared by year for large downloads. If people want the whole archive, they download each year and then use the interface to download the rest of the current year. Zipping the 91MB CSV brought it down to 31MB.
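To make the size argument concrete, here is a minimal Ruby sketch of compressing an export before serving it. It uses gzip from Ruby's standard `zlib` library as a stand-in for the "zipped" file mentioned above, and the sample columns are hypothetical, since the real export's layout is not shown in this thread. The ratio it prints depends entirely on the data, so it says nothing about the 91MB to 31MB figure.

```ruby
require "zlib"

# Hypothetical sample rows standing in for a users export.
# The real export's columns are not shown in this thread.
header = "id,username,created_at\n"
rows = (1..5_000).map { |i| "#{i},user#{i},2013-05-01 00:00:00\n" }
csv_text = header + rows.join

# Zlib.gzip compresses a string in memory and returns the gzip bytes.
compressed = Zlib.gzip(csv_text)

ratio = compressed.bytesize.to_f / csv_text.bytesize
puts "raw: #{csv_text.bytesize} bytes, gzipped: #{compressed.bytesize} bytes"
puts format("compressed size is %.1f%% of the original", ratio * 100)
```

For a file too large to hold in memory, `Zlib::GzipWriter.open` can be used to write the archive to disk in pieces instead.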

(Forgot to mention #3498 which is where the larger conversation about the stats feature is taking place.)

skilfullycurled commented 5 years ago

Some other info: Basically, sometime between 4/26/13 and 1/1/14 we either became immensely popular, or we had a ridiculously large number of spam signups.

Some figures:

1/1/2013 (1356998400) - 4/24/2013 (1366847999): UID Range: 59296 - 59296, Users: 12020

4/25/2013 (1366848000) - 4/25/2013 (1366934399): UID Range: 59297 - 59626, Users: 330

4/26/2013 (1366934400) - 1/1/2014 (1388534400): UID Range: 59627 - 420114, Users: 360466

1/1/14 (1388534400) - 1/24/14 (1398297600): UID Range: 420115 - 422688, Total: 2572
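As a sanity check on the figures above, the following Ruby sketch converts each epoch timestamp to a calendar date and compares the width of each UID range with the reported user count. It assumes the timestamps are UTC, which the thread does not state. Under that assumption, the last end timestamp (1398297600) decodes to 2014-04-24 rather than the 1/24/14 label, and the first row's UID range covers a single ID while reporting 12020 users, so one side of each pair may be a typo. A UID span slightly larger than the user count is not necessarily an error, since IDs can have gaps.

```ruby
# Figures copied from the comment above:
# [start_epoch, end_epoch, first_uid, last_uid, reported_users]
FIGURES = [
  [1356998400, 1366847999, 59296, 59296, 12020],
  [1366848000, 1366934399, 59297, 59626, 330],
  [1366934400, 1388534400, 59627, 420114, 360466],
  [1388534400, 1398297600, 420115, 422688, 2572],
]

# Format an epoch timestamp as a date, ASSUMING it is in UTC.
def utc_date(epoch)
  Time.at(epoch).utc.strftime("%Y-%m-%d")
end

# Number of IDs in an inclusive UID range.
def uid_span(first_uid, last_uid)
  last_uid - first_uid + 1
end

FIGURES.each do |start_epoch, end_epoch, first_uid, last_uid, reported|
  puts "#{utc_date(start_epoch)} to #{utc_date(end_epoch)}: " \
       "UID span #{uid_span(first_uid, last_uid)}, reported users #{reported}"
end
```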

jywarren commented 5 years ago

Just on the performance/slowness portion, it could be useful to look at https://oss.skylight.io/app/applications/GZDPChmcfm1Q/recent/6h/endpoints and see if it lines up with your queries. Looping in @icarito too!

jywarren commented 5 years ago

Hmm, that could have been either in the final days of the Drupal site, or before/after some change in our login sequence!

skilfullycurled commented 5 years ago

Here's the time period.

Maybe this request (2.1 min) was for searching/aggregating the users so the website charts and figures could be updated, and then this request (7.4 min) was the CSV download?

skilfullycurled commented 5 years ago

Hi everyone, circling back on this since I'm doing some planning on some work I'd like to try to do this summer.

This doesn't replace the caching issue, but I thought one way to get around download overload would be to create pre-made CSV/JSON files for every six months. It's not like the data is going to change.
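One possible first step for discussion, sketched in plain Ruby: enumerate the six-month windows and give each pre-made file a predictable name. The helper names, the calendar-half boundaries (Jan-Jun, Jul-Dec), and the file naming scheme are all hypothetical; nothing here reflects existing plots2 code, and the actual export query would still need to be run once per window.

```ruby
require "date"

# Return [start_date, end_date] pairs covering first_year..last_year
# in calendar half-year windows (an assumed way to split "every six months").
def half_year_windows(first_year, last_year)
  (first_year..last_year).flat_map do |year|
    [
      [Date.new(year, 1, 1), Date.new(year, 6, 30)],
      [Date.new(year, 7, 1), Date.new(year, 12, 31)],
    ]
  end
end

# Hypothetical naming scheme for a pre-made archive file.
def archive_name(kind, window)
  start_date, end_date = window
  "#{kind}_#{start_date.strftime('%Y%m%d')}_#{end_date.strftime('%Y%m%d')}.csv.gz"
end

windows = half_year_windows(2013, 2014)
windows.each { |window| puts archive_name("users", window) }
```

Because past windows never change, each file would only need to be generated once; only the current, incomplete window would still go through the live interface.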

If people are into it, should I make a new issue or keep it here? I could use some discussion around implementation and how to break it down into steps.

skilfullycurled commented 5 years ago

Bringing in @icarito as comeuppance for (not entirely unfounded) accusations of stats misuse on the 27th of May, 2019. ; )

Kidding aside, I'm wondering about the idea of pre-packaged 6 mo. JSON/CSV downloads. This doesn't take care of the other problem I've raised in the discussion here: someone who just wants to view a large set of data. Even if the time period is reasonable, choosing one that happens to include an unusually large amount of data may still overload the site.

Side question: how are we to test solutions (even on unstable) that tend to break the site, without actually breaking the site?

jywarren commented 5 years ago

Re: testing, what are the drawbacks of testing on stable/unstable, even to the point of breaking those sites? Thanks!

skilfullycurled commented 5 years ago

As I am not the one who will have to restart the sites (cough, cough, @icarito eh-hem, sorry got something stuck in my throat) I don't know. Having said that, I wrote this issue when we had less information about rsessions and the spam discussion. So, testing may not be an issue once rsessions is removed. https://github.com/publiclab/plots2/issues/5817#issuecomment-502913354

We'll have an opportunity to find out, since we (@cesswairimu and I) weren't sure whether it was just the date issue, or the large-user issue as well, that was giving her trouble with the "all time" query in #5904. I'm going to be gone for the next two days, but I'm adding @cesswairimu to #5817, and as soon as @icarito is finished she can give it a try...?

If there's still a problem then we can be more aggressive on planning the removal of spam users from the chunk of ~350,000 and see how that helps with the overload issue.