nightsprout / iom

Batch Download Activity Data #44

Closed. EStrunz closed this issue 9 years ago.

EStrunz commented 9 years ago

A few months ago, we encountered some weirdness with batch downloads of activity data (e.g., in CSV, XLS, and KML formats). Batch downloads were slow and seemed to be crashing the system. As a quick fix, we restricted downloads to registered users and disabled downloads of the entire global list of activities.

Unfortunately, this process still isn't working smoothly. I tried downloading the activity data for one of the larger groups of activities and hit an error. Plus, in the future, it will be important to allow registered users to download the entire activity list.

To reproduce the error I just mentioned:

  1. Log in as a user to the Partners Map
  2. Navigate to the Neglected Tropical Diseases sector (http://www.partnersmap.org/sectors/1)
  3. Download the list (I tried XLS: http://www.partnersmap.org/sectors/1.xls)

Let me know what you think!

dtpowl commented 9 years ago

Currently, the software is attempting to generate these files on demand for each request. We should be able to improve file download performance dramatically by periodically generating the files in a background process and caching them.
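For illustration, here's a rough sketch of that kind of caching worker. Python is used purely for illustration (this is not the project's actual code), and fetch_activities(), CACHE_DIR, and the column names are all hypothetical placeholders:

```python
# Minimal sketch of the precaching idea; every name here is a placeholder.
import csv
import os
import time

CACHE_DIR = "/tmp/activity_exports"  # hypothetical cache location


def fetch_activities(sector_id):
    """Placeholder for the query that loads a sector's activities."""
    raise NotImplementedError


def write_csv_export(sector_id):
    """Generate the CSV export for one sector and store it in the cache."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, "sector_%d.csv" % sector_id)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "organization"])  # example columns only
        for activity in fetch_activities(sector_id):
            writer.writerow([activity["id"], activity["name"], activity["org"]])
    return path


def run_worker(sector_ids, interval_seconds=60 * 60):
    """Background loop: regenerate every cached export on a fixed interval.

    The web app then serves the cached files directly, so download requests
    never trigger expensive on-demand generation.
    """
    while True:
        for sector_id in sector_ids:
            write_csv_export(sector_id)
        time.sleep(interval_seconds)
```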

vitchell commented 9 years ago

I was thinking of something maybe more complex, but I'm not sure. Instead of precaching, just use the links to kick off a job that emails a generated file. In order to see and click a link, someone has to be an authenticated user, so we have their address. Roughly what I have in mind is sketched below.
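A sketch only, with hypothetical helper names; the real app would use its own job queue and mailer:

```python
# Sketch of the "generate and email" idea. generate_export() and the SMTP
# settings are hypothetical stand-ins for whatever the app actually uses.
import smtplib
from email.message import EmailMessage


def generate_export(sector_id):
    """Placeholder: build the export file and return (bytes, filename)."""
    raise NotImplementedError


def email_export_job(sector_id, user_email):
    """Background job: build the file, then email it to the requester."""
    data, filename = generate_export(sector_id)
    msg = EmailMessage()
    msg["Subject"] = "Your activity data export"
    msg["From"] = "exports@example.org"  # placeholder sender address
    msg["To"] = user_email
    msg.set_content("The export you requested is attached.")
    msg.add_attachment(data, maintype="application",
                       subtype="octet-stream", filename=filename)
    with smtplib.SMTP("localhost") as smtp:  # placeholder mail server
        smtp.send_message(msg)

# In the web layer, clicking the download link would only enqueue
# email_export_job(sector_id, current_user.email) and return right away,
# so the request itself stays fast.
```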

Thoughts?

EStrunz commented 9 years ago

Ideally, allowing even non-registered users to download the data would be an added benefit. But if it would impact performance significantly, I wouldn't consider it critical.

dtpowl commented 9 years ago

The precaching method does have the advantage of allowing non-registered users to download the data, and it would offer a slightly smoother experience: users could download the file directly in their browser instead of having it emailed to them.

The disadvantage is primarily that it might slightly increase hosting costs: if we spun up a Heroku 1X worker dyno for this and ran it 24/7, it would add $1.20 per day to the site's hosting costs. The "generate and email" approach would also have an additional cost, but it might be smaller if the number of requests per day is low.

It is possible that database performance will be slightly degraded while the file generation job is running, but I can't predict how big the impact will be until we try it. This issue will be present with both approaches, but if the number of requests per day is low, the window of time during which performance suffers may be smaller with the "generate and email" approach, and wider with the "precaching" approach. On the other hand, the precaching approach would allow us more control over when the job runs, so we'd have the option of mitigating the performance impact by scheduling the jobs to run only when demand is low.

Another point to consider is how frequently the data updates, and how important it is that the report contain all of the latest data. If it's okay for the reports to omit data added within the last 24 hours, then with the precaching approach we'd only need to run the jobs once per day. If it's important that the reports are fully up to date, then we'd need to leave the jobs running full-time.

dtpowl commented 9 years ago

The email method would allow fully up-to-date reports without incurring additional computation time.

vitchell commented 9 years ago

@EStrunz it won't impact performance per se.

The tradeoff here is, basically, that if we go with the method that pre-generates the files, the files will not always be 100% fresh, and there will be associated costs for storing all the files. But anyone will be able to download a file, and the download will happen immediately.

If we create the files when a link is clicked and send them via email, we can only send them to registered users, and it will take a few minutes to create and send each file. But the files will always be up to date, and this approach will have a lower total cost of ownership.

EStrunz commented 9 years ago

Excellent, thanks for this great info. Both strategies sound viable to me, though the "generate and e-mail" approach has the edge from my current perspective. That seems like the best path forward, but I'll defer to your judgment!

vitchell commented 9 years ago

@EStrunz Sounds good. The reality is that switching between the two won't take much work. The bulk of the development time here is building a job that creates the file in the background, and that's required for either path. Let's start with the email system, and if you want to modify it later, we can make that change without much work.
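To make that concrete, here's a sketch of how the delivery step could stay swappable once the background job exists. All helpers here are hypothetical, not the project's code:

```python
# Sketch only: the expensive, shared step is generating the file in a
# background job; swapping delivery between "email it" and "cache it for
# direct download" is a small change on top of that.

def generate_export(sector_id):   # placeholder for the shared generation step
    raise NotImplementedError

def email_file(data, filename, to):   # placeholder: generate-and-email path
    raise NotImplementedError

def cache_file(data, filename):       # placeholder: precaching path
    raise NotImplementedError

def export_job(sector_id, deliver="email", user_email=None):
    data, filename = generate_export(sector_id)    # shared, expensive step
    if deliver == "email":
        email_file(data, filename, to=user_email)
    else:
        cache_file(data, filename)
```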

dtpowl commented 9 years ago

:+1:

vitchell commented 9 years ago

@EStrunz I went ahead and merged the functionality and pushed it to production. I'm double-checking now that everything works, but so far I haven't turned up anything in particular. If you see any problems, please re-open this issue (or submit a new one) and let us know.

EStrunz commented 9 years ago

Will do. I'll do some digging today to make sure everything looks good.

EStrunz commented 9 years ago

Just did some quick testing and I hit a couple of problems (unfortunately):

  1. I tried to export from the Neglected Tropical Diseases sector (http://www.partnersmap.org/sectors/1). The pop-up window triggered for me. Afterwards, however, the site hit an application error and was unavailable for a few minutes. I was logged in as an administrator.
  2. I attempted exporting activities from the admin area, but hit an application error. No pop-up confirmation there. It doesn't look like the admin area export feature is plugging into the new export method.
  3. Assuming we can fix those two hiccups, can we re-insert an option to export all data from the front page of the Partners Map?

I haven't received the exported data yet via e-mail, but it's only been a few minutes. I'll update after I receive it.

Thanks for the great work on this! It'll be amazing to get this working smoothly.

vitchell commented 9 years ago

And other things worked out normally?

EStrunz commented 9 years ago

Just got the spreadsheet. It looks good!

There's only a small typo in the e-mail (screenshot below).

[screenshot of the e-mail]

The bottom link should be updated to partnersmap.org.

Everything else seems to be working smoothly on the site!

EStrunz commented 9 years ago

Actually, one more small change to the notification e-mail template: the "from" address should be info@partnersmap.org instead of cww@taskforce.org, and the name field can be "Partners Map".

vitchell commented 9 years ago

@EStrunz Alrighty, I changed the email, enabled the export functionality on the admin side (I believe I got all of the use cases, but if you see one that I missed, let me know), and enabled downloading on the main projects page.

I couldn't reproduce the site going down, though I'm still looking into it.

vitchell commented 9 years ago

@EStrunz Leaving this open until you follow up with a thumbs up or more issues.

EStrunz commented 9 years ago

Looks good from this end!