unicef / iogt

BSD 2-Clause "Simplified" License

For large sets of data the downloaded survey results are incomplete #1309

Closed MgowanoJr closed 1 year ago

MgowanoJr commented 2 years ago

To replicate:

  1. Go to South Africa Site
  2. Navigate to Form Data
  3. Open the Survey COVID-19 Health Worker Vaccine Survey (https://za.goodinternet.org/admin/forms/submissions/292/)
  4. Download CSV of results

Expectation:

The dataset has 1,180 pages of 20 items, which translates to a download with around 23,600 responses.

Actual:

The CSV file download has only around 4,800 responses

cbunicef commented 2 years ago

A few notes to add on how this issue behaves:

@istride @ChrisMarsh82

istride commented 2 years ago

I have increased the timeout from 30 to 60 seconds for the South Africa site. Let me know if that helps.

@cbunicef If this resolves the problem then it is not related to the app itself but a deployment issue, and I recommend this issue be closed and another raised in the DevOps project instead to apply the fix permanently for all sites.
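The thread does not say exactly which timeout was changed. Gunicorn's default worker timeout happens to be 30 seconds, so one plausible place is the Gunicorn invocation; the sketch below is hedged, with the bind address and WSGI module name assumed rather than taken from the actual deployment:

```shell
# Sketch: raising Gunicorn's worker timeout (default 30s) to 60s.
# "iogt.wsgi:application" and the bind address are assumptions,
# not values from the IoGT deployment.
gunicorn --timeout 60 --bind 0.0.0.0:8000 iogt.wsgi:application
```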

cbunicef commented 2 years ago

When trying to download XLSX I now get:

An error occurred.
Sorry, the page you are looking for is currently unavailable.
Please try again later.

If you are the system administrator of this resource then you should check the error log for details.

Faithfully yours, nginx.

When downloading CSV the download stops at 30 seconds, with around 5K of 24K rows included.

istride commented 2 years ago

There are at least two other places where timeouts are set. After setting them all to 60 seconds, the problem remains for the XLSX download, and the CSV has double the number of rows (12K) but is still incomplete.

I would not recommend increasing the timeout any further; it would need to be at least two minutes for this particular survey, and other surveys may require more time. I don't think there is a way to return an error response in the case of CSV downloads, because the server has already sent a successful response that then gets cut short.

Contrary to what I suggested before, I now think this issue requires more development of the app to resolve.
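The point about not being able to return an error once streaming has begun can be shown with a minimal Python sketch (illustrative only, not IoGT's actual view code): once the first chunk of a streamed response is on the wire, the 200 status is committed, so a worker killed mid-stream simply truncates the file.

```python
import csv
import io

def stream_csv(rows, fail_after=None):
    """Yield CSV lines one at a time, simulating a streamed HTTP body.

    fail_after simulates the worker being killed mid-stream: chunks up
    to that point have already been "sent", so the client is left with
    a truncated file under an HTTP 200 status.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i, row in enumerate(rows):
        if fail_after is not None and i >= fail_after:
            # In a real streamed response the connection just drops here;
            # the 200 status and the earlier rows are already on the wire.
            raise RuntimeError("worker timed out mid-stream")
        writer.writerow(row)
        yield buf.getvalue()
        buf.seek(0)
        buf.truncate(0)

rows = [["id", "answer"]] + [[i, "yes"] for i in range(100)]

received = []
try:
    for chunk in stream_csv(rows, fail_after=50):
        received.append(chunk)
except RuntimeError:
    pass  # the "client" sees a partial file and no error indication

print(len(received))  # only 50 of 101 chunks reached the client
```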

cbunicef commented 2 years ago

Thanks @istride. If we think the time it takes to run the query will have an impact on the API development then let's please consider that early.

ChrisMarsh82 commented 1 year ago

Change the CSV export so that it does not stream; the file will only download once it is fully complete. This will stop users from getting incomplete data and thinking they have a complete data set.

Investigate: can we give a specific error when we get a timeout error in this area?
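A non-streaming export along the lines suggested above can be sketched in plain Python (illustrative, not IoGT's actual code): the whole file is materialised in memory first, so the user either gets complete data or a proper error, never a silent truncation.

```python
import csv
import io

def build_csv(rows):
    """Materialise the entire CSV in memory before returning it.

    If anything fails while building, the caller gets an exception
    (and could return a real HTTP error page) instead of a silently
    truncated download.
    """
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

rows = [["id", "answer"]] + [[i, "yes"] for i in range(3)]
data = build_csv(rows)
print(data.count("\r\n"))  # 4 complete rows, header included
```

The trade-off is memory: a 24K-row export is small, but very large exports would need the background approach discussed below in the thread.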

ChrisMarsh82 commented 1 year ago

Possible solution: downloads go to another area that users can download from once all data is collected
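That "prepare first, download later" idea could be sketched as follows. The helper names are hypothetical, and a production implementation would use a task queue (e.g. Celery) and persistent storage rather than a thread and a temp file:

```python
import csv
import os
import tempfile
import threading

def export_in_background(rows, done_files):
    """Hypothetical 'prepare then download' flow.

    The export runs outside the request/response cycle, so no HTTP
    timeout applies. Only fully written files are registered in
    done_files, so users can never fetch a partial export.
    """
    def work():
        fd, path = tempfile.mkstemp(suffix=".csv")
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        done_files.append(path)  # registered only once complete
    t = threading.Thread(target=work)
    t.start()
    return t

done = []
t = export_in_background([["id"], [1], [2]], done)
t.join()
print(len(done), os.path.getsize(done[0]) > 0)  # 1 True
```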

istride commented 1 year ago

Gunicorn is a low-level component that accepts HTTP requests and creates worker processes that run the IoGT app to service those requests. By default, Gunicorn uses "sync" workers, which each run in their own process, handle a single request at a time, and are killed once a timeout threshold is reached, no matter what; this is why the CSV is cut short. Extending the timeout is not a scalable solution because there is always the risk that a larger export will exceed it.

A better solution would be to use a worker type that can cope with an operation taking an undefined amount of time, like streaming a CSV file. Gunicorn has the option of using threads, which behave like lightweight processes; many threads may exist within one worker process.

One benefit of threads is that a request can run longer than the worker timeout, because the worker keeps notifying the master process that it is not frozen and should not be killed.

We can try the threaded approach to see if it solves this issue. I'll find out what changes are required to make this possible.
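For reference, switching Gunicorn to threaded workers is typically a small configuration change. The values below are illustrative, not the actual IoGT settings:

```python
# gunicorn.conf.py (sketch; values are illustrative, not IoGT's settings)
workers = 2               # worker processes
worker_class = "gthread"  # threaded workers
threads = 4               # each worker serves up to 4 requests concurrently
timeout = 30              # gthread workers heartbeat from their main loop,
                          # so a single long request no longer trips this
```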

istride commented 1 year ago

It has been quite straightforward to configure threads on our sandbox instance, and I have not noticed any adverse effects so far. I have taken the liberty of configuring the South Africa instance in the same way, to see whether this issue still exists or not.

I'm glad to say that I managed to download what I think is the complete set of results (27,368 rows inc. header). The XLSX download was also successful, in that something was downloaded, however, there were only 23,601 rows in the Excel file, and I can't explain why that is.

There is another issue with the Excel download: there is very little indication that anything is happening once the download button is clicked, which may lead a frustrated user to click again and again, eventually occupying several worker threads with the same task simultaneously and possibly causing downtime for the site in general. I have no proof that this would happen, but in any case I think it would be prudent to give users more feedback about the download process.

cbunicef commented 1 year ago

Great to know that the threading has worked!

The resulting data is a strange case... when I downloaded the CSV file a few minutes ago I got 23,601 rows, and the XLSX download gave 23,601 rows as well. Based on the number of pages shown in the Admin Panel (1,180) and the rows per page (20, with the last page full), we would expect 23,601 rows including the header. So I'm not sure why you got so many rows when you downloaded the CSV, but I'm also suspicious that we would have exactly 23,600 responses.
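The expected row count from the pagination can be checked with a line of arithmetic:

```python
pages, per_page = 1180, 20
responses = pages * per_page       # 23,600 data rows
rows_with_header = responses + 1   # 23,601 rows in a file with a header
print(responses, rows_with_header)  # 23600 23601
```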

Now I am seeing that downloading the XLSX results in gateway timeout, CSV is still working.

For user feedback, the simplest solution might be spawning a new tab when the button is clicked, and the download begins there.