microbiomedata / nmdc-server

Data portal client and server for NMDC.
https://data.microbiomedata.org

Bulk download via `/api/bulk_download/:id` API endpoint fails #1204

Closed · eecavanna closed this issue 3 months ago

eecavanna commented 3 months ago

An end user reported this issue. When they tried to perform a "bulk download" via the Data Portal UI in Google Chrome, they were shown a Cloudflare-branded error page with status code 524 ("a timeout occurred").

[Screenshot: Cloudflare-branded error page showing error code 524]

A team member and I reproduced the issue on our own computers; in fact, we failed to get the "bulk download" feature to work with any data set.

Steps to reproduce

  1. Go to https://data.microbiomedata.org/
  2. Sign in with your ORCID credentials
  3. In the search box on the left, type Colonization resistance against Candida and, in the "Query Options" list that appears below it, click the item that says Name (Study)
    • This step is optional, but reduces the size of the download for demonstration purposes
  4. Once the search results load, scroll down to the "Samples" section
  5. In the "Bulk Download" panel, under "Reads QC Analysis Activity", mark the checkbox next to "QC Statistics"
    • I think the specific item(s) you choose are arbitrary with respect to demonstrating this issue
  6. Click the "Download ZIP" button
  7. In the modal window that appears, click the "Accept and Continue to Download" button
  8. See error:
    1. If using Chrome: Notice that a Cloudflare-branded error page eventually appears, showing error code 524
    2. If using Brave: In DevTools, notice that the status of the HTTP request associated with the bulk download remains at (pending) for exactly 1 minute (suspiciously specific), then changes to 200 and prompts you to download a file named archive. If you download the file, Brave's "Downloads" list shows a "Failed - Network error" message.

eecavanna commented 3 months ago

The API endpoint is implemented here:

https://github.com/microbiomedata/nmdc-server/blob/0d645788f88ecfa6255be95f4fabc9425094999a/nmdc_server/api.py#L593-L610
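For anyone coming to this fresh: if I'm reading the linked code and our nginx setup right, the zip is not assembled by the Python process itself. The endpoint returns a plain-text manifest of files plus an `X-Archive-Files: zip` header, and nginx's mod_zip module fetches each listed location and streams the assembled archive to the client. Here's a minimal sketch of that pattern with FastAPI (everything named below is illustrative, not the real implementation):

```python
# Minimal sketch of the mod_zip response pattern (illustrative only; not
# the actual nmdc-server implementation).
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

# Hypothetical stand-in for the real data-object lookup. Each entry:
# (crc32 hex or "-" if unknown, size in bytes,
#  nginx-resolvable location, file name inside the zip)
FAKE_FILES = [
    ("-", 2417664, "/nmdcdemo/some/path/qc_stats.tsv", "sample1/qc_stats.tsv"),
]

@app.get("/api/bulk_download/{bulk_download_id}")
def bulk_download(bulk_download_id: str) -> PlainTextResponse:
    # mod_zip expects one manifest line per file:
    # "<crc32> <size> <location> <name>". The locations must be URIs that
    # nginx itself can fetch, which is why remote hosts have to be proxied.
    manifest = "".join(
        f"{crc} {size} {location} {name}\n"
        for crc, size, location, name in FAKE_FILES
    )
    return PlainTextResponse(manifest, headers={"X-Archive-Files": "zip"})
```

The practical consequence: if a data object's URL points at a host that nginx cannot resolve to a local location, mod_zip has nothing it can fetch and the response never completes.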

eecavanna commented 3 months ago

I used curl to try to dump more information from the HTTP client's perspective...

```shell
$ curl -vvv \
  --cookie "session=$DATA_PORTAL_SESSION_COOKIE" \
  --output localfile \
  https://data.microbiomedata.org/api/bulk_download/8c66069a-(...)
```

...and here's what curl dumped after 1 minute:

```
* HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
  0 2361k    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
* Connection #0 to host data.microbiomedata.org left intact
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
```

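The failure at exactly the one-minute mark suggests a fixed timeout at some proxy layer between the client and the app (60 seconds happens to be nginx's default `proxy_read_timeout`, for instance), though the curl output alone doesn't identify which layer is responsible. For repeated testing, here's a rough Python equivalent of the curl command above; it assumes the same session cookie in the environment and a real bulk download ID in place of the placeholder:

```python
# Rough Python equivalent of the curl reproduction (illustrative only).
import os
import time

import requests

url = "https://data.microbiomedata.org/api/bulk_download/<bulk-download-id>"
cookies = {"session": os.environ["DATA_PORTAL_SESSION_COOKIE"]}

start = time.monotonic()
try:
    # Stream the body so we can observe how many bytes arrive before failure.
    with requests.get(url, cookies=cookies, stream=True, timeout=(10, 300)) as resp:
        print("status:", resp.status_code)
        received = 0
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            received += len(chunk)
        print("bytes received:", received)
except requests.exceptions.RequestException as exc:
    elapsed = time.monotonic() - start
    print(f"failed after {elapsed:.0f}s: {exc!r}")
```
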
naglepuff commented 3 months ago

It seems that the endpoint doesn't always fail, which is good in that it gives us a hint at what might be happening. For example, bulk downloading data from certain studies, like gold:Gs0110138, works fine.

My first thought is related to the code comment here. The NGINX plugin (mod_zip) needs to treat all URLs in an archive as local ones, so we've set up proxies for certain remote hosts. Given the influx of data over the past year, and that this proxy configuration has been unchanged, I decided to investigate the current state of our data objects collection.

```sql
-- First I'd like to know if we have data objects whose URLs don't match our configured proxies
select count(*) from data_object
  where url not like 'https://nmdcdemo.emsl.pnnl.gov%'
    and url not like 'https://data.microbiomedata.org%';
-- 6565

-- So we have some number of data objects that mod_zip probably can't find.
-- To find all domains used by data objects:
select distinct regexp_replace(url, '^https?://([^/]+).*$', '\1') as domain from data_object;
--          domain
-- -------------------------
--                            <- blank row: data objects with a null URL
--  storage.neonscience.org
--  data.microbiomedata.org
--  portal.nersc.gov
--  nmdcdemo.emsl.pnnl.gov
-- (5 rows)
```

I suspect that adding storage.neonscience.org and portal.nersc.gov to the list of proxies in the Python code and NGINX configuration will solve the problem.
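For illustration, here's a sketch of the shape that fix might take on the Python side; the mapping, names, and proxy paths below are hypothetical (the real ones live in the nmdc-server code and the nginx configuration):

```python
# Hypothetical host-to-location map of the kind the fix implies. Each
# external host needs a matching nginx proxy location so that mod_zip
# can fetch its files as if they were local.
PROXIED_HOSTS = {
    "https://nmdcdemo.emsl.pnnl.gov": "/nmdcdemo",
    "https://data.microbiomedata.org": "/data",
    # Hosts the query above suggests are missing:
    "https://storage.neonscience.org": "/neon",
    "https://portal.nersc.gov": "/nersc",
}

def to_internal_location(url: str) -> str | None:
    """Rewrite an external data-object URL to its nginx proxy location."""
    for prefix, location in PROXIED_HOSTS.items():
        if url.startswith(prefix):
            return location + url[len(prefix):]
    return None  # no proxy configured for this host
```

Each entry would also need a corresponding `location`/`proxy_pass` block on the nginx side, and the blank-URL row found above would presumably need separate handling, since a null URL can't be rewritten at all.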

ssarrafan commented 3 months ago

Mike fixed this issue. Thank you, Mike! We discussed adding a cache refresh to the release schedule with Eric.

From Slack: The frontend pod(s) for dev are only 2 hours old, while the corresponding pods for prod are ~5 days old. I do vaguely recall that in the past, when we've seen issues with bulk download, the culprit has been on the Rancher/Cloudflare side. It might also involve the portal's data-proxy service.

eecavanna commented 3 months ago

To resolve this issue today, he redeployed the portal-frontend service in the nmdc (production) namespace.