microbiomedata / nmdc-server

Data portal client and server for NMDC.
https://data.microbiomedata.org

Bulk download via `/api/bulk_download/:id` API endpoint fails #1204

Closed · eecavanna closed this issue 3 months ago

eecavanna commented 3 months ago

An end user reported this issue. When they tried to perform a "bulk download" via the Data Portal UI in Google Chrome, they were shown a Cloudflare-branded error page with status code 524 ("a timeout occurred").

[Screenshot: Cloudflare-branded error page showing error code 524]

A team member and I reproduced the issue on our own computers; in fact, we failed to get the "bulk download" feature to work with any data set.

Steps to reproduce

  1. Go to https://data.microbiomedata.org/
  2. Sign in with your ORCID credentials
  3. In the search box on the left, type Colonization resistance against Candida and, in the "Query Options" list that appears below it, click the item that says Name (Study)
    • This step is optional, but reduces the size of the download for demonstration purposes
  4. Once the search results load, scroll down to the "Samples" section
  5. In the "Bulk Download" panel, under "Reads QC Analysis Activity", mark the checkbox next to "QC Statistics"
    • I think the specific item(s) you choose are arbitrary with respect to demonstrating this issue
  6. Click the "Download ZIP" button
  7. In the modal window that appears, click the "Accept and Continue to Download" button
  8. See error:
    1. If using Chrome: Notice that a Cloudflare-branded error page eventually appears, showing error code 524
    2. If using Brave: In DevTools, notice that the status of the HTTP request associated with the bulk download remains at (pending) for exactly 1 minute (suspiciously specific), then changes to 200 and prompts you to download a file named archive. If you download the file, Brave's "Downloads" list shows a "Failed - Network error" message.

eecavanna commented 3 months ago

The API endpoint is implemented here:

https://github.com/microbiomedata/nmdc-server/blob/0d645788f88ecfa6255be95f4fabc9425094999a/nmdc_server/api.py#L593-L610
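For anyone coming to this fresh: if I'm reading the linked code and our nginx setup right, the zip is not assembled by the Python process itself. The endpoint returns a plain-text manifest of files plus an `X-Archive-Files: zip` header, and nginx's mod_zip module fetches each listed location and streams the assembled archive to the client. Here's a minimal sketch of that pattern with FastAPI (everything named below is illustrative, not the real implementation):

```python
# Minimal sketch of the mod_zip response pattern (illustrative only; not
# the actual nmdc-server implementation).
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

# Hypothetical stand-in for the real data-object lookup. Each entry:
# (crc32 hex or "-" if unknown, size in bytes,
#  nginx-resolvable location, file name inside the zip)
FAKE_FILES = [
    ("-", 2417664, "/nmdcdemo/some/path/qc_stats.tsv", "sample1/qc_stats.tsv"),
]

@app.get("/api/bulk_download/{bulk_download_id}")
def bulk_download(bulk_download_id: str) -> PlainTextResponse:
    # mod_zip expects one manifest line per file:
    # "<crc32> <size> <location> <name>". The locations must be URIs that
    # nginx itself can fetch, which is why remote hosts have to be proxied.
    manifest = "".join(
        f"{crc} {size} {location} {name}\n"
        for crc, size, location, name in FAKE_FILES
    )
    return PlainTextResponse(manifest, headers={"X-Archive-Files": "zip"})
```

The practical consequence: if a data object's URL points at a host that nginx cannot resolve to a local location, mod_zip has nothing it can fetch and the response never completes.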

eecavanna commented 3 months ago

I used curl to try to dump more information from the HTTP client's perspective...

```shell
$ curl -vvv \
  --cookie "session=$DATA_PORTAL_SESSION_COOKIE" \
  --output localfile \
  https://data.microbiomedata.org/api/bulk_download/8c66069a-(...)
```

...and here's what curl dumped after 1 minute:

```
* HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
  0 2361k    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
* Connection #0 to host data.microbiomedata.org left intact
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
```

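The failure at exactly the one-minute mark suggests a fixed timeout at some proxy layer between the client and the app (60 seconds happens to be nginx's default `proxy_read_timeout`, for instance), though the curl output alone doesn't identify which layer is responsible. For repeated testing, here's a rough Python equivalent of the curl command above; it assumes the same session cookie in the environment and a real bulk download ID in place of the placeholder:

```python
# Rough Python equivalent of the curl reproduction (illustrative only).
import os
import time

import requests

url = "https://data.microbiomedata.org/api/bulk_download/<bulk-download-id>"
cookies = {"session": os.environ["DATA_PORTAL_SESSION_COOKIE"]}

start = time.monotonic()
try:
    # Stream the body so we can observe how many bytes arrive before failure.
    with requests.get(url, cookies=cookies, stream=True, timeout=(10, 300)) as resp:
        print("status:", resp.status_code)
        received = 0
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            received += len(chunk)
        print("bytes received:", received)
except requests.exceptions.RequestException as exc:
    elapsed = time.monotonic() - start
    print(f"failed after {elapsed:.0f}s: {exc!r}")
```
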
naglepuff commented 3 months ago

It seems that the endpoint doesn't always fail, which is good in that it gives us a hint at what might be happening. For example, bulk downloading data from certain studies, like gold:Gs0110138, works fine.

My first thought is related to the code comment here. The NGINX plugin (mod_zip) needs to treat all URLs in an archive as local ones, so we've set up proxies for certain remote hosts. Given the influx of data over the past year, and that this proxy configuration has been unchanged, I decided to investigate the current state of our data objects collection.

```sql
-- First I'd like to know if we have data objects whose URLs don't match our configured proxies
select count(*) from data_object
  where url not like 'https://nmdcdemo.emsl.pnnl.gov%'
    and url not like 'https://data.microbiomedata.org%';
-- 6565

-- So we have some number of data objects that mod_zip probably can't find.
-- To find all domains used by data objects:
select distinct regexp_replace(url, '^https?://([^/]+).*$', '\1') as domain from data_object;
--          domain
-- -------------------------
--                            <- blank row: data objects with a null URL
--  storage.neonscience.org
--  data.microbiomedata.org
--  portal.nersc.gov
--  nmdcdemo.emsl.pnnl.gov
-- (5 rows)
```

I suspect that adding storage.neonscience.org and portal.nersc.gov to the list of proxies in the Python code and NGINX configuration will solve the problem.
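For illustration, here's a sketch of the shape that fix might take on the Python side; the mapping, names, and proxy paths below are hypothetical (the real ones live in the nmdc-server code and the nginx configuration):

```python
# Hypothetical host-to-location map of the kind the fix implies. Each
# external host needs a matching nginx proxy location so that mod_zip
# can fetch its files as if they were local.
PROXIED_HOSTS = {
    "https://nmdcdemo.emsl.pnnl.gov": "/nmdcdemo",
    "https://data.microbiomedata.org": "/data",
    # Hosts the query above suggests are missing:
    "https://storage.neonscience.org": "/neon",
    "https://portal.nersc.gov": "/nersc",
}

def to_internal_location(url: str) -> str | None:
    """Rewrite an external data-object URL to its nginx proxy location."""
    for prefix, location in PROXIED_HOSTS.items():
        if url.startswith(prefix):
            return location + url[len(prefix):]
    return None  # no proxy configured for this host
```

Each entry would also need a corresponding `location`/`proxy_pass` block on the nginx side, and the blank-URL row found above would presumably need separate handling, since a null URL can't be rewritten at all.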

ssarrafan commented 3 months ago

Mike fixed this issue. Thank you, Mike! We discussed adding a cache refresh to the release schedule with Eric.

From Slack: The frontend pod(s) for dev are only 2 hours old, while the corresponding pods for prod are ~5 days old. I do vaguely recall that in the past, when we've seen issues with bulk download, the culprit has been on the Rancher/Cloudflare side. It might also involve the portal's data-proxy service.

eecavanna commented 3 months ago

To resolve this issue today, he redeployed the portal-frontend service in the nmdc (production) namespace.