eecavanna closed this issue 3 months ago
The API endpoint is implemented here:
I used `curl` to try to dump more information from the HTTP client's perspective...

```shell
$ curl -vvv \
    --cookie "session=$DATA_PORTAL_SESSION_COOKIE" \
    --output localfile \
    https://data.microbiomedata.org/api/bulk_download/8c66069a-(...)
```
...and here's what `curl` dumped after 1 minute:

```
* HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
  0 2361k    0     0    0     0      0      0 --:--:--  0:01:00 --:--:--     0
* Connection #0 to host data.microbiomedata.org left intact
curl: (92) HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
```
It seems that the endpoint doesn't always fail, which is good in that it gives us a hint at what might be happening. For example, bulk downloading data from certain studies, like `gold:Gs0110138`, works fine.
My first thought relates to the code comment here: the NGINX mod_zip plugin needs to be able to fetch every URL in an archive as a local one, and we've only set up proxies for certain remote hosts. With the influx of data over the past year, and this proxy configuration being unchanged, I decided to investigate the current state of our data objects collection.
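To make that concrete, here is a minimal sketch of the rewrite step involved, assuming a hypothetical proxy map and function name (the portal's actual code differs): mod_zip can only pull archive members from locations NGINX itself can resolve, so each data object URL on a known host gets rewritten to a local proxy path, and a URL on any host missing from the map falls through.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical proxy map (remote host -> local NGINX proxy prefix).
# Only these two hosts are covered by the configuration under discussion.
PROXY_PREFIXES = {
    "data.microbiomedata.org": "/data",
    "nmdcdemo.emsl.pnnl.gov": "/nmdcdemo",
}

def to_local_path(url: str) -> Optional[str]:
    """Rewrite a data object URL into a path mod_zip can fetch via NGINX.

    Returns None when the URL's host has no configured proxy -- the case
    suspected of breaking archive assembly.
    """
    parsed = urlparse(url)
    prefix = PROXY_PREFIXES.get(parsed.netloc)
    if prefix is None:
        return None  # unproxied host: NGINX can't serve this member locally
    return prefix + parsed.path
```

Under this sketch, a `portal.nersc.gov` or `storage.neonscience.org` URL yields `None`, i.e. an archive member mod_zip can't fetch.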
```sql
-- First, I'd like to know whether we have data objects whose URLs don't match our configured proxies.
select count(*)
from data_object
where url not like 'https://nmdcdemo.emsl.pnnl.gov%'
  and url not like 'https://data.microbiomedata.org%';
-- 6565
-- So we have some number of data objects that mod_zip probably can't find.

-- To find all domains used by data objects:
select distinct regexp_replace(url, '^https?://([^/]+).*$', '\1') as domain
from data_object;
-- domain
--
-- storage.neonscience.org
-- data.microbiomedata.org
-- portal.nersc.gov
-- nmdcdemo.emsl.pnnl.gov
-- (5 rows) <-- note that we have a blank row representing a null URL
```
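The same extraction can be checked outside the database. This sketch applies the same regular expression as the query above to a few stand-in URLs (hypothetical values, not real rows) and flags the domains the NGINX configuration doesn't proxy:

```python
import re

# Stand-in URLs; the real values live in the data_object collection.
urls = [
    "https://data.microbiomedata.org/data/abc.fastq.gz",
    "https://nmdcdemo.emsl.pnnl.gov/proteomics/def.raw",
    "https://storage.neonscience.org/neon/ghi.zip",
    "https://portal.nersc.gov/project/jkl.tar.gz",
]

# Same pattern as the regexp_replace in the SQL above.
domains = {re.sub(r"^https?://([^/]+).*$", r"\1", u) for u in urls}

# Hosts the current NGINX configuration proxies.
proxied = {"data.microbiomedata.org", "nmdcdemo.emsl.pnnl.gov"}

unproxied = sorted(domains - proxied)
print(unproxied)  # -> ['portal.nersc.gov', 'storage.neonscience.org']
```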
I suspect that adding `storage.neonscience.org` and `portal.nersc.gov` to the list of proxies in the Python code and the NGINX configuration will solve the problem.
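On the NGINX side, that change might look roughly like the following hypothetical internal location blocks (the location names and the overall layout are assumptions, not the actual deployment's configuration); `internal` keeps them reachable only by subrequests such as the ones mod_zip issues:

```nginx
# Hypothetical proxy locations for the two currently unproxied hosts.
location /neon/ {
    internal;  # only reachable via internal subrequests (e.g., mod_zip)
    proxy_pass https://storage.neonscience.org/;
}

location /nersc/ {
    internal;
    proxy_pass https://portal.nersc.gov/;
}
```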
Mike fixed this issue. Thank you, Mike!
Discussed adding refresh of cache to release schedule with Eric.
From Slack: The frontend pod(s) for dev are only 2 hours old, while the corresponding pods for prod are ~5 days old. I do vaguely recall that in the past when we've seen issues with bulk download the culprit has been on the rancher/cloudflare side. It might also involve the portal's data-proxy service.
An end user reported this issue. When they tried to perform a "bulk download" via the Data Portal UI in Google Chrome, they got a Cloudflare-branded error page with code `524` (i.e., "a timeout occurred") on it. A team member and I reproduced the issue on our own computers and, in addition, failed to get the "bulk download" feature to work with any data set.
Steps to reproduce
- Search for "Colonization resistance against Candida" and, in the "Query Options" list that appears below it, click the item that says `Name (Study)`.
- `524`
- The request shows as `(pending)` for exactly 1 minute (suspiciously specific), then changes to `200` and prompts you to download a file named `archive`. If you download the file, Brave's "Downloads" list shows a "Failed Network error" message.

References