scientist-softserv / britishlibrary

Write up of how to download several files to avoid service slowing/outage as per August 9th 2023 #468

Open j-basford opened 10 months ago

j-basford commented 10 months ago

The service outage on 9th August is suspected to be related to a user attempting to download content via the method described here: https://github.com/jamespjh/EngineeringForDataAnalysisExamples/tree/main

Effectively, this is web scraping the repositories. It caused extremely slow downloads for the user, and all tenants saw outages.

Rory said he would describe the appropriate route for the researcher to follow rather than this method, so they are able to download content and they can share this with their students.

We will outline the correct method on the 'about' page of the repositories so that we can point people to it.

cziaarm commented 10 months ago

Hello,

After reviewing the script and the specific collection it was downloading, I don't think that the web-scraping element of the endeavour was the thing causing the issue. There are more API-ish ways to collect the download links; perhaps configuring OAI-PMH sets based on collections would allow a simpler interface for consumers to gather metadata and data from a specific collection.
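To illustrate what a set-based OAI-PMH request could look like for a consumer, here is a minimal sketch. It assumes the repository exposes an OAI-PMH endpoint with one set per collection, which this thread does not confirm; the base URL and set name in the usage note are hypothetical, and the function names are mine. It only lists record identifiers, it does not resolve file download links.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol for response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_identifiers_url(base_url, set_spec=None, metadata_prefix="oai_dc",
                         resumption_token=None):
    """Build a ListIdentifiers request URL, optionally limited to one set."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and set on follow-up requests.
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token
```

For example, `list_identifiers_url("https://repo.example.org/oai", set_spec="collection_x")` builds the first request for a hypothetical collection set; a consumer would keep requesting with the returned token until `parse_identifiers` gives back `None` for it.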

HOWEVER

The script concentrates on a single collection, so I don't think that the web-scraping method used to obtain those links was the reason that performance suffered in August. The issue is the subsequent download of the files, which are relatively few but also relatively large (30-40GB+ each).

I am not an Ansible expert, but my current hypothesis is that the method used to download the files once the URLs are collected (https://github.com/jamespjh/EngineeringForDataAnalysisExamples/blob/main/ansible/download_data.yml#L33-L38) downloads these large files asynchronously, and that this is gumming up the available bandwidth. I would suggest exploring methods offered by Ansible or bash to ensure that the download requests are synchronous, and perhaps introducing some pauses between the download requests to ensure other users of the service have fair access.
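As a rough illustration of the "synchronous with pauses" idea, here is a minimal sketch in Python rather than Ansible or bash. It is not the researcher's script and not a tested recommendation for this service: the function names and the 60-second default pause are assumptions chosen for illustration, and a suitable pause would need to be agreed with the service team.

```python
import shutil
import time
import urllib.request
from pathlib import Path


def download_one(url, dest_dir, chunk_size=1024 * 1024):
    """Stream one file to disk in chunks so a large file is never held in memory."""
    # Use the last path segment of the URL as the local file name.
    name = url.rstrip("/").rsplit("/", 1)[-1] or "download"
    dest = Path(dest_dir) / name
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


def download_sequentially(urls, dest_dir, pause_seconds=60,
                          fetch=download_one, sleep=time.sleep):
    """Download URLs strictly one at a time, pausing between requests."""
    saved = []
    for index, url in enumerate(urls):
        if index > 0:
            # Assumed courtesy gap so other users are not starved of bandwidth.
            sleep(pause_seconds)
        # The next request only starts after this call has returned.
        saved.append(fetch(url, dest_dir))
    return saved
```

Because each call to `fetch` has to return before the loop moves on, at most one large transfer is in flight at any time. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.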

j-basford commented 10 months ago

Thank you Rory. So sounds like we need to look at the downloading big files issue that Graham has created a ticket for (and perhaps this is linked to uploading big files? Or might be totally separate of course!)?

j-basford commented 10 months ago

That's ticket #465 for info!

cziaarm commented 10 months ago

I don't think that issue #465 is related to this one.