neuropoly / data-management

Repo that deals with datalad aspects for internal use

Praxis Data Server (`https://spineimage.ca`) #77

Open kousu opened 3 years ago

kousu commented 3 years ago

https://praxisinstitute.org wants to fund a Canada-wide spine scan sharing platform.

They were considering paying OBI as a vendor to set up a neuroimaging repository, but they had doubts about the quality of that solution, looked around for alternatives, and have landed on asking us for help.

We've proposed a federated data sharing plan and they are interested in pursuing this line.

Needs

kousu commented 1 year ago

Backups ("offsite")

Right now we only have two backups: one on spineimage.ca:/var, which doesn't really count as a good backup, and the one I created above, which, like the server, is on Arbutus in Victoria, so one natural disaster could wipe out all the data. Moreover, there are not very many keyholders -- just me at the moment -- and the data is stored inside an OpenStack project owned by @jcohenadad, all of which makes neuropoly a single point of failure.

https://github.com/neuropoly/computers/blob/722545d38adc688fa621e4e25792371f36edd7fe/ansible/host_vars/spineimage.ca.yml#L2-L48

We should have other physical locations to protect against natural disasters; the data sharing agreement requires us to stick to ComputeCanada as a line of defense against leaks, but most of their clusters now run OpenStack, so we can choose a physical location other than Arbutus.

We should also have other keyholders, ones who do not work for neuropoly so Praxis doesn't risk losing the data if we mess up or are attacked and get our accounts locked or wiped.


Towards all this I have been asking Praxis for help, and they have found a keyholder. This person has been granted a separate ComputeCanada account and is ready to take on keyholding. They are apparently comfortable with the command line but don't have much time to be involved; still, they can hold the keys and, hopefully, bootstrap disaster recovery when needed.

Requesting Cloud Projects

In February, I emailed tech support because, despite seeing the list of alternate clouds, the sign-up form doesn't provide a way to request one. They were extremely helpful about this:

To: "Nick Guenther" nick.guenther@polymtl.ca From: Jean-François Landry via Cloud Support cloud@tech.alliancecan.ca Date: Fri, 17 Feb 2023 22:02:57 +0000

2023-02-17 16:08 (America/Toronto) - Nick Guenther wrote:

> How may we request resources on cedar.cloud.computecanada.ca or beluga.cloud.computecanada.ca? The google form at https://docs.google.com/forms/d/e/1FAIpQLSeU_BoRk5cEz3AvVLf3e9yZJq-OvcFCQ-mg7p4AWXmUkd5rTw/viewform doesn't allow choosing which cluster to use.

There is no specific option, just ask nicely in the free form description box.

> Also, may we request a cloud allocation of only object storage? The form forces us to allocate at least 1 VM and one 20GB disk and 1 IP. Allocating and not using a virtual disk isn't that expensive for you, but allocating and not using an IP address is quite so and I don't want to waste one.

You can. Again, no specific "object store only" cloud RAS allocation, just fill in the minimum for VCPU/RAM etc. and please explain in the free form description box.

You can get up to 10TB of object storage through cloud RAS.

They also added

There is no geo-distributed storage system period, but the Arbutus object store works great with restic (note that restic tried to pack chunks into 16MB minimum objects by default so it will not generate hundreds of millions of tiny objects). Also please update to the latest 0.15.1 release, the new v2 repo format is considered stable and does include zstd compression by default.

So I don't expect any problems requesting storage for backups from them. It sounds like they are familiar with restic and use it all the time.
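
A minimal sketch, not something I have run against the production repository yet, of how to check the client version and the repository format they recommend, assuming restic >= 0.15.1 is installed and `RESTIC_REPOSITORY`/`RESTIC_PASSWORD` are set as in the snippet below (which shows the existing repo is still version 1):

```
# check the restic client version and the repository format version
restic version      # should report 0.15.1 or newer
restic cat config   # prints the repo config; "version": 1 is the old format
# in-place upgrade to the v2 (zstd-capable) format; only do this once every
# client that touches the repository runs restic >= 0.14
restic migrate upgrade_repo_v2
```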

Lack of Existing Keyholders

I realized that the existing backups currently have only one restic key, and probably only one matching s3 credential, the one used by the bot:

```
$ echo $RESTIC_REPOSITORY
s3:object-arbutus.cloud.computecanada.ca/def-jcohen-test2
$ restic key list
repository 2d22bf7f opened (version 1)
found 2 old cache directories in /home/kousu/.cache/restic, run `restic cache --cleanup` to remove them
 ID        User   Host           Created
----------------------------------------------------
*8bd433bf  gitea  spineimage.ca  2022-11-30 20:57:11
----------------------------------------------------
```

I am going to add s3+restic key credentials for:

I've done this by running

```
# create a new S3 (EC2-style) credential pair in the OpenStack project
openstack ec2 credentials create -c access -c secret
# generate a 100-character restic password and record it for the keyholder
PW=$(pwgen 100 1); echo "RESTIC_PASSWORD=$PW"
# register that password as an extra key on the repository (restic asks for it twice)
(echo $PW; echo $PW) | restic key add --user $name --host $institution.tld
```

for each person. I have the notes saved in /tmp and will distribute them as securely as I can.

kousu commented 1 year ago

Backup Keyholder Onboarding

On Wednesday the 24th we are going to have a meeting with Praxis's nominee where we:

  1. Have them install restic

  2. Provide them restic credentials to the existing backups

  3. Test by having them run `restic snapshots` and `restic ls latest` (a sketch of this check follows the list)

  4. Mention that restic disaster recovery docs are at https://restic.readthedocs.io/en/stable/050_restore.html

  5. Mention that the creds include s3 creds so they can be used with s3cmd or aws-cli

  6. Walk them through requesting a cloud project of their own.

    It should be on Graham, geographically separate from the existing server and backups, and it doesn't need an IP address wasted on it. Here's the application form, filled out with copy-pasteable answers:

    ~~Cloud Application Form~~

    * Request type: New project + RAS request
    * Project Type: persistent
    * Project name suffix: custom -> backup
    * VCPUs: 1
    * Instances: 1
    * Volumes: 1
    * Volume snapshots: 0
    * RAM: 1.5
    * Floating IPs: 1
    * Persistent storage: 20
    * Object storage: 1000
    * Shared filesystem storage: 0
    * Explain why you need cloud resources:
      > I am working with a team hosting a research data server https://spineimage.ca on Arbutus that is looking for storage space for backups.
      >
      > We only need object storage. Please do not actually allocate any VMs, volumes, and especially no IP addresses for this.
      >
      > Please allocate the cloud project on Graham, so that a disaster at Arbutus will not risk our backups.
      >
      > Thank you!
    * Explain why the various Compute Canada HPC clusters are not suitable for your needs:
      > The HPC clusters are primarily for compute, not storage.
    * Explain what your plan is for efficiently using the cloud resources requested:
      > We are using restic, a deduplicating and compressing backup system; we do not need any compute resources allocated, only storage. We are requesting only as much storage as we have provisioned on the original server, and do not expect to fill either up at this time.
    * Describe your plans for maintenance and security upkeep:
      > We do not intend to run a server under this cloud allocation.

    EDIT: we were misinformed.

    2023-05-24 16:01 Lucas Whittington via Cloud Support wrote:

    Unfortunately, Arbutus is the only Alliance cloud that provides object storage. It is stored on separate machines from our volume cluster but won't protect you in the event of an incident that affects our entire data centre. Let me know if you would like to proceed.

    Instead, we will build our own file server on the other cluster. I'll either use minio or just sftp (see the sketch after this list). Here's the updated request:

    Cloud Application Form

    * Request type: New project + RAS request
    * Project Type: persistent
    * Project name suffix: custom -> backup
    * VCPUs: 1
    * Instances: 1
    * Volumes: 1
    * Volume snapshots: 0
    * RAM: 1.5
    * Floating IPs: 1
    * Persistent storage: 1000
    * Object storage: 0
    * Shared filesystem storage: 0
    * Explain why you need cloud resources:
      > I am working with a team hosting a research data server https://spineimage.ca on Arbutus that is looking for storage space for backups.
      >
      > Please allocate the cloud project on Graham, so that a disaster at Arbutus will not risk our backups.
      >
      > Thank you!
    * Explain why the various Compute Canada HPC clusters are not suitable for your needs:
      > The HPC clusters are primarily for compute, not storage.
    * Explain what your plan is for efficiently using the cloud resources requested:
      > We are using restic, a deduplicating and compressing backup system. We are requesting only as much storage as we have provisioned on the original server, and do not expect to fill either up at this time.
    * Describe your plans for maintenance and security upkeep:
      > We will enable fail2ban and Debian's automatic upgrades, and install netdata as an alerting system.
  7. Leave them with instructions on how to generate and send us countervailing s3 creds

    S3 Credential Generation (based on https://docs.alliancecan.ca/wiki/Arbutus_object_storage)

    1. Install `openstack`:
       * `brew install openstackclient`
       * `apt install python3-openstackclient`
       * otherwise: `pip install python-openstackclient`
    2. Log in to https://graham.cloud.computecanada.ca

       ![Screenshot 2023-05-24 at 02-19-01 Connexion - OpenStack Dashboard](https://github.com/neuropoly/data-management/assets/987487/b664bad9-8518-4276-bc24-6e4f0d428cda)
    3. Download the OpenStack RC File from under your profile menu in the top right corner

       ![openstack rc file](https://github.com/neuropoly/data-management/assets/987487/bb4fc40c-21af-4a71-98b2-ae141e95280d)
    4. Load it into your shell:

       ```
       $ . /tmp/def-jcohen-dev-openrc.sh
       Please enter your OpenStack Password for project def-jcohen-dev as user nguenthe: [ CLOUD PASSWORD HERE ]
       ```
    5. Make S3 credentials:

       ```
       $ openstack ec2 credentials create -c access -c secret
       +--------+----------------------------------+
       | Field  | Value                            |
       +--------+----------------------------------+
       | access | 5390ea0b6d4001ccb1093c91b311e181 |
       | secret | 1f93ff01ddcae38594c5fcfceb24b850 |
       +--------+----------------------------------+
       ```

       `access` is `AWS_ACCESS_KEY_ID` and `secret` is `AWS_SECRET_ACCESS_KEY`, as used by restic, s3cmd, or awscli.

    We need s3 credentials generated for:

    * the backup bot
    * themselves
    * @kousu
    * @mguaypaq
    * @jcohenadad ?
    * David Cadotte ?

    Please forward the credentials privately to each individual keyholder. We will discuss at the meeting what the safest way to do that is.

  8. I will initialize the new repository and then hand out `RESTIC_PASSWORD`s to all the keyholders using `pwgen 100 1`.
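
A sketch of the step-3 check, assuming the keyholder has received their s3 credentials and restic password privately; the repository URL is the existing one from the `restic key list` output above, and the angle-bracketed values are placeholders:

```
# credentials distributed privately to each keyholder (placeholders here)
export AWS_ACCESS_KEY_ID=<access>
export AWS_SECRET_ACCESS_KEY=<secret>
export RESTIC_REPOSITORY=s3:object-arbutus.cloud.computecanada.ca/def-jcohen-test2
export RESTIC_PASSWORD=<their restic password>

restic snapshots   # should list the existing backups without errors
restic ls latest   # should list the files in the most recent snapshot
```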

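And a sketch of what the restic target would look like if the new Graham machine ends up as a plain sftp file server; the hostname, account, and path here are placeholders, nothing has been provisioned yet:

```
# assumes a dedicated ssh account on a hypothetical backup VM on Graham
export RESTIC_REPOSITORY=sftp:backup@backups.example.org:/srv/restic/spineimage.ca
export RESTIC_PASSWORD=<new repository password from pwgen 100 1>

restic init                 # create the new repository (step 8 above)
restic backup /srv/gitea    # example: back up the gitea data volume
```
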
kousu commented 11 months ago

Disk Latency Problem

I just came up against this after rebooting:

```
Nov 07 07:39:30 spineimage.ca systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.
Nov 07 07:39:31 spineimage.ca systemd[1]: dev-disk-by\x2duuid-2067a784\x2d07ef\x2d4317\x2d88d0\x2d4591442577d1.device: Job dev-disk-by\x2duuid-2067a784\x2d>
Nov 07 07:39:31 spineimage.ca systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-2067a784\x2d07ef\x2d4317\x2d88d0\x2d4591442577d1.device - /dev/d>
Nov 07 07:39:31 spineimage.ca systemd[1]: Dependency failed for systemd-fsck@dev-disk-by\x2duuid-2067a784\x2d07ef\x2d4317\x2d88d0\x2d4591442577d1.service ->
Nov 07 07:39:31 spineimage.ca systemd[1]: Dependency failed for srv-gitea.mount - /srv/gitea.
Nov 07 07:39:31 spineimage.ca systemd[1]: Dependency failed for gitea.service - Gitea (Git with a cup of tea).
Nov 07 07:39:31 spineimage.ca systemd[1]: gitea.service: Job gitea.service/start failed with result 'dependency'.
Nov 07 07:39:31 spineimage.ca systemd[1]: srv-gitea.mount: Job srv-gitea.mount/start failed with result 'dependency'.
```

i.e. /srv/gitea wasn't mounted, so gitea wasn't running. Can we make this more reliable somehow?

After a second reboot, it came up fine. So I don't know, maybe it was a fluke.
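
For the record, one option (an assumption on my part, nothing has been changed yet): if /srv/gitea is mounted through /etc/fstab, the entry could be given a longer device timeout and marked `nofail` so a slow-to-appear volume doesn't hold up or fail the rest of boot. A sketch, with the filesystem type guessed:

```
# /etc/fstab -- UUID taken from the log above; ext4 is a guess
UUID=2067a784-07ef-4317-88d0-4591442577d1  /srv/gitea  ext4  defaults,nofail,x-systemd.device-timeout=5min  0  2
```

`x-systemd.device-timeout=` raises how long systemd waits for the device before giving up, and `nofail` keeps the rest of boot from blocking on it; gitea.service would still wait on srv-gitea.mount, as the dependency in the log shows it already does.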