kousu opened this issue 3 years ago
bidirectional mirroring
│
│ ┌───────────────┐
┌────┼──────────────────────────▲│ │
│┼───────────────────────────────┤ data.site4.ca │
││ │ │
││ ├────────┬────┬─┘
││ ┌─────┴────────┘ │
││ │ │
││ │ │
┌───────────────┘▼┐ │ │
│ │ │ │
│ images.site5.ca │ │ │
│ │ │ │
└─────────────────┴─┐ │ │
│ │ ├──────mirroring
│ ├────notifying │
│ │ │
│ │ │
│ │ │
xxx│xxxxxxxxxxxxxxxxxxxxxx▼xxxxxx │
x┌─▼───────────────────────────┐x ▼
┌─────────────────┐ x│ │x ┌─────────────────┐
│ │ x│ data.praxisinstitute.org │x │imagerie.site7.ca│
│ spines.site6.ca ├─────────────────►│ │◄─────────┴─┬───────────────┘
│ │ x└─────────────────────────────┘x │
└──────────┬────┬─┘ xxxx▲xxxxxxx▲xxxxxxxxxxxxxxx▲xxxx │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │ ┌───────────────┐
│ │ │ │ ├───────────────┼──────┘ data.site3.ca │
│ └───────────────────────┼───────┼───────────────┴───────────────┼──────▼─┬─────────────┘
▼ │ │ │ │
┌────────────────┐ │ │ │ │
│ ├─────────────────────────┘ │ │ │
│ data.site1.ca │ │ │ │
│ │ │ │ │
└─────────────┬─▲┘ │ │ │
│ │ │ │ │
│ │ │ │ │
│ │ ┌────────────┴┐ │ │
│ └─────────────────────┤ │ │ │
│ │data.site2.ca◄──────────────────────────────┼────────┘
└──────────────────────►│ │ │
└─────────────┘◄─────────────────────────────┘
┌──────────────────┐ ┌──────────────────┐ ┌───────────────────┐ ┌───────────────────────────┐
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ PACS ├─────┬──────► BIDS ├─────┬────►│ data.example.ca ├─────┬────►│ data.praxisinstitute.ca │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ (data server) │ │ │ (portal) │
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ │
└──────────────────┘ │ └──────────────────┘ │ └───────────────────┘ │ └───────────────────────────┘
│ │ │
│ │ │
│ │ │
│ │ │
export scripts uploader notifier
written by each site. written by us written by us
(`git push`, `rsync -a`, etc) (cronjob: curl data.praxisinstitute.ca/notify ...)
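The notifier in this legend could be as small as a cron entry on each data server. A sketch (the `/notify` endpoint and its parameters are hypothetical, just to make the shape concrete):

```
# /etc/cron.d/praxis-notifier  (sketch; endpoint and parameters hypothetical)
# tell the portal about our dataset list once an hour
0 * * * *  root  curl -fsS "https://data.praxisinstitute.ca/notify?server=data.site1.ca"
```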
It's easy to spend a lot of money writing software from scratch. I don't think we should do that. I think what we should do is design some customizations of existing software and build packages that deploy them.
I have two options in mind for the data servers:
GIN is itself built on top of pre-existing open source projects: Gogs, git-annex, datalad, git, combined in a customized package. We would take it and further customize it. It is a little more sprawling than NextCloud. Being focused on neuroscience, we could easily upstream customizations we design back to them to help out the broader community.
NextCloud is a lot simpler to use than datalad. You can mount it onto Windows, Linux, and macOS as a cloud disk (via WebDAV). It also has a strong company behind it, lots of users, good apps. It's meant for more general use than science; actually, it was never designed for science. It would be harder to share any improvements we make to it; though we could publish our packages and any plugins back to the wider NextCloud ecosystem. It has some other weaknesses too.
With GIN, uploading can be done with `git`:
git remote add origin git@data.site1.ca:my-new-dataset.git
git push -u origin master # or, actually, maybe GIN forces you to first make the remote repo in its UI? Unsure
git annex sync --content
Downloading is just replacing the first two lines with `git clone`.
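As a sanity check of that flow, here is the same round trip run locally, with a bare repo standing in for the GIN remote and plain git standing in for the annex step (all paths are throwaway temp dirs):

```shell
# Stand-in for a GIN remote: a local bare repo instead of git@data.site1.ca.
set -e
tmp=$(mktemp -d)
git init -q --bare "$tmp/my-new-dataset.git"

# "uploading": commit locally, add the remote, push
git -c init.defaultBranch=master init -q "$tmp/work"
cd "$tmp/work"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial dataset"
git remote add origin "$tmp/my-new-dataset.git"
git push -q -u origin master

# "downloading": replace the first two lines with git clone
git clone -q "$tmp/my-new-dataset.git" "$tmp/download"
```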
Windows and macOS do not have a git client built in.
With NextCloud, uploading can be done with `davfs2` + `rsync`:
mount -t davfs2 data.site1.ca /mnt/data # something close to this anyway
rsync -av my-new-dataset /mnt/data/
Downloading is just reversing the arguments to `rsync`.
There's also `cadaver`, and Windows and macOS have WebDAV built in.
GIN is based on git, so it has very strong versioning.
There are `git fsck` and `git annex fsck` to validate that what's on-disk is as expected.
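For instance (`git fsck` checks the integrity of the object store; `git annex fsck` additionally re-checksums annexed file contents):

```shell
set -e
repo=$(mktemp -d)
git -c init.defaultBranch=master init -q "$repo"
cd "$repo"
echo "sub-01 T1w" > scan.txt
git add scan.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add scan"

# validates that every object in .git is present and uncorrupted;
# exits non-zero (and prints the bad objects) on corruption
git fsck --full
```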
NextCloud only supports weak versioning.
But maybe we can write a plugin that improves that. Somehow. We would have to figure out a way to mount an old version of a dataset.
NextCloud has federated ACLs built in: users on `data.site1.ca` can grant chosen users on `spines.site6.ca` access to specific folders `A/`, `B/` and `D/`.
I am unsure what GIN has; since it's based on Gogs, it probably has public/private/protected datasets, all the same controls that Github and Gitlab implement, but I don't think it supports federated ACLs. Federation with GIN might look like everyone having to have one account per site.
But maybe we could improve that; perhaps we could patch GIN to support federated ACLs as well. We would need to study how NextCloud does it, how GIN does it, and see where we can fit them in.
In a federated model, data-sharing is done bidirectionally: humans at each site grant each other access to data sets, one at a time.
We should encourage the actual data sharing to happen via mirroring, for the sake of encouraging resiliency in the network.
Gitlab supports mirroring; on https://gitlab.com/$user/$repo/-/settings/repository you will find the mirror settings.
We need to replicate this kind of UI in whatever we deploy.
For the portal, we can probably write most of it using a static site generator like hugo, plus a small bit of code to add (and remove) datasets.
The dataset list can be generated either by pushing or pulling: the data servers could notify the portal (this is how I've drawn the diagrams above) or the portal could connect to the data servers in a cronjob to ask them what datasets they have. Generally, the latter is more stable, but the former is more accurate.
It should be possible to keep the list of datasets, one per file, in a folder on the portal site, and have `hugo` automatically read them all and produce a big, cross-referenced, searchable index.
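A minimal sketch of that aggregation (file layout and field names are hypothetical): hugo would pick these up as data files under `data/` and render them with a `{{ range .Site.Data.datasets }}` template, but the same roll-up can be illustrated in plain shell:

```shell
set -e
portal=$(mktemp -d)
mkdir -p "$portal/data/datasets"

# one file per dataset, dropped off by the notifier (fields hypothetical)
cat > "$portal/data/datasets/site1-spine-mri.yaml" <<'EOF'
title: Spine MRI
server: data.site1.ca
EOF
cat > "$portal/data/datasets/site2-dwi.yaml" <<'EOF'
title: Diffusion MRI
server: data.site2.ca
EOF

# hugo would iterate over these in a template; the same aggregation in shell:
{
  echo "# Dataset index"
  for f in "$portal"/data/datasets/*.yaml; do
    title=$(sed -n 's/^title: //p' "$f")
    server=$(sed -n 's/^server: //p' "$f")
    echo "- $title ($server)"
  done
} > "$portal/index.md"
```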
We should provide an automated installer for each site to deploy the software parts. It should be as automated as possible: no more than 5 manual install steps.
We can build the packages in:
I think .deb is the smoothest option here; I have some experience using pacur, and we can use their package server to host the .debs. Whatever we pick, the important thing is that we deliver the software reproducibly and with as little manual customization as possible.
I think we should produce two packages:
- `praxis-data-server` - the per-site software
- `praxis-index-server` - the portal site

We might also need to produce uploader scripts; but because `praxis-data-server` will use standard upload protocols it's not as important to do so; moreover, because the uploader will be run directly by users, it will need to deal with cross-platform issues, which makes it harder to package. I think, at least as a first draft, we should just document what command lines will get your data uploaded.
- [ ] write specification for what a proper dataset looks like; probably BIDS, maybe with some extra restrictions
- [ ] deploy data server prototypes
- [ ] write data server customizations
- [ ] write uploader scripts?
- [ ] write PACS -> BIDS scripts that conform to that specification
- [ ] write portal site
- [ ] write data server -> portal notifier
  - [ ] sender
  - [ ] receiver
- [ ] write packages
- [ ] write uploading documentation
- [ ] write downloading documentation
- [ ] write ACL documentation
Is `datalad` (the GIN client) compatible with Windows? Is it compatible with Windows without using WSL?

We can build a federated data system on either GIN or NextCloud. Either one will require some software development to tune it for our use case, but much less than writing a system from scratch. Both are built on widely supported network protocols, which makes them cross-compatible and reliable, and avoids the cost of developing custom clients.
Which works out to about 18 or 19 sites that Praxis can fund this year.
I'm going to make some demos to make this more concrete.
I'm starting with NextCloud. I'm going to deploy 3 NextClouds and configure them with federation sharing.
The first thing to do is to get some hardware. Vultr.com has cheap VPSes. I bought three of them in Canada (the screenshot covered my datacenter selection; trust me, it's in Canada):
Notice I'm choosing Debian, but any Linux option would work.
Just gotta wait for them to deploy....
And they're up:
The second thing to do is set up DNS so these are live net-wide servers.
I went over to my personal DNS server and added
Now just gotta wait for that to deploy...
and they're up too:
[kousu@requiem ~]$ dig data1.praxis.kousu.ca
216.128.176.232
[kousu@requiem ~]$ dig data2.praxis.kousu.ca
216.128.179.150
[kousu@requiem ~]$ dig data3.praxis.kousu.ca
149.248.50.100
I just need to make sure I have access secured. I'm going to do two things:
I go to the VPS settings one at a time and grab the root passwords:
then I log in, and confirm the system looks about right:
[kousu@requiem ~]$ ssh root@data1.praxis.kousu.ca
root@data1.praxis.kousu.ca's password:
Linux data1.praxis.kousu.ca 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@data1:~# ls /home/
root@data1:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Then I use ssh-copy-id
to enroll myself:
[kousu@requiem ~]$ ssh-copy-id -i ~/.ssh/id_ed25519 root@data1.praxis.kousu.ca
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/kousu/.ssh/id_ed25519.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@data1.praxis.kousu.ca's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@data1.praxis.kousu.ca'"
and check to make sure that only the key(s) you wanted were added.
[kousu@requiem ~]$ ssh root@data1.praxis.kousu.ca
Linux data1.praxis.kousu.ca 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu May 20 17:59:53 2021 from 192.222.158.190
root@data1:~#
And now that that works, I disable root password login, which is a pretty important security baseline:
root@data1:~# sed -i 's/PermitRootLogin yes/#PermitRootLogin no/' /etc/ssh/sshd_config
root@data1:~# systemctl restart sshd
In a different terminal, without disconnecting (in case we need to do repairs), I verify this worked by checking:

- that I can still ssh in using the key:
[kousu@requiem ~]$ ssh root@data1.praxis.kousu.ca
Linux data1.praxis.kousu.ca 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu May 20 18:06:42 2021 from 192.222.158.190
- that, when I tell ssh to only use password auth, it rejects me:
[kousu@requiem ~]$ ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no root@data1.praxis.kousu.ca
root@data1.praxis.kousu.ca's password:
Permission denied, please try again.
I'm also going to add a `sudo` account, as a backup:
First, invent a password:
[kousu@requiem ~]$ pass generate -n -c servers/praxis/data1.praxis.kousu.ca
Then make the account:
root@data1:~# sed -i 's|/bin/sh|/bin/bash|' /etc/default/useradd
root@data1:~# useradd -m kousu
root@data1:~# passwd kousu
New password:
Retype new password:
passwd: password updated successfully
root@data1:~# usermod -a -G sudo kousu
Test the account:
[kousu@requiem ~]$ ssh data1.praxis.kousu.ca
kousu@data1.praxis.kousu.ca's password:
Linux data1.praxis.kousu.ca 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu May 20 18:14:38 2021 from 192.222.158.190
$ sudo ls
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
#1) Respect the privacy of others.
#2) Think before you type.
#3) With great power comes great responsibility.
[sudo] password for kousu:
$ groups
kousu sudo
$
So now I have two ways in: my ssh key as root, and a sudo account. The root password is disabled, and my own user password is lengthy and secure.
Now repeat the same for data2.praxis.kousu.ca and data3.praxis.kousu.ca.
Set system hostname -> already done by Vultr, thanks Vultr
(and repeat for each of the three)
root@data1:~# apt-get update && DEBIAN_FRONTEND=noninteractive apt-get upgrade -y
Hit:1 http://deb.debian.org/debian buster InRelease
Get:2 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:3 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:4 http://security.debian.org/debian-security buster/updates/main Sources [185 kB]
Get:5 http://security.debian.org/debian-security buster/updates/main amd64 Packages [289 kB]
Get:6 http://security.debian.org/debian-security buster/updates/main Translation-en [150 kB]
Fetched 740 kB in 0s (2283 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
(and repeat for each of the three)
unattended-upgrades
apt-get install unattended-upgrades
Configure it like I've done for our internal servers: enable regular updates, not just security ones, do updates once a week, enable auto-reboot.
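If I recall the option names right, that configuration is roughly the following (a sketch, not the exact file from our internal servers):

```
// /etc/apt/apt.conf.d/51praxis-upgrades  (sketch)
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "7";                 // run once a week
Unattended-Upgrade::Origins-Pattern {
        // regular updates, not just security ones
        "origin=Debian,codename=${distro_codename}-updates";
        "origin=Debian,codename=${distro_codename},label=Debian-Security";
};
Unattended-Upgrade::Automatic-Reboot "true";
```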
root@data1:~# hostname > /etc/mailname
root@data1:~# DEBIAN_FRONTEND=noninteractive apt-get install -y opensmtpd
root@data1:~# echo nick@kousu.ca >> ~root/.forward
test mailer:
in one terminal:
# journalctl -f -u opensmtpd
With the help of https://www.mail-tester.com/, in another:
root@data3:~# mail -s "testing outgoing" test-bxdanliiq@srv1.mail-tester.com
Cc:
Hi there, will this go through?
opensmtpd logs say:
May 21 01:16:43 data3.praxis.kousu.ca smtpd[5954]: 84dff16a26522d0b smtp event=connected address=local host=data3.praxis.kousu.ca
May 21 01:16:43 data3.praxis.kousu.ca smtpd[5954]: 84dff16a26522d0b smtp event=message address=local host=data3.praxis.kousu.ca msgid=1eb206e1 from=<root@data3.praxis.kousu.ca> to=<test-bxdanliiq@srv1.mail-tester.com> size=471 ndest=1 proto=ESMTP
May 21 01:16:43 data3.praxis.kousu.ca smtpd[5954]: 84dff16a26522d0b smtp event=closed address=local host=data3.praxis.kousu.ca reason=quit
May 21 01:16:44 data3.praxis.kousu.ca smtpd[5954]: 84dff16ec08d0171 mta event=connecting address=smtp+tls://94.23.206.89:25 host=mail-tester.com
May 21 01:16:44 data3.praxis.kousu.ca smtpd[5954]: 84dff16ec08d0171 mta event=connected
May 21 01:16:45 data3.praxis.kousu.ca smtpd[5954]: 84dff16ec08d0171 mta event=starttls ciphers=version=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, bits=256
May 21 01:16:45 data3.praxis.kousu.ca smtpd[5954]: smtp-out: Server certificate verification succeeded on session 84dff16ec08d0171
May 21 01:16:46 data3.praxis.kousu.ca smtpd[5954]: 84dff16ec08d0171 mta event=delivery evpid=1eb206e1eff6a367 from=<root@data3.praxis.kousu.ca> to=<test-bxdanliiq@srv1.mail-tester.com> rcpt=<-> source="149.248.50.100" relay="94.23.206.89 (mail-tester.com)" delay=3s result="Ok" stat="250 2.0.0 Ok: queued as 3B00EA0237"
May 21 01:16:56 data3.praxis.kousu.ca smtpd[5954]: 84dff16ec08d0171 mta event=closed reason=quit messages=1
and it got to mail-tester.com; but mail-tester scored it low because the DNS needs work:
occ
mailer needs: `sudo ufw allow 25/tcp`. This must be an upgrade on Vultr's end since the last time I set up servers.
Alternatives:
probably not good for data sharing; multiple users can't share a dataset? Or if they can, it requires connecting to the host server to set the version, using CLI tools.
I'm not sure I understand that. For example, in the context of NeuroPoly's internal data (which are currently versioned/distributed with git-annex), would it be considered "one user sharing a dataset"? And if so, would ZFS be limited for this specific use-case?
My (weak) understanding is that with zfs you have to do:

git commit ~= sudo zfs snapshot $ZFS_ROOT@$VERSION
git checkout $VERSION ~= sudo mount -t zfs $ZFS_ROOT@$VERSION $PATH
So actually, yes, a single zfs instance can be shared with multiple users, so long as everyone has a) direct `ssh` access to the data server, and b) `sudo` rights on that data server.
Alternately, an admin (i.e. you or me or Alex) could ssh in, mount a snapshot, and expose it more safely to users over afp://, smb://, nfs://, sshfs://. But then users need to be constantly coordinating with their sysadmins. Maybe that's okay for slow-moving datasets like the envisioned Praxis system but it would be pretty awkward for daily use here.
Although there is this: on Linux, you can do 'checkout' without mounting like this:
# cd $PATH/.zfs/snapshot/$VERSION
You still need `sudo` to make a commit; probably to make any change at all. But I guess we could make this work if the goal is just reproducibility?
It looks like we are going towards a centralized solution. In brief:
The question is: where to host this centralized server.
It's Demo day:
We have been allocated an account on https://arbutus.cloud.computecanada.ca/. Docs at https://docs.computecanada.ca/wiki/Cloud_Quick_Start.
I'm going to install GIN on it: https://gin.g-node.org/G-Node/Info/wiki/In+House
Let's see how fast I can do this.
xclip -selection clipboard ~/.ssh/id_ed25519.neuropoly.pub
and paste into https://arbutus.cloud.computecanada.ca/project/key_pairs -> Import Public Key; repeat for @jcohenadad's key from ansible
[kousu@requiem ~]$ ssh -i ~/.ssh/id_ed25519.neuropoly jcohen@206.12.93.20
The authenticity of host '206.12.93.20 (206.12.93.20)' can't be established.
ED25519 key fingerprint is SHA256:Z82G+UO/D3ZRJV53eQeaVt2rWSaVFmhLcEwbHO519Ig.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '206.12.93.20' (ED25519) to the list of known hosts.
jcohen@206.12.93.20: Permission denied (publickey).
[kousu@requiem ~]$ ssh -i ~/.ssh/id_ed25519.neuropoly root@206.12.93.20
root@206.12.93.20: Permission denied (publickey).
hm okay what's wrong?
Okay docs say I should be using the username "ubuntu". That doesn't work either.
It seems like it just hung? I deleted and remade the instance with
Name = praxx
Availability Zone = any (the docs said I shouldn't have changed this)
Source = Ubuntu-20.04.2-Focal-x64-2021-05
flavor = p2-3gb
everything else at defaults
I still can't get in though:
[kousu@requiem ~]$ ssh -i ~/.ssh/id_ed25519.neuropoly ubuntu@206.12.93.20
The authenticity of host '206.12.93.20 (206.12.93.20)' can't be established.
ED25519 key fingerprint is SHA256:nAE3NfUZ1R6uSdr3GUeuJPJ1gENAQdexM29r0EM8vxs.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '206.12.93.20' (ED25519) to the list of known hosts.
ubuntu@206.12.93.20: Permission denied (publickey).
[kousu@requiem ~]$ ssh -i ~/.ssh/id_rsa ubuntu@206.12.93.20
ubuntu@206.12.93.20: Permission denied (publickey).
Oh I see what the problem is:
Paste your public key (only RSA type SSH keys are currently supported).
drat.
But I...added my rsa key? And it's still not working?
[kousu@requiem ~]$ ssh -i ~/.ssh/id_rsa ubuntu@206.12.93.20
ubuntu@206.12.93.20: Permission denied (publickey).
hm.
The system log (which openstack will show you) says
[ 43.704728] cloud-init[1249]: ci-info: no authorized SSH keys fingerprints found for user ubuntu.
so, hm. Why?
Oh I missed this step:
Key Pair: From the Available list, select the SSH key pair you created earlier by clicking the upwards arrow on the far right of its row. If you do not have a key pair, you can create or import one from this window using the buttons at the top of the window (please see above). For more detailed information on managing and using key pairs see SSH Keys.
Delete and recreate with
Name = praxis-gin
Availability Zone = any (the docs said I shouldn't have changed this)
Source = Ubuntu-20.04.2-Focal-x64-2021-05
flavor = p2-3gb
keypair = nguenthe-requiem-rsa
It only allows you to init with a single keypair! Ah.
Got in:
[kousu@requiem ~]$ ssh-keygen -R 206.12.93.20
# Host 206.12.93.20 found: line 119
/home/kousu/.ssh/known_hosts updated.
Original contents retained as /home/kousu/.ssh/known_hosts.old
[kousu@requiem ~]$ ssh -i ~/.ssh/id_rsa ubuntu@206.12.93.20
The authenticity of host '206.12.93.20 (206.12.93.20)' can't be established.
ED25519 key fingerprint is SHA256:qJO/JofxCKeaGD71R5fxkGYlPBFAjfPOOPeeiWByqUc.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '206.12.93.20' (ED25519) to the list of known hosts.
Enter passphrase for key '/home/kousu/.ssh/id_rsa':
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-73-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Wed Jul 7 16:42:01 UTC 2021
System load: 0.3 Processes: 123
Usage of /: 6.5% of 19.21GB Users logged in: 0
Memory usage: 6% IPv4 address for ens3: 192.168.233.67
Swap usage: 0%
1 update can be applied immediately.
To see these additional updates run: apt list --upgradable
The list of available updates is more than a week old.
To check for new updates run: sudo apt update
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
ubuntu@praxis-gin:~$ sudo ls
ubuntu@praxis-gin:~$
ubuntu@praxis-gin:~$ sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y
[kousu@requiem ~]$ ssh root@joplin -- cat '~/.ssh/authorized_keys' | ssh ubuntu@206.12.93.20 -- sudo tee -a '/root
[kousu@requiem ~]$ ssh root@joplin -- cat '~/.ssh/authorized_keys' | ssh ubuntu@206.12.93.20 -- tee -a '~/.ssh/authorized_keys'
test: root@
[kousu@requiem ~]$ ssh -i ~/.ssh/id_ed25519.neuropoly root@206.12.93.20
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-73-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Wed Jul 7 16:49:02 UTC 2021
System load: 0.05 Processes: 128
Usage of /: 9.2% of 19.21GB Users logged in: 1
Memory usage: 10% IPv4 address for ens3: 192.168.233.67
Swap usage: 0%
0 updates can be applied immediately.
*** System restart required ***
Last login: Wed Jul 7 16:48:06 2021 from 104.163.172.27
root@praxis-gin:~# logout
Connection to 206.12.93.20 closed.
ubuntu@
[kousu@requiem ~]$ ssh -i ~/.ssh/id_ed25519.neuropoly ubuntu@206.12.93.20
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-73-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Wed Jul 7 16:49:08 UTC 2021
System load: 0.04 Processes: 127
Usage of /: 9.2% of 19.21GB Users logged in: 1
Memory usage: 10% IPv4 address for ens3: 192.168.233.67
Swap usage: 0%
0 updates can be applied immediately.
*** System restart required ***
Last login: Wed Jul 7 16:42:04 2021 from 104.163.172.27
[x] finish updates: ubuntu@praxis-gin:~$ sudo reboot
[x] Log back in
[x] Install docker, since that's how GIN is packaged:
ubuntu@praxis-gin:~$ sudo apt-get install docker.io
ubuntu@praxis-gin:~$ sudo systemctl enable --now docker
ubuntu@praxis-gin:~$ sudo usermod -a -G docker ubuntu # grant rights
ubuntu@praxis-gin:~$ logout
Connection to 206.12.93.20 closed.
Test:
[kousu@requiem ~]$ ssh -i ~/.ssh/id_rsa ubuntu@206.12.93.20
Enter passphrase for key '/home/kousu/.ssh/id_rsa':
Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-77-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Wed Jul 7 16:53:32 UTC 2021
System load: 0.67 Processes: 119
Usage of /: 11.2% of 19.21GB Users logged in: 0
Memory usage: 9% IPv4 address for docker0: 172.17.0.1
Swap usage: 0% IPv4 address for ens3: 192.168.233.67
0 updates can be applied immediately.
Last login: Wed Jul 7 16:50:59 2021 from 104.163.172.27
ubuntu@praxis-gin:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Start following https://gin.g-node.org/G-Node/Info/wiki/In+House:
[x] Install ubuntu@praxis-gin:~$ docker pull gnode/gin-web:live
[x] firewall again: GIN wants port 3000 and 2222 open, so: https://arbutus.cloud.computecanada.ca/project/security_groups -> Manage -> Add rules for 3000 and 2222 ingress
[x] Run it: ubuntu@praxis-gin:~$ docker run -p 3000:3000 -p 2222:22 -d gnode/gin-web:live
(NOTE: small bug in the instructions: they tell you to install the `:live` version but then to run the bare `gnode/gin-web` image, which in docker implies `:latest`.)
[x] Test it seems to be up:
the ports are listening:
ubuntu@praxis-gin:~$ sudo netstat -nlpt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:2222 0.0.0.0:* LISTEN 3520/docker-proxy
tcp 0 0 127.0.0.1:44435 0.0.0.0:* LISTEN 1376/containerd
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 536/systemd-resolve
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 611/sshd: /usr/sbin
tcp 0 0 0.0.0.0:3000 0.0.0.0:* LISTEN 3507/docker-proxy
tcp6 0 0 :::22 :::* LISTEN 611/sshd: /usr/sbin
[kousu@requiem ~]$ ssh -p 2222 206.12.93.20
The authenticity of host '[206.12.93.20]:2222 ([206.12.93.20]:2222)' can't be established.
ED25519 key fingerprint is SHA256:41ELnYTqwKKUzA9zMFSopXmi953gc+ZGco9f4vqvF3g.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '[206.12.93.20]:2222' (ED25519) to the list of known hosts.
kousu@206.12.93.20: Permission denied (publickey,keyboard-interactive).
(I don't have a key inside of GIN yet, so of course this fails, but it's listening)
I filled out the options like this:
[x] Give it a DNS name by logging into my personal DNS server (I don't have rights to dns://neuro.polymtl.ca) and mapping `A data1.praxis.kousu.ca -> 206.12.93.20`.
Verify:
[kousu@requiem ~]$ ping data1.praxis.kousu.ca
PING data1.praxis.kousu.ca (206.12.93.20) 56(84) bytes of data.
64 bytes from 206-12-93-20.cloud.computecanada.ca (206.12.93.20): icmp_seq=1 ttl=43 time=86.3 ms
64 bytes from 206-12-93-20.cloud.computecanada.ca (206.12.93.20): icmp_seq=2 ttl=43 time=78.6 ms
^C
--- data1.praxis.kousu.ca ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 78.619/82.471/86.324/3.852 ms
Change the hostname to match:
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ hostname
praxis-gin
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ sudo vi /etc/hostname
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ cat /etc/hostname
data1.praxis.kousu.ca
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ sudo hostname $(cat /etc/hostname)
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ hostname $(cat /etc/hostname)
hostname: you must be root to change the host name
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ hostname
data1.praxis.kousu.ca
7. [ ] TLS. Okay, TLS is always a hoot. Let's see if I can do this in 10 minutes, eh? I can front Gogs with an nginx reverse proxy.
Actually I have this already, I can just copy the config out of https://github.com/neuropoly/computers/tree/master/ansible/roles/neuropoly-tls-server
ubuntu@praxis-gin:~$ sudo apt-get install nginx dehydrated
nginx config:
ubuntu@praxis-gin:~$ sudo vi /etc/nginx/sites-available/acme
ubuntu@praxis-gin:~$ cat /etc/nginx/sites-available/acme
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name _;
# This glues together using both a reverse-proxy over to the dev server, while still letting ACME work
# https://serverfault.com/questions/768509/lets-encrypt-with-an-nginx-reverse-proxy
# Notice: this server { } listens to *all* hostnames, so any DNS record pointed at this box can be issued a ACME cert
location ^~ /.well-known/acme-challenge {
alias /var/lib/dehydrated/acme-challenges;
}
# enforce https
# so long as this is the only `server{}` run on port 80, all http connections get rewritten to https ones.
# ($host is pulled from the client's request, along with $request_uri, so this line works for *any* virtual host we care to make)
location / {
# 307 is a temporary redirect, to avoid causing bugs due to browser caching while developing this ability
# but 301 would be more efficient in the long term
return 307 https://$host$request_uri;
}
}
server {
# this is a copy of what's in "snippets/ssl.conf", but without claiming 'default_server'
# it is necessary in order to auto-verify the SSL config after deploying certificates.
listen 443 ssl;
listen [::]:443 ssl;
include "snippets/_ssl.conf";
}
ubuntu@praxis-gin:~$ sudo vi /etc/nginx/snippets/_ssl.conf
ubuntu@praxis-gin:~$ cat /etc/nginx/snippets/_ssl.conf
ssl_certificate /etc/ssl/acme/data1.praxis.kousu.ca/fullchain.pem;
ssl_certificate_key /etc/ssl/acme/data1.praxis.kousu.ca/privkey.pem;
gzip off; # anti-BREACH: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=773332
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers "HIGH:!aNULL"; # OpenBSD's recommendation: https://man.openbsd.org/httpd.conf
ssl_prefer_server_ciphers on;
ubuntu@praxis-gin:~$ cd /etc/nginx/sites-enabled/
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ sudo ln -s ../sites-available/acme
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ sudo rm default
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ ls -l
total 0
lrwxrwxrwx 1 root root 23 Jul 7 17:32 acme -> ../sites-available/acme
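One thing this nginx config still lacks for fronting GIN: a `proxy_pass` sending the https traffic through to the container on port 3000. Something like this would go inside the `server { listen 443 ... }` block (a sketch):

```
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto https;
        client_max_body_size 0;   # don't cap upload sizes; datasets are large
    }
```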
dehydrated config:
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ hostname | sudo tee /etc/dehydrated/domains.txt
data1.praxis.kousu.ca
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ cat /etc/dehydrated/conf.d/neuropoly.sh
AUTO_CLEANUP=yes
# TODO: set this to the sysadmin mailing list: https://github.com/neuropoly/computers/issues/39
CONTACT_EMAIL=neuropoly@googlegroups.com
CERTDIR=/etc/ssl/acme
# it would be nice to use the default more efficient ECDSA keys
#KEY_ALGO=secp384r1
# but netdata is incompatible with them
KEY_ALGO=rsa
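One step not shown in this transcript: the certificates referenced in `_ssl.conf` have to actually be issued before nginx can serve them, which with dehydrated is roughly:

```
ubuntu@praxis-gin:~$ sudo dehydrated --register --accept-terms
ubuntu@praxis-gin:~$ sudo dehydrated --cron
```

(plus a cron job or systemd timer to re-run `dehydrated --cron` so certs renew before the 90-day expiry.)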
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ sudo systemctl restart nginx
Verify:
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ curl -v https://data1.praxis.kousu.ca
* Trying 206.12.93.20:443...
* TCP_NODELAY set
* Connected to data1.praxis.kousu.ca (206.12.93.20) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=data1.praxis.kousu.ca
* start date: Jul 7 16:40:47 2021 GMT
* expire date: Oct 5 16:40:46 2021 GMT
* subjectAltName: host "data1.praxis.kousu.ca" matched cert's "data1.praxis.kousu.ca"
* issuer: C=US; O=Let's Encrypt; CN=R3
* SSL certificate verify ok.
> GET / HTTP/1.1
> Host: data1.praxis.kousu.ca
> User-Agent: curl/7.68.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx/1.18.0 (Ubuntu)
< Date: Wed, 07 Jul 2021 17:43:21 GMT
< Content-Type: text/html
< Content-Length: 612
< Last-Modified: Tue, 21 Apr 2020 14:09:01 GMT
< Connection: keep-alive
< ETag: "5e9efe7d-264"
< Accept-Ranges: bytes
<
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
* Connection #0 to host data1.praxis.kousu.ca left intact
One thing not in ansible is the gogs reverse proxy part:
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ cat gogs
server {
server_name _;
listen 443 ssl;
listen [::]:443 ssl;
include "snippets/_ssl.conf";
location / {
proxy_set_header X-Real-IP $remote_addr;
proxy_pass http://127.0.0.1:3000/;
}
}
ubuntu@praxis-gin:/etc/nginx/sites-enabled$ cat acme
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name _;
# This glues together using both a reverse-proxy over to the dev server, while still letting ACME work
# https://serverfault.com/questions/768509/lets-encrypt-with-an-nginx-reverse-proxy
# Notice: this server { } listens to *all* hostnames, so any DNS record pointed at this box can be issued a ACME cert
location ^~ /.well-known/acme-challenge {
alias /var/lib/dehydrated/acme-challenges;
}
# enforce https
# so long as this is the only `server{}` run on port 80, all http connections get rewritten to https ones.
# ($host is pulled from the client's request, along with $request_uri, so this line works for *any* virtual host we care to make)
location / {
# 307 is a temporary redirect, to avoid causing bugs due to browser caching while developing this ability
# but 301 would be more efficient in the long term
return 307 https://$host$request_uri;
}
}
server {
# this is a copy of what's in "snippets/ssl.conf", but without claiming 'default_server'
# it is necessary in order to auto-verify the SSL config after deploying certificates.
#listen 443 ssl;
#listen [::]:443 ssl;
include "snippets/_ssl.conf";
}
NOTE: I disabled ssl in '/etc/nginx/sites-enabled/acme' because it was conflicting with gogs?? I don't know what's up with that. Gotta think through that more. Maybe ansible needs another patch.
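My guess at the conflict (untested): both 'acme' and 'gogs' define a catch-all HTTPS `server { }` on port 443, so nginx ends up with two competing vhosts for the same port. If that's it, one way out would be to fold the proxy into a single ssl server block, roughly:

```nginx
server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name _;
    include "snippets/_ssl.conf";

    location / {
        # forward the client's address to gogs, then hand the request over
        proxy_set_header X-Real-IP $remote_addr;
        proxy_pass http://127.0.0.1:3000/;
    }
}
```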
Upload your data to private repositories.
Synchronise across devices.
Securely access your data from anywhere.
Collaborate with colleagues.
Make your data public.
Make your data citable with the GIN DOI service.
And check the user's view (notice the TLS icon is there)
[ ] Figure out uploading via git. Gogs is running ssh on port 2222, which is... weird. But let's see if I can sort that out.
[kousu@requiem ~]$ ssh -i ~/.ssh/id_ed25519.neuropoly -p 2222 git@data1.praxis.kousu.ca
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
PTY allocation request failed on channel 0
Hi there, You've successfully authenticated, but GIN does not provide shell access.
Connection to data1.praxis.kousu.ca closed.
GREAT. Now can I make this permanent?
[kousu@requiem ~]$ vi ~/.ssh/config
[kousu@requiem ~]$ tail -n 6 ~/.ssh/config
Host *.praxis.kousu.ca
User git
Port 2222
IdentityFile ~/.ssh/id_ed25519.neuropoly
[kousu@requiem ~]$ ssh data1.praxis.kousu.ca
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
PTY allocation request failed on channel 0
Hi there, You've successfully authenticated, but GIN does not provide shell access.
Connection to data1.praxis.kousu.ca closed.
Awesome. Okay, can I use git with this?
Let's see if I can mirror our public dataset.
First, download it to my laptop (but not all of it, it's still pretty large; I ^C'd out of it):
Okay, now, make a repo on the new server: https://data1.praxis.kousu.ca/repo/create ->
Oh here's a bug; drat; I wonder if I can change the hostname gogs knows for itself, or if I need to rebuild it:
But if I swap in the right URL, and deal with git-annex being awkward, it works:
[kousu@requiem data-single-subject]$ git remote add praxis git@data1.praxis.kousu.ca:/jcohen/data-single-subject.git
[kousu@requiem data-single-subject]$ git push praxis master
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
Enumerating objects: 708, done.
Counting objects: 100% (708/708), done.
Delta compression using up to 4 threads
Compressing objects: 100% (407/407), done.
Writing objects: 100% (708/708), 142.68 KiB | 142.68 MiB/s, done.
Total 708 (delta 271), reused 708 (delta 271), pack-reused 0
remote: Resolving deltas: 100% (271/271), done.
To data1.praxis.kousu.ca:/jcohen/data-single-subject.git
* [new branch] master -> master
[kousu@requiem data-single-subject]$ git annex copy --to=praxis
(recording state in git...)
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
git-annex: cannot determine uuid for praxis (perhaps you need to run "git annex sync"?)
[kousu@requiem data-single-subject]$ git annex sync
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
commit
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
ok
pull praxis
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (2/2), 138 bytes | 138.00 KiB/s, done.
From data1.praxis.kousu.ca:/jcohen/data-single-subject
* [new branch] git-annex -> praxis/git-annex
ok
pull origin
ok
(merging praxis/git-annex into git-annex...)
(recording state in git...)
push praxis
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
Enumerating objects: 1852, done.
Counting objects: 100% (1852/1852), done.
Delta compression using up to 4 threads
Compressing objects: 100% (760/760), done.
Writing objects: 100% (1851/1851), 126.85 KiB | 15.86 MiB/s, done.
Total 1851 (delta 768), reused 1518 (delta 586), pack-reused 0
remote: Resolving deltas: 100% (768/768), done.
To data1.praxis.kousu.ca:/jcohen/data-single-subject.git
* [new branch] git-annex -> synced/git-annex
* [new branch] master -> synced/master
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
ok
push origin
Username for 'https://github.com': ^C
[kousu@requiem data-single-subject]$ git annex copy --to=praxis
copy derivatives/labels/sub-douglas/anat/sub-douglas_T1w_RPI_r_labels-manual.nii.gz
You have enabled concurrency, but git-annex is not able to use ssh connection caching. This may result in multiple ssh processes prompting for passwords at the same time.
annex.sshcaching is not set to true
Enter passphrase for key '/home/kousu/.ssh/id_ed25519.neuropoly':
(to praxis...)
ok
copy derivatives/labels/sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi_moco_dwi_mean_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_labels-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-oxfordFmrib/anat/sub-oxfordFmrib_T1w_RPI_r_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-perform/anat/sub-perform_T1w_RPI_r_labels-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-perform/anat/sub-perform_T1w_RPI_r_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-perform/dwi/sub-perform_dwi_moco_dwi_mean_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-tokyo750w/dwi/sub-tokyo750w_dwi_moco_dwi_mean_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-tokyoSigna2/anat/sub-tokyoSigna2_T1w_RPI_r_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-tokyoSigna2/dwi/sub-tokyoSigna2_dwi_moco_dwi_mean_seg-manual.nii.gz (to praxis...)
ok
copy derivatives/labels/sub-ucl/anat/sub-ucl_T1w_RPI_r_labels-manual.nii.gz (to praxis...)
ok
copy sub-chiba750/anat/sub-chiba750_T1w.nii.gz (to praxis...)
ok
copy sub-chiba750/anat/sub-chiba750_T2star.nii.gz (to praxis...)
ok
copy sub-chiba750/anat/sub-chiba750_T2w.nii.gz (to praxis...)
ok
copy sub-chiba750/anat/sub-chiba750_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-chiba750/anat/sub-chiba750_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-chiba750/anat/sub-chiba750_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-chiba750/dwi/sub-chiba750_dwi.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/anat/sub-chibaIngenia_T1w.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/anat/sub-chibaIngenia_T2star.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/anat/sub-chibaIngenia_T2w.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/anat/sub-chibaIngenia_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/anat/sub-chibaIngenia_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/anat/sub-chibaIngenia_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-chibaIngenia/dwi/sub-chibaIngenia_dwi.nii.gz (to praxis...)
ok
copy sub-douglas/anat/sub-douglas_T1w.nii.gz (to praxis...)
ok
copy sub-douglas/anat/sub-douglas_T2star.nii.gz (to praxis...)
ok
copy sub-douglas/anat/sub-douglas_T2w.nii.gz (to praxis...)
ok
copy sub-douglas/dwi/sub-douglas_dwi.nii.gz (to praxis...)
ok
copy sub-glen/anat/sub-glen_T1w.nii.gz (to praxis...)
ok
copy sub-glen/anat/sub-glen_T2star.nii.gz (to praxis...)
ok
copy sub-glen/anat/sub-glen_T2w.nii.gz (to praxis...)
ok
copy sub-glen/anat/sub-glen_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-glen/anat/sub-glen_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-glen/anat/sub-glen_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-glen/dwi/sub-glen_dwi.nii.gz (to praxis...)
ok
copy sub-juntendo750w/anat/sub-juntendo750w_T1w.nii.gz (to praxis...)
ok
copy sub-juntendo750w/anat/sub-juntendo750w_T2star.nii.gz (to praxis...)
ok
copy sub-juntendo750w/anat/sub-juntendo750w_T2w.nii.gz (to praxis...)
ok
copy sub-juntendo750w/anat/sub-juntendo750w_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-juntendo750w/anat/sub-juntendo750w_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-juntendo750w/anat/sub-juntendo750w_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-juntendo750w/dwi/sub-juntendo750w_dwi.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/anat/sub-juntendoAchieva_T1w.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/anat/sub-juntendoAchieva_T2star.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/anat/sub-juntendoAchieva_T2w.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/anat/sub-juntendoAchieva_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/anat/sub-juntendoAchieva_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/anat/sub-juntendoAchieva_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoAchieva/dwi/sub-juntendoAchieva_dwi.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/anat/sub-juntendoPrisma_T1w.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/anat/sub-juntendoPrisma_T2star.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/anat/sub-juntendoPrisma_T2w.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/anat/sub-juntendoPrisma_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/anat/sub-juntendoPrisma_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/anat/sub-juntendoPrisma_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoPrisma/dwi/sub-juntendoPrisma_dwi.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/anat/sub-juntendoSkyra_T1w.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/anat/sub-juntendoSkyra_T2star.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/anat/sub-juntendoSkyra_T2w.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/anat/sub-juntendoSkyra_acq-MToff_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/anat/sub-juntendoSkyra_acq-MTon_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/anat/sub-juntendoSkyra_acq-T1w_MTS.nii.gz (to praxis...)
ok
copy sub-juntendoSkyra/dwi/sub-juntendoSkyra_dwi.nii.gz (to praxis...)
ok
copy sub-mgh/anat/sub-mgh_T1w.nii.gz (to praxis...)
ok
(recording state in git...)
Using a non-standard ssh port is a problem. I know of five solutions:

1. Each user sets, every time they use the server:

   export GIT_SSH_COMMAND="ssh -p 2222"

   or, per-command:

   GIT_SSH_COMMAND="ssh -p 2222" git <subcommand> ....

2. Each user adds this once to each new machine, at the same time as they provide their ssh key:

   cat >> ~/.ssh/config <<EOF
   Host *.praxis.kousu.ca
   Port 2222
   EOF

3. ComputeCanada lets us allocate multiple IP addresses per machine. The registration form asks if you want 1 or 2. If we had a second IP address, we could bind one of them to the underlying OS and the other to GIN. Here's someone claiming to do this with Gitlab+docker: https://serverfault.com/a/951985

4. Just swap the two ports: move the host's sshd to 2222 (`Port 2222` in /etc/ssh/sshd_config) and give GIN port 22:

   docker run -p 3000:3000 -p 22:22 -d gnode/gin-web:live

   Then the sysadmins need to know to use

   ssh -p 2222 ubuntu@data1.praxis.kousu.ca

   when they need to log in to fix something. That will hopefully be pretty rare, though. They could even do this:

   cat >> ~/.ssh/config <<EOF
   Host sysadmin-data1.praxis
   HostName data1.praxis.kousu.ca
   Port 2222
   EOF

   And users don't need to do anything special.

5. The docker image comes with a built-in ssh server. If we install GIN on the base system and share the system ssh, there won't be a second port to worry about. This is more work because it requires rebuilding their package in a non-docker way. It's my preference, though: I would like to build a .deb so you can "apt-get install gin" and have everything Just Work. We could also make this package deploy dehydrated and nginx, as above, to save even more time for the users.
Demo day went well by the way. https://praxisinstitute.org/ seems happy to investigate this direction.
Shortcuts taken that should be corrected:

- dehydrated cronjobs to renew certs
- unattended-upgrades to keep the system up to date

All of these could be fixed quickly by bringing this server under ansible, but I wrote into the ansible scripts the assumption that all servers are under *.neuro.polymtl.ca, so I'd need to fix that first.
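For the record, the renewal cronjob could be as small as this (a sketch; the path and schedule here are assumptions, not what ansible will eventually deploy):

```
# /etc/cron.d/dehydrated -- hypothetical sketch
# re-sign any certs nearing expiry, then have nginx pick them up
0 4 * * 0  root  /usr/bin/dehydrated --cron && systemctl reload nginx
```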
Also
I've been working on adding this to the lab's configuration management today (for those who have access, that's at https://github.com/neuropoly/computers/pull/227). To that end, I'm re-purposing the resources allocated to praxis-gin to be for vaughan-test.neuropoly.org, which will be our dev server for data.neuropoly.org.
And https://data.neuropoly.org will be just as good a demo for Praxis the next time we talk with them. And with the ansible work you're doing it will even be reproducible for them to build their own https://data.praxisinstitute.org :)
Some competitors:
Possible collaborators:
Portals (where we could potentially get ourselves listed, especially if we help them out by making sure we have APIs available):
Part of our promise of standards-compliant security was to run fail2ban, but maybe pam_faillock is an easier choice (https://github.com/neuropoly/computers/issues/168#issuecomment-1008239662). But perhaps not.
We had a meeting with Praxis today:
Some other related work that came up:
@taowa is going to contact ComputeCanada asking them to extend our allocation on https://docs.computecanada.ca/wiki/Cloud_resources#Arbutus_cloud_.28arbutus.cloud.computecanada.ca.29 from 2 VPSes to 3 -- one for data-test.neuropoly.org, one for data.neuropoly.org, and one for data.praxisinstitute.org.
We've done a lot of work on this in our private repo at https://github.com/neuropoly/computers/issues/167 (the praxis-specific part is at https://github.com/neuropoly/computers/pull/332). We've got an ansible deployment and a fork of gitea (https://github.com/neuropoly/gitea/pull/1/), and we have a demo server at https://data.praxisinstitute.org.dev.neuropoly.org/. Eventually we will want to extract those ansible scripts and publish them on Galaxy.
Today we talked to Praxis and David Cadotte again and got an update on how their data negotiations are going:
Each site is very different and needs help adapting to their environment; they have different PACS, different OSes, different levels of familiarity with the command line. David has been spending time giving tech support to some of the sites' curators to help get their data in BIDS format. We have created a wiki here to gather the information David has been teaching and anything we learn during our trial dataset uploads; it's here on GitHub but could be migrated to https://data.praxisinstitute.org, once that's live (and eventually perhaps these docs could even be rolled into the ansible deployment, as a standard part of Neurogitea?).
We will be in touch with Praxis's IT team in the next couple weeks so we can migrate https://data.praxisinstitute.org.dev.neuropoly.org -> https://data.praxisinstitute.org.
We got some branding feedback from Praxis Institute for the soon-to-be https://data.praxisinstitute.org:
We have received some feedback from our director of marketing, and she really liked the website header colors (no need to change the text). She did provide a couple of suggestions:
- The logo resolution is appearing quite low on the website header (compared to the tagline text), could you please adjust it to look more as in the svg logo file?
- Is it possible to use a word mark for Neurogitea? Something similar to the one attached would be great!
- For paragraph text, will it be possible to have a white or very light background with dark text colour (black, dark grey, etc.)?
(Note that the current demo simply uses the arc-green
theme which comes bundled with Gitea, as described in this section of the Gitea customization docs.)
You extracted wordmark.svg! Niiice.
On my end, I emailed R. Foley at Praxis to ask to get DNS reassigned to our existing ComputeCanada instances, so that we will have https://data.praxisinstitute.org in place of https://data.praxisinstitute.org.dev.neuropoly.org.
EDIT: R. Foley got back to say that they don't want to give us praxisinstitute.org, but will talk to their marketing team and decide on an appropriate domain we can have.
On the demo server I just saw this probe from Mongolia:
180.149.125.168 - - [07/May/2022:19:21:42 -0400] "GET /c/ HTTP/1.1" 307 180 "-" "Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"
so it occurs to me that maybe we should impose geoblocking on Praxis's server. It's meant to be a pan-Canada project, so maybe we should impose firewall rules that actually enforce that it's only pan-Canadian. That's a bit tricky to do; I guess we can extract IP blocks from MaxMind's geoip-database and feed them into iptables?
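A rough sketch of that idea, assuming the GeoLite2-Country CSV layout (a `network,geoname_id,...` table; 6251999 is, I believe, Canada's geoname ID). The filenames here are invented for illustration:

```shell
# make a tiny stand-in for the real GeoLite2-Country-Blocks-IPv4.csv download
cat > blocks.csv <<'EOF'
network,geoname_id
24.48.0.0/12,6251999
51.15.0.0/16,3017382
142.0.0.0/8,6251999
EOF

# keep only the networks whose geoname_id is Canada's
awk -F, '$2 == 6251999 {print $1}' blocks.csv > ca-cidrs.txt
cat ca-cidrs.txt
```

The resulting CIDRs could then (untested) go into an ipset — `ipset create canada hash:net`, one `ipset add canada <cidr>` per line — matched from iptables with something like `iptables -A INPUT -p tcp --dport 443 -m set ! --match-set canada src -j DROP`.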
* [ ] For point 3, we should be able to tweak `arc-green` ([source link](https://github.com/go-gitea/gitea/blob/main/web_src/less/themes/theme-arc-green.less)) into a new theme for Gitea to use. I'll note some colours used by [the Praxis Institute website](https://praxisinstitute.org/):
  * dark header background: #161616
  * light body background: #fefefe
  * wordmark text: #000000
  * paragraph text: #646464
  * blue logo: #00bed6
theme-praxis.css:
:root {
--color-body: #fefefe;
--color-text: #646464;
--color-primary: #00bed6; /* blue of the logo */
--color-primary-dark-1: #00a6bb;
--color-secondary: #b4008d; /* purple used to give some visual contrast */
--color-secondary-dark-1: #8a016c;
--color-warning-bg: #ffb600; /* yellow used as a tertiary colour */
--color-warning-bg-dark-1: #D99C03;
--color-menu: var(--color-body);
}
.following.bar.light {
/* nav bar is dark themed, in contrast to the rest of the site */
--color-body: #161616;
--color-text: #bbc0ca;
}
.following.bar.light .dropdown .menu {
/* but dropdowns within the navbar are back to light themed */
--color-body: #fefefe;
--color-text: #646464;
}
.ui.basic.green.buttons .button, .ui.basic.green.button {
color: var(--color-primary);
}
.ui.basic.green.buttons .button:hover, .ui.basic.green.button:hover {
color: var(--color-primary-dark-1);
}
.ui.green.buttons .button, .ui.green.button {
background-color: var(--color-primary);
}
.ui.green.buttons .button:hover, .ui.green.button:hover {
background-color: var(--color-primary-dark-1);
}
.ui.red.buttons .button, .ui.red.button {
background-color: var(--color-warning-bg);
}
.ui.red.buttons .button:hover, .ui.red.button:hover {
background-color: var(--color-warning-bg-dark-1);
}
Looks like:
I went one step slightly further than the default themes and themed the yes/no buttons as blue/yellow (matching colours I got off https://praxisinstitute.org/) instead of the default green/red.
I'll integrate this into ansible this afternoon.
After that I'll replace the logos.
* [ ] For point 1, it was pixelated because we had been using an earlier PNG version of the logo, but we have a nice SVG version now. I'm attaching two slightly modified versions of the logo here:
  * [praxis.svg](https://user-images.githubusercontent.com/928742/167062346-6c35192c-fd87-4052-8456-b005fd8dbab1.svg) is a version with black text, suitable for use on a light background (for example, the big main logo)
I followed the instructions:
make generate-images
But it came out badly:
After mulling for a while, I opened it up in Inkscape and saw the problem: the viewport is bizarrely too large. It's set to
viewBox="0 0 1520 1230"
With Inkscape helping me measure things, I worked out that the tight viewport is
viewBox="300 297 935 663"
So here's that file: praxis.svg
With this, it's a lot better:
* [praxis-rev.svg](https://user-images.githubusercontent.com/928742/167062353-93ce125e-5aac-48b7-80f7-e30ac9494946.svg) is a version with white text, suitable for use on a dark background (for example, in the page header if that stays dark).
Thanks, but I think I'm going to end up skipping praxis-rev.svg. For one thing, I'd prefer to control its colours via CSS in the theme file (which is a thing you can do with SVGs); for another, the logo is really way too small for the navbar with the text attached, so I'm just going to cut the text off and leave the blue butterfly-spine, which isn't light/dark sensitive.
And here's that file: logo.svg
With this, the navbar looks better:
but now the cover page is missing the title, because the cover logo and the navbar logo are the same file, so I need to separate the two and customize the cover page to know that. I have to do that anyway, to handle the wordmark part.
I put
cp praxis.svg custom/public/img/logo-home.svg
$ find . -name home.tmpl
./templates/org/home.tmpl
./templates/home.tmpl
./templates/repo/home.tmpl
$ mkdir -p custom/templates
$ cp templates/home.tmpl custom/templates/
$ vi custom/templates/home.tmpl
And made this:
{{template "base/head" .}}
<div class="page-content home">
<div class="ui stackable middle very relaxed page grid">
<div class="sixteen wide center aligned centered column">
<div>
<img class="logo" width="220" height="220" src="{{AssetUrlPrefix}}/img/logo-home.svg"/>
</div>
<div class="hero">
<h1 class="ui icon header title">
{{AppName}}
</h1>
<h2>{{.i18n.Tr "startpage.app_desc"}}</h2>
</div>
</div>
</div>
</div>
{{template "base/footer" .}}
I removed the stock startpage blurbs from the template:

- {{.i18n.Tr "startpage.install_desc" | Str2html}}
- {{.i18n.Tr "startpage.platform_desc" | Str2html}}
- {{.i18n.Tr "startpage.lightweight_desc" | Str2html}}
- {{.i18n.Tr "startpage.license_desc" | Str2html}}

And now I've got
Which seems to be coming along nicely.
And finally I regenerated the images
cp logo.svg assets/logo.svg
make generate-images
cp public/img/{apple-touch-icon.png,avatar_default.png,{favicon,logo}.{png,svg}} custom/public/img # or somewhere else to stage it
EDIT: it turns out that, as of yesterday, there's an extra step:
cp logo.svg assets/logo.svg
cp assets/logo.svg assets/favicon.svg # see https://github.com/go-gitea/gitea/pull/18542
make generate-images
cp public/img/{apple-touch-icon.png,avatar_default.png,{favicon,logo}.{png,svg}} custom/public/img # or somewhere else to stage it
* [ ] For point 2, I managed to get an SVG version of the wordmark (that is, the word "Neurogitea" but in a fancy font) which doesn't depend on specific fonts being installed on the viewer's computer: [wordmark.svg](https://user-images.githubusercontent.com/928742/167062360-18c799d9-4da6-4911-bfa6-a064f7e115a3.svg)
For this, I put
cp wordmark.svg custom/public/img/neurogitea-wordmark.svg
And did this patch to what I had above:
diff --git a/custom/public/css/theme-praxis.css b/custom/public/css/theme-praxis.css
index a1665744f..d8cf3faea 100644
--- a/custom/public/css/theme-praxis.css
+++ b/custom/public/css/theme-praxis.css
@@ -47,3 +47,9 @@
.ui.red.buttons .button, .ui.red.button:hover {
background-color: var(--color-warning-bg-dark-1);
}
+
+/* the neurogitea wordmark needs some CSS resets to display properly */
+.ui.header > img.logo {
+ max-width: none;
+ width: 500px;
+}
diff --git a/custom/templates/home.tmpl b/custom/templates/home.tmpl
index d7d1d8501..2aaaa176b 100644
--- a/custom/templates/home.tmpl
+++ b/custom/templates/home.tmpl
@@ -7,7 +7,7 @@
</div>
<div class="hero">
<h1 class="ui icon header title">
- {{AppName}}
+ <img class="logo" src="{{AssetUrlPrefix}}/img/neurogitea-wordmark.svg"/>
</h1>
<h2>{{.i18n.Tr "startpage.app_desc"}}</h2>
</div>
And now I've got
Some other praxis-specific things to include:
# app.ini
APP_NAME = Neurogitea
[ui]
THEMES = praxis
DEFAULT_THEME = praxis
[ui.meta]
AUTHOR = "Praxis Spinal Cord Institute"
DESCRIPTION = "Neurogitea connects spinal cord researchers with each other's data"
KEYWORDS = "bids,data sharing,git-annex,datalad,git-lfs,reproducible science" # ?
Theming is sitting on https://github.com/neuropoly/computers/pull/332/commits/b365da08c69c67509bbcdcbffe3348cda521cfd0 (sorry it's in the private repo; extracting and publishing to Ansible Galaxy will be a Real-Soon-Now goal)
> On my end, I emailed R. Foley at Praxis to ask to get DNS reassigned to our existing ComputeCanada instances, so that we will have https://data.praxisinstitute.org in place of https://data.praxisinstitute.org.dev.neuropoly.org.
> EDIT: R. Foley got back to say that they don't want to give us praxisinstitute.org, but will talk to their marketing team and decide on an appropriate domain we can have.
They've made a decision: spineimage.ca. I've asked them to assign
spineimage.ca 206.12.97.250
drone.spineimage.ca 206.12.93.20
When that's done, I'll add those domains in https://github.com/neuropoly/computers/pull/332; and then we should maybe think about pointing data.praxisinstitute.org.dev.neuropoly.org back at some servers on Amazon again to host a staging server we can use without knocking out their prod server.
We had a meeting today with Praxis, including the first trial data curator.
David Cadotte had helped her already curate the dataset into BIDS. We successfully uploaded it to https://spineimage.ca/TOH/site_03.
David Cadotte has a draft curator tutorial :lock_with_ink_pen: . I started the same document on the wiki here but his is further along.
The next step is that David, the trial curator, and I are going to upload a trial dataset to https://data.praxisinstitute.org.dev.neuropoly.org/ together. We will be picking a meeting time via Doodle soon.
The curator has been using bids-validator, but it sounded like they were using the python version, not the javascript one. The javascript one is incomplete but the python version is even more incomplete. This is something I should check on when we get together to upload the dataset.
In parallel, we will finish up migrating to spineimage.ca, the "prod" site, and sometime next month we should have 4 - 5 curators ready.
We'll have to repurpose the existing VMs to become prod. But I would like to keep the staging site so we can have something to experiment on. I could experiment locally, but I don't have an easy way to turn off or mock https, so it's simpler just to have a mock server with a real cert from LetsEncrypt. I'll rename it spineimage.ca.dev.neuropoly.org
But there's a problem: ComputeCanada has given us three VMs but only two public IPs, and the current version of neurogitea needs two IPs per deployment.
Some ideas: autossh (or maybe even wireguard?)
Summary for today: we were able to connect with Lisa J. from site_012, and had some pretty good success.
Also, we noticed that for site_03, the participants.json file contains row data, which should only be in participants.tsv, so we should open an issue with Maryam to fix it, and make the curation documentation clearer on that point.
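For clarity: in BIDS, participants.tsv holds the per-subject rows, while participants.json only describes the columns, along these lines (the values here are invented for illustration):

```
# participants.tsv -- one row per subject
participant_id	age	sex
sub-ott002	34	F

# participants.json -- column metadata only, no per-subject rows
{
    "age": {"Description": "Age of the participant", "Units": "years"},
    "sex": {"Description": "Sex of the participant", "Levels": {"M": "male", "F": "female"}}
}
```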
Lisa didn't yet have her dataset curated. We got halfway through curation, and did not start at all on uploading. Also like @mguaypaq said, we were figuring out Windows remotely as we went, never having used much of this software stack ourselves there.
We didn't have Administrator on the computer she was working on. The git installer was able to handle this by itself, but we had to tweak the installer settings for both git-annex and python to make sure they installed to C:\Users\%USERNAME%\AppData\Local\Programs and didn't try to install anything system-wide.
dcm2niix continues to be tricky because it doesn't have an installer, just zip files of binaries. We put it in C:\Users\%USER%\bin, because git-bash has that on its $PATH, but it's unclear if that's a good long-term recommendation. It's in apt and brew, and there's a conda package that could be used on Windows, if we were to get people to install conda first.
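For now, the manual install we walk curators through can be sketched like this (the release asset name and the ~/bin location are our conventions, not anything official from dcm2niix):

```shell
# Put dcm2niix in ~/bin, which git-bash already has on $PATH.
mkdir -p "$HOME/bin"
# Download and unpack the Windows release zip (network steps commented out
# here; run them in git-bash):
# curl -LO https://github.com/rordenlab/dcm2niix/releases/latest/download/dcm2niix_win.zip
# unzip -o dcm2niix_win.zip -d "$HOME/bin"
# Sanity check that ~/bin will actually be searched:
case ":$PATH:" in
  *":$HOME/bin:"*) echo "~/bin is on PATH" ;;
  *)               echo "warning: add ~/bin to PATH first" ;;
esac
```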
By the way, we skipped using pycharm or a virtualenv, and that worked fine. Our curators are not often developers, so explaining virtualenvs is a whole extra can of worms that derails the training. venv only helps when you're developing many python projects and those projects have incompatible dependencies; regular end users should just be able to pip install anything and mostly have things work (and if they don't, it should be a bug on the shoulders of the developers of that software).
Today we had a meeting for 2 hours.
We gave @jcohenadad an account on https://spineimage.ca/, and made sure David Cadotte and our site_012 curator remembered their passwords too.
David Cadotte helped the site 12 curator use dcm2bids_helper and dcm2bids to construct and test a dcm2bids config file. It is a tedious process:
1. Run dcm2bids_helper -d sourcedata/$subject_id --force.
2. Inspect the generated tmp_dcm2bids/helper/*.json files.
3. Write a code/dcm2bids_config.json that matches (SeriesDescription, ProtocolName) from those JSON files with (dataType, modalityLabel, customLabels).
4. Run dcm2bids -c code/dcm2bids_config.json -d sourcedata/$subject_id -p $sequential_id.
David estimated it takes half an hour per subject, even once the curator is fluent with it.
(note that the mapping between $subject_id and $sequential_id is secret, and maintained by individual curators and Praxis)
We decided to update the curation protocol by adding a site prefix to the subject IDs, e.g. hal019 for subject 19 from Halifax, ott002 for subject 2 from Ottawa.
@mguaypaq has helped me merge in HTTP downloads, which means we now can host open access datasets on any server we choose, in any country we can find one.
I've tested it by moving a complete copy of https://github.com/spine-generic/data-single-subject onto https://data.dev.neuropoly.org/: https://data.dev.neuropoly.org/nick.guenther/spine-generic-single/.
I went to its settings to make it Public:
then I downloaded anonymously, with the same commands we tell people to currently use against the GitHub copy:
Also to note: because Push-to-Create is turned on in our gitea config, there was very little fumbling around. A single git annex sync --content should be enough to upload everything, no messing with the web UI.
To emphasize again: we now have a fully alternate copy of https://github.com/spine-generic/data-single-subject. Identical download instructions work for it; all someone has to do is swap in https://data.dev.neuropoly.org/nick.guenther/spine-generic-single/. :tada: (fair warning though: this is a dev server). And with this arrangement, all bandwidth -- git and git-annex -- is paid for through DigitalOcean, instead of splitting the bill between GitHub and Amazon. And when we do promote this to a production server, we can get rid of the difficult contributor AWS credentials.
Unfortunately I've already found one bug that we missed in https://github.com/neuropoly/gitea/issues/19 but it's minor.
Backups
Put backups on https://docs.computecanada.ca/wiki/Arbutus_Object_Storage. Even if backups are encrypted, the data sharing agreement says backups need to stay within the cluster.
I'm working on this now. That wiki page is helpful but I still have to fill in some details, which I am writing down here:
Get dependencies.
You need the openstack CLI. I'm on Arch but this should be portable to Ubuntu:
sudo pacman -S python-openstackclient
Go to https://arbutus.cloud.computecanada.ca/auth/login/?next=/project/ and log in
Download the OpenStack RC file from it.
Load the OpenStack RC file.
$ . ~/def-jcohen-dev-openrc.sh
Please enter your OpenStack Password for project def-jcohen-dev as user nguenthe: [ TYPE ARBUTUS PASSWORD HERE ]
Create an S3 token.
Tokens are something I can give out to the backup bot without compromising my complete account.
openstack ec2 credentials create
Generate a restic password.
Now, move to the target server and install restic.
Still on the target server, provide the credentials to restic:
Create the backup repo
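Concretely, the credentials file and one-time repo creation look something like this (the bucket name, file path, and credential values are placeholders; the S3 endpoint is the one documented on the Arbutus wiki page):

```shell
# Store the restic/S3 credentials in one env file, readable only by the
# backup user. All values below are placeholders.
mkdir -p ~/.config/restic
cat > ~/.config/restic/arbutus <<'EOF'
RESTIC_REPOSITORY=s3:https://object-arbutus.cloud.computecanada.ca/my-backup-bucket
RESTIC_PASSWORD=xxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_ACCESS_KEY_ID=aaaaaaaaaaaaaaaaaaaa
AWS_SECRET_ACCESS_KEY=kkkkkkkkkkkkkkkkkkkk
EOF
chmod 600 ~/.config/restic/arbutus
# One-time repo creation (commented out: needs real credentials):
# (set -a; . ~/.config/restic/arbutus; restic init)
```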
At this point, we only need to keep the restic config. We can toss the OpenStack RC file; if we need it again, we can grab it again.
Now how to interface Gitea with restic?
According to https://docs.gitea.io/en-us/backup-and-restore, backing up Gitea is the standard webapp process: dump the database and save the data folder. It has a gitea dump command, but warns:
Gitea admins may prefer to use the native MySQL and PostgreSQL dump tools instead. There are still open issues when using XORM for dumping the database that may cause problems when attempting to restore it.
and I indeed ran into this while experimenting a few months ago: you cannot restore a gitea dump, especially if you've gone through a few Gitea versions since taking the backup, so I don't trust gitea dump; all it does is basically run pg_dump > data/gitea-db.sql and then zip the data/ folder. Plus, zipping is slow (it zips the repos!) and may actually make restic work worse by interfering with its own compression.
Also, restoring is a manual process:
There is currently no support for a recovery command. It is a manual process that mostly involves moving files to their correct locations and restoring a database dump.
So I'm going to ignore gitea dump and write my own backup script.
From the docs:
Buckets are owned by the user who creates them, and no other user can manipulate them.
I can make tokens for everyone who will be adminning this server, but as far as ComputeCanada is concerned, all of them are me. I don't know how to handle this. Maybe when I eventually hand this off to someone else we'll have to copy all the backups to a new bucket.
According to the restic docs, it can talk to OpenStack Swift directly, without using the S3 protocol. But I got it working through S3 and I think that's fine.
My existing backup scripts on data.neuro.polymtl.ca (#20) look like:
git@data:~$ cat ~/.config/restic/s3
RESTIC_REPOSITORY=s3:s3.ca-central-1.amazonaws.com/data.neuro.polymtl.ca.restic
RESTIC_PASSWORD=xxxxxxxxxxxxxxxxxxxxxxxxxxx
AWS_ACCESS_KEY_ID=aaaaaaaaaaaaaaaaaaaa
AWS_SECRET_ACCESS_KEY=kkkkkkkkkkkkkkkkkkkk
git@data:~$ cat ~/.config/restic/CC
RESTIC_REPOSITORY=sftp://narval.computecanada.ca/:projects/def-jcohen/data.neuro.polymtl.ca.restic
RESTIC_PASSWORD=xxxxxxxxxxxxxxxxxxxxxxxxxxxx
git@data:~$ cat /etc/cron.d/backup-git
# daily backups
0 2 * * * git (set -a; . ~/.config/restic/s3; cd ~; chronic restic backup --one-file-system repositories)
0 3 * * * git (set -a; . ~/.config/restic/CC; cd ~; chronic restic backup --one-file-system repositories)
# backup integrity checks
0 4 */3 * * git (set -a; . ~/.config/restic/s3; chronic restic check --read-data-subset=1/27)
0 5 */3 * * git (set -a; . ~/.config/restic/CC; chronic restic check --read-data-subset=1/9)
# compressing backups by pruning
0 6 * * * git (set -a; . ~/.config/restic/s3; chronic restic forget --prune --keep-daily 7 --keep-weekly 5 --keep-monthly 12 --keep-yearly 3)
0 7 * * * git (set -a; . ~/.config/restic/CC; chronic restic forget --prune --keep-daily 7 --keep-weekly 5 --keep-monthly 12 --keep-yearly 3)
That's for Gitolite. Porting this to Gitea is tricky because of the required downtime. This requirement seems to complicate everything, because I want backups to run as gitea, but stopping the service forces part of the script to run as root; which means either the whole script needs to start as root and then drop privileges, or start as gitea and use sudo to gain privileges, and it's always risky to try to do limited grants with sudo.
I'm tempted to ignore this requirement. I did some experiments and found that git annex sync --content transfers (which use rsync underneath) continue even after systemctl stop gitea, and git push transfers do too, so there's no way to get a 100% consistent snapshot anyway.
I'm going to compromise:
# daily backups
0 1 * * * root (systemctl stop gitea && su -c 'pg_dump gitea > ~gitea/gitea-db.sql' gitea; systemctl restart gitea)
0 2 * * * gitea (set -a; . ~/.config/restic/CC; cd ~; chronic restic backup --one-file-system gitea-db.sql data)
# backup integrity checks
0 4 */3 * * gitea (set -a; . ~/.config/restic/CC; chronic restic check --read-data-subset=5G)
# compressing backups by pruning
0 6 * * * gitea (set -a; . ~/.config/restic/CC; chronic restic forget --prune --keep-daily 7 --keep-weekly 5 --keep-monthly 12 --keep-yearly 3)
This way, while the database and contents of data/ may drift a little apart, the worst that will happen is there are some commits in some repos that are newer than the database, or there are some avatars or other attachments that the database doesn't know about.
I'm working on coding this up in https://github.com/neuropoly/computers/pull/434
EDIT: in that PR, I decided to ignore backup consistency. I think it will be okay. Only a very busy server would have problems anyway, which our servers will definitely not be, and I'm not even convinced it's that big a problem if the git repos and avatars are slightly out of sync with the database. And Gitea already has code to handle resyncing at least some cases, because digital entropy can always cause mistakes. I think at worst, it may fall back to using an older avatar for one person.
I merged neurogitea backups today and wanted to use them for this prod server. But first I had to upgrade the OS: I used do-release-upgrade to upgrade drone.spineimage.ca and spineimage.ca.
There was a snag: the upgrade killed postgres-12 and replaced it with postgres-14; it sent me an email warning me to run pg_upgradecluster 12 main before continuing, but I ignored that and ran apt-get autopurge overeagerly. So I lost the database. :cry:
Luckily, I had backups from December (taken above). I did apt-get purge postgresql-common, redeployed, and then followed my own docs to get it back:
root@spineimage:~# systemctl stop gitea
root@spineimage:~# su -l gitea -s /bin/bash
$ bash
gitea@spineimage:~$ restic-no arbutus snapshots
repository 2d22bf7f opened successfully, password is correct
ID Time Host Tags Paths
---------------------------------------------------------------------------------
2547ebc9 2022-11-30 21:29:17 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
04e8abc1 2022-11-30 21:56:17 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
e66de5bb 2022-12-01 00:20:08 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
95325d1b 2022-12-01 00:20:29 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
4a530419 2022-12-01 00:20:57 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
ae839c6c 2022-12-01 00:26:27 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
197c4af8 2022-12-01 00:26:47 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
48f35777 2022-12-01 00:27:49 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
0ea0845d 2022-12-01 00:28:49 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
342d08dd 2022-12-01 01:18:25 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
221b5622 2022-12-01 01:20:02 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
4c3d1c67 2022-12-01 01:55:42 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
5b49742d 2022-12-01 01:56:52 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
8cc26371 2022-12-01 01:57:49 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
420e46d9 2023-02-09 23:02:48 spineimage.ca /srv/gitea/data
/srv/gitea/gitea-db.sql
---------------------------------------------------------------------------------
15 snapshots
gitea@spineimage:~$ restic-no arbutus restore latest --include gitea-db.sql --target /tmp/r
repository 2d22bf7f opened successfully, password is correct
restoring <Snapshot 420e46d9 of [/srv/gitea/gitea-db.sql /srv/gitea/data] at 2023-02-09 23:02:48.947750225 -0500 EST by gitea@spineimage.ca> to /tmp/r
gitea@spineimage:~$ psql gitea < /tmp/r/gitea-db.sql
SET
SET
SET
SET
SET
set_config
------------
(1 row)
SET
SET
SET
SET
SET
SET
ERROR: relation "access" already exists
[...]
ERROR: relation "UQE_watch_watch" already exists
ERROR: relation "UQE_webauthn_credential_s" already exists
gitea@spineimage:~$ exit
Unfortunately I didn't take the backup with pg_dump --clean --if-exists, so I got all these errors. So I manually recreated an empty DB:
root@spineimage:~# sudo -u postgres psql
psql (14.6 (Ubuntu 14.6-0ubuntu0.22.04.1))
Type "help" for help.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+---------+-----------------------
gitea | gitea | UTF8 | C.UTF-8 | C.UTF-8 |
postgres | postgres | UTF8 | C.UTF-8 | C.UTF-8 |
template0 | postgres | UTF8 | C.UTF-8 | C.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C.UTF-8 | C.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(4 rows)
postgres=# drop database gitea;
DROP DATABASE
postgres=# create database gitea with owner gitea;
CREATE DATABASE
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+---------+-----------------------
gitea | gitea | UTF8 | C.UTF-8 | C.UTF-8 |
postgres | postgres | UTF8 | C.UTF-8 | C.UTF-8 |
template0 | postgres | UTF8 | C.UTF-8 | C.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C.UTF-8 | C.UTF-8 | =c/postgres +
| | | | | postgres=CTc/postgres
(4 rows)
postgres=# \q
and reloaded again:
root@spineimage:~# su -l gitea -s /bin/bash
gitea@spineimage:~$ psql gitea < /tmp/r/gitea-db.sql
SET
[...]
CREATE INDEX
gitea@spineimage:~$ exit
root@spineimage:~# systemctl restart gitea
The redeploy above took care of upgrading Gitea, which went smoothly. It is now at 1.18.3+git-annex-cornerstone and it does the inline previews (but I can't demo that here because it's private data).
I just noticed that ComputeCanada is affiliated with https://www.frdr-dfdr.ca/, sponsored by the federal government. They use Globus to upload data where we use git-annex. Should we consider recommending that instead of git?
They don't seem to do access control:
Anyone may use FRDR to search for and download datasets. You do not need to have a Globus Account affiliated with a Canadian postsecondary institution to download datasets in FRDR using Globus.
which rules them out for our use case. Also, I'm fairly sure Globus doesn't do versioning: browse the repository list at https://www.frdr-dfdr.ca/discover/html/repository-list.html?lang=en and open, say, https://www.frdr-dfdr.ca/repo/dataset/6ede1dc2-149b-41a4-9083-a34165cb2537; nothing there is labelled "versions" as far as I can see.
Today our site 12 curator was able to finish a dcm2bids config file and curate all subjects from her site, with David's advice. We included several t2starw images that David initially thought we should drop, until we realized they made up a large portion of Lisa's dataset.
One subject was dropped (the previous sub-hal001) and the other subject IDs were renumbered to start counting at sub-hal001.
There was also some debate about whether to tag sagittal scans with acq-sag or acq-sagittal. At neuropoly we've used acq-sagittal, but Ottawa's dataset uses acq-sag; we will standardize internally on acq-sag to match.
Right now curators have to manually run dcm2bids for each subject. @valosekj, @mguaypaq and I think it should be possible to write the loop for this, and put it in a script in each dataset's code/ folder. That would make curation more robust. @valosekj pointed out we can store the imaging IDs (the IDs used for each dataset's sourcedata/${ID}) in participants.tsv, using the source_id column that we've standardized on. I think we can probably write a loop script shared between curators that reads their participants.tsv to know which sourcedata/ folder to look at (and maybe even patch dcm2bids with a batch mode that understands doing just that). That would be a lot more reliable than having curators run through the curation subject by subject every time they tweak the config file.
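A minimal sketch of such a wrapper (the column order in participants.tsv and the file locations are assumptions; it only prints the commands, so a curator can review them before piping to sh):

```shell
# Emit one dcm2bids invocation per row of participants.tsv, looking up each
# subject's sourcedata/ folder via the source_id column.
dcm2bids_all() {
    # skip the header row; fields are tab-separated
    tail -n +2 participants.tsv | while IFS="$(printf '\t')" read -r participant_id source_id _rest; do
        echo dcm2bids -c code/dcm2bids_config.json \
            -d "sourcedata/$source_id" -p "${participant_id#sub-}"
    done
}
# Usage:
#   dcm2bids_all          # dry run: just prints the commands
#   dcm2bids_all | sh     # actually convert every subject
```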
We have not yet run bids-validator on the dataset.
We spent a while trying to get the curator set up to upload to spineimage.ca. I hoped we could start by committing and publishing the config file, then add subject data in steps, refining the config file with more commits from there. I hoped it would be quick, but we hit bugs and ran out of time for today. ssh is giving this error:
We double-checked by using a few different servers and ports, and also PuTTY, an entirely separate program that should have nothing to do with the ssh that comes with Windows Git Bash.
In all cases, a closed port times out, while an open port gives this error after a successful TCP handshake. I remember the connection working back in July when we were first in touch with this site, but since then their hospital IT has done upgrades and now there seems to be some sort of firewall in the way. I will follow up with the curator by email with specific debugging instructions she can relay to her IT department.
Today, because Halifax's IT department is apparently backlogged by months, we debugged further, but without success. We created a short python script that basically just implements printf 'GET /\r\n\r\n' | nc $HOST $PORT and ran it against a few combinations of hosts and ports. In all cases this worked, but an SSH client gets blocked. So SSH seems to be blocked as a protocol: connecting to port 443 on spineimage.ca via https://spineimage.ca:443 worked, but when I reconfigured the server to run ssh on that same port and we tried both ssh://spineimage.ca:443 and sftp://spineimage.ca:443, it timed out; similarly, ssh -v reports "connected" before hanging and reporting "software caused connection abort". So the TCP connection is allowed to port 22, but somehow the content of that connection is tripping it up. I suspect there is a deep-packet-inspecting firewall involved that is specifically blocking ssh :anger:.
As a workaround, we constructed a .tar.gz of the dataset and encrypted it using these tools: https://stackoverflow.com/a/16056298/2898673. The result is a 1 GB encrypted file, site_012.tar.gz.enc, which is on Lisa's office machine. Mathieu and I have the decryption key saved securely on our machines. The curator is going to try to use a file hosting service NSHealth runs. If that fails, it might be possible to create an Issue in https://spineimage.ca/NSHA/site_012/issues and drag-and-drop the file onto it (so it becomes an attachment there), and in the worst case, it can be mailed on a thumbdrive to us at
Julien Cohen-Adad Ecole Polytechnique, Genie electrique 2500, Chemin Polytechnique, Porte S-114 Montreal, QC H3T 1J4 Canada
(ref: https://neuro.polymtl.ca/contact-us.html)
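For reference, the encrypt/decrypt pair from that Stack Overflow answer looks roughly like this (the helper names and the keyfile convention are mine, not from the original post):

```shell
# Encrypt a dataset directory into a single .tar.gz.enc; the key lives in a
# separate file that is shared out-of-band.
encrypt_dataset() {  # usage: encrypt_dataset <dir> <keyfile>
    tar czf - "$1" | openssl enc -aes-256-cbc -pbkdf2 -salt \
        -pass "file:$2" -out "$1.tar.gz.enc"
}
# And to recover the directory on the receiving side:
decrypt_dataset() {  # usage: decrypt_dataset <dir> <keyfile>
    openssl enc -d -aes-256-cbc -pbkdf2 -pass "file:$2" \
        -in "$1.tar.gz.enc" | tar xzf -
}
```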
In the future, perhaps as a different workaround, we can set up an HTTPS proxy; if the problem is a deep-packet-inspecting firewall, wrapping the ssh connection in an HTTPS one should defeat it. I believe these instructions solve that: https://stackoverflow.com/a/23616021/2898673. We can pursue that in the future if/when we need to do this again.
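If we go that route, the client-side change would be roughly an ~/.ssh/config stanza along these lines (a sketch only: it assumes an HTTPS-speaking proxy on the server's port 443 forwarding to its own sshd, and that the curator can install proxytunnel):

```
Host spineimage.ca
    # tunnel the ssh stream inside a TLS connection to port 443
    ProxyCommand proxytunnel -E -p spineimage.ca:443 -d localhost:22
```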
EDIT: the curator created an account for us on https://sfts1.gov.ns.ca/ and sent us the encrypted file, and @mguaypaq was able to download it and upload the contents to https://spineimage.ca. On Thursday, May 18th, there was a final meeting to demonstrate to the Halifax curator that their work is done, as far as it can be for now. If we need their help making edits to the data, we will be right back here unless we figure out some kind of proxy situation.
https://praxisinstitute.org wants to fund a Canada-wide spine scan sharing platform.
They were considering paying OBI as a vendor to set up a neuroimaging repository, but they had doubts about the quality of that solution, looked around for others, and have landed on asking us for help.
We've proposed a federated data sharing plan and they are interested in pursuing this line.
Needs