neuropoly / coda

Deployment and testing repository for https://www.coda-platform.com/
GNU General Public License v3.0

Deploy CODA at neuropoly #1

Closed namgo closed 1 year ago

namgo commented 1 year ago

I got a message from @jcohenadad today about the (good) option to take Bireli out of the network temporarily for usage in the CODA project so we don't have to deal with so many permission errors.

This will take Bireli out of the sysadmin team's control to some extent, but I think it's necessary given the time crunch that Louis-Francois is under. @kousu What needs to happen so we can do this? As I understand it, we've already given Louis-Francois root, but is there more we should be doing?

Is there any alerting we should decommission?

namgo commented 1 year ago

We could probably use Bireli as our toss-at-projects-that-need-root machine in the future since it's largely unused. We'd want to discuss this formally though.

kousu commented 1 year ago

We delete

https://github.com/neuropoly/computers/blob/16a6de2b6114ee49f694de51b0354320c51c141a/ansible/hosts#L19

and then we have to manually uninstall netdata and maybe some other stuff like tmpreaper and sssd (roles/neuropoly.grames) -- ansible doesn't have any way to propagate deleting that line into removing everything that line implied.

kousu commented 1 year ago

> We could probably use Bireli as our toss-at-projects-that-need-root machine in the future since it's largely unused. We'd want to discuss this formally though.

I'd rather handle "projects-that-need-root" with neuropoly/computers#461! And/or containers. podman can handle 99% of what people think they need root for but don't really.

The CODA project is an exception, because they have a multi-stage setup and ansible scripts that make strong assumptions. What we could be doing is running our own internal cloud (maybe openstack but neuropoly/computers#461 is enough) that we can point such scripts at. And also we could be working with them to see if they can adapt their ansible scripts into ansible roles that might be willing to play nicer with pre-existing ansible deployments.

namgo commented 1 year ago

I got the go-ahead from Julien to work with Louis-Francois on CODA. I think my first step will be going over the requirements with Louis-Francois, to figure out what we actually need and how to set it up.

namgo commented 1 year ago

CODA provides deployment documentation (https://github.com/coda-platform/guides-and-policies/tree/main/guides/deployment).

I misunderstood some parts of this project: we do in fact need a GPU node (which Bireli provides), and the system requirements for individual nodes are pretty steep.

I'm going to have a call with one of the developers of CODA and Louis-Francois next week.

namgo commented 1 year ago

CODA would be really difficult to provision without caprover. Their caprover config (https://github.com/coda-platform/site-deployer) might as well be a script that tells docker what to deploy and which ports go where, but I don't know if it's worth the effort/risk to try to rewrite their system in ansible.

They are planning to use ansible in the long-term for prod but are not at that point yet.

I'm going ahead and removing Bireli from the ansible inventory so I can deploy docker to it (https://github.com/neuropoly/computers/commit/344cb13fa749eae75ce3e9f12a4869355a8d9fd1).

namgo commented 1 year ago

I'm adding a new dnsmasq config, coda-resolver, which I'll invoke directly for now. It makes subdomains of .coda resolve locally. I'm not sure how well this will work with remote connections; we'll see (it won't, but I might have a workaround).

I got confused about the site-deployer and installed it directly on Bireli; I'm installing it in VMs now.

namgo commented 1 year ago

Alright, so I have a working-ish solution where Bireli resolves hub and site1 locally, but this is only going to work on Bireli specifically. If Hub and Site1 need to talk to each other this isn't going to work.

kousu commented 1 year ago

@namgo can you pull out whatever apt commands and config file edits you made and post them in here? I can probably pull it out but it'd be faster if you could remember.

Once that's here to examine, I suspect there is a way to combine it with neuropoly/computers#461! It sounds like you didn't have to make too many invasive changes in the end -- you set up libvirt (i.e. neuropoly/computers#461) and dnsmasq, and those should fit into our ansible :) The bulk of the work in doing CODA is going to be inside those VMs, and I think we can probably leave that out of ansible for the foreseeable future.

namgo commented 1 year ago

@kousu this one's fairly non-invasive, you're right, but it's not stable by any means. I'm running dnsmasq as a non-daemon process in a tmux session to ensure that each VM has a domain name addressable by Bireli.

It's not an unprivileged VM setup as I haven't given anyone libvirt group access.

Nevertheless, documenting is good, you're right!

The command is:

dnsmasq -R --interface=docker0 --except-interface=lo0 -d -C coda-resolver --bind-interfaces

with a config file coda-resolver containing:

address=/hub.coda/192.168.122.2
address=/site1.coda/192.168.122.3
interface=docker0

The two VMs were set up manually and I assigned each an address.
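For the record, a minimal sketch of how this resolver setup could be reproduced. The file name and addresses are the ones above; the `write_coda_resolver` helper, the `--test` syntax check, and the `dig` query against Docker's default bridge address are illustrative assumptions, not what was actually run on Bireli.

```shell
#!/bin/sh
# Hypothetical helper: write the coda-resolver config shown above to the
# path given as $1.
write_coda_resolver() {
    cat > "$1" <<'EOF'
address=/hub.coda/192.168.122.2
address=/site1.coda/192.168.122.3
interface=docker0
EOF
}

# Possible usage (left as comments so nothing runs on its own):
#   write_coda_resolver coda-resolver
#   dnsmasq --test -C coda-resolver      # syntax-check the config only
#   dnsmasq -R --interface=docker0 --except-interface=lo0 -d -C coda-resolver --bind-interfaces
#   dig +short hub.coda @172.17.0.1      # assumes docker0 has Docker's default address
```

Keeping the config in a generated file rather than a tmux scrollback would make it easier to move into ansible later.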

namgo commented 1 year ago

Just had a call with Louis-Francois, looks like we can up the memory and cpu count of the hub.coda VM and get rid of site1.

He'll be working on this tomorrow so I'll document further discussion here.

namgo commented 1 year ago

Small update on this: Louis-Francois will continue working on CODA today, and we'll probably have a call on Wednesday. I offered to set up some more of the system but he stressed the importance of learning how it works himself which I'm very happy with :)

namgo commented 1 year ago

We had a call with our contact at the CODA project, and he found out that caprover might not work well with snap's docker... I've redirected DNS to the other ubuntu VM I set up on bireli for this project and am re-installing docker on that from the repository.

(somewhat surprisingly? docker is weird) it works now!
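A small sketch for checking which docker a machine is using before pointing caprover at it. It relies only on the fact that snap-installed commands live under /snap/bin; the `is_snap_path` helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given executable path belongs to a snap.
is_snap_path() {
    case "$1" in
        /snap/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Possible usage (left as comments so nothing runs on its own):
#   if is_snap_path "$(command -v docker)"; then
#       echo "docker comes from snap; caprover may not work well with it"
#   fi
```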

namgo commented 1 year ago

Louis-Francois Bouchard is on vacation at the moment, so we'll resume when he gets back.

namgo commented 1 year ago

Louis-Francois and I have been working at this. We've run into a bit of a snag where the repo is called every time we initialize a container, and the repo has security checks that prevent us from working locally (ensuring "captain" is running with https enabled regardless of whether it is or isn't behind a firewall, is probably the first of many such checks).

Our two options appear to be:

Louis-Francois and I have opted for the first option because there's probably something we're not seeing.

namgo commented 1 year ago

I got a message from Louis-Francois (originally from a developer) whose suggestion and question made me wonder whether a valid ssl certificate is in fact not a requirement, and whether we only need some certificate to pass the check. I may well still be misunderstanding the system, but I am going to try to force https on captain and see what happens. Gonna try that in 20 minutes or so.

louisfb01 commented 1 year ago

This is what we get when trying to enable https (as needed) on the caprover instance:

[Image: Screenshot 2023-09-06 at 1 56 41 PM]

louisfb01 commented 1 year ago

tl;dr: We cannot enable HTTPS since we are deploying caprover locally, which ends up causing SSL issues (see below).

Right now, we are creating a single caprover instance in one VM and deploying both the hub and the site on this instance. When deploying the hub for coda, we change the hub.coda domain to http://captain.captain.localhost/ within the caprover instance, as mentioned in the caprover documentation, but we cannot enable HTTPS using localhost. This makes the CODA hub-deployer crash because SSL is not enabled.

Possible solution: Skip the SSL check for coda when deployed locally. Maybe create a branch for local deployment alternative for QA testing at Neuropoly? Question: Can it be done, and how complicated would this be?

[Image: Screenshot 2023-09-08 at 10 39 29 AM]

[Image: Screenshot 2023-09-07 at 2 33 10 PM]

namgo commented 1 year ago

Excellent writeup @louisfb01! Your framing of the solution was good and got me thinking that we might be able to import the docker/caprover deployment system into this repository and comment out the checks as a temporary measure, like you're suggesting.

I feel that before we do this, we need to get confirmation that this is the right way to go.

@louisfb01 Would you be comfortable reading through the docker deployment system and taking note of any checks like the https one? I want to make sure that there aren't going to be any further surprises :D just in case.

kousu commented 1 year ago

In the CODA deployment guide, they state

> In order to deploy the CODA platform sandbox using CapRover, you will need a registered domain name and access to DNS settings. Throughout this guide, we will use coda-platform.com as the example base domain.

We have DNS we could use: @namgo has access to Namecheap and could give you subdomains, say, *.coda.neuropoly.org. But since they use letsencrypt by default to actually get the SSL working, you will run into the problem that our network admins are skeptical about opening ports (e.g. https://github.com/neuropoly/computers/issues/320, https://github.com/neuropoly/computers/issues/337) and will drag their feet on doing it for you. If you can make the case to them clearly enough, you should be able to get them to open the ports, and then all will be well. Alternatively, maybe there is a setting in their deployment script that would let you switch to the DNS challenge, but I don't think there's a Namecheap API that could work with that, so we'd have to delegate coda.neuropoly.org to a different DNS host, and that would be pretty tricky.


I'm confused. Those instructions don't seem to address deployment behind a firewall. I must be missing something: isn't the target audience researchers working at institutions? Institutional firewalls are always strict and always block letsencrypt. I also don't understand why they suggest deploying outside of the firewall

> As an alternative, you can deploy a VM with CapRover pre-installed on DigitalOcean.

because DigitalOcean is not covered by any kind of NDA or data sharing agreement or PII protection plan that any institution would agree to. And isn't the point of CODA that everyone can keep and analyse their data locally without having to break their PII plans?


If you can grep in the CODA codebase for caprover serversetup (it might be written ["caprover", "serversetup"] or ["caprover"] + ["serversetup"] or {'caprover', 'serversetup'} or one of many variations, watch out!) and find where that happens, you should be able to patch it to add skipVerifyingDomains like the caprover instructions say?
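A rough sketch of such a search. It assumes the two words sit on the same line with at most ten non-alphanumeric characters (quotes, commas, brackets, plus signs, spaces) between them; the `find_serversetup` helper name is hypothetical, and an invocation split across lines or built from variables would still be missed.

```shell
#!/bin/sh
# Hypothetical helper: recursively search the directory $1 (default: the
# current directory) for "caprover" followed closely by "serversetup".
find_serversetup() {
    # -r recurse, -n show line numbers, -E extended regex.
    # [^[:alnum:]]{0,10} allows separators such as `", "` or `"] + ["`.
    grep -rnE 'caprover[^[:alnum:]]{0,10}serversetup' "${1:-.}"
}

# Possible usage from a checkout of one of the CODA repositories:
#   find_serversetup .
```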

louisfb01 commented 1 year ago

Update: Managed to deploy all of coda hub and site on the same caprover instance.

I had to fork most repositories to remove SSL verifications and update some npm packages. Louis also helped me with other bugs for the site-deployer with the stats-api repo, which he pushed to the main repo.

Now I only need to confirm everything works as expected!

namgo commented 1 year ago

Looks like we're going to have some trouble accessing the containers over the network, so my suggestion is to script a socat redirect from the VM's (external net) ip to the VM's docker network. I gave Louis-Francois the basic ideas and would recommend he reads up on http://www.dest-unreach.org/socat/ but if he doesn't get to it, I'll be able to.
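A minimal sketch of what that redirect could look like. The `socat_forward_cmd` helper is hypothetical and only prints the command; 192.168.122.2 is the hub.coda VM address from earlier in this thread, while the container address and ports are placeholder assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print a socat command that listens on the VM's
# external address and forwards each connection to a container address.
#   $1 = VM external IP    $2 = port to listen on
#   $3 = container IP      $4 = container port
socat_forward_cmd() {
    # fork: serve each connection in a child process;
    # reuseaddr: allow a quick restart on the same port.
    printf 'socat TCP-LISTEN:%s,bind=%s,fork,reuseaddr TCP:%s:%s\n' \
        "$2" "$1" "$3" "$4"
}

# Possible usage (left as comments so nothing runs on its own):
#   socat_forward_cmd 192.168.122.2 8080 172.17.0.2 80          # print only
#   sh -c "$(socat_forward_cmd 192.168.122.2 8080 172.17.0.2 80)"  # start forwarding
```

Printing the command first makes it easy to review before anything is exposed on the external interface.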

kousu commented 1 year ago

I believe this issue is done for now: @louisfb01 has left to pursue a company, so this is on pause. Should we archive this whole repo?

I'll open a new issue in https://github.com/neuropoly/computers/ to pull bireli back under the normal ansible fleet.

jcohenadad commented 1 year ago

> Should we archive this whole repo?

Yes, maybe we should do that indeed. Thank you