varfish-org / varfish-server

VarFish: comprehensive DNA variant analysis for diagnostics and research
MIT License

Improve documentation of docker compose deployment #79

Closed holtgrewe closed 3 years ago

holtgrewe commented 3 years ago

Is your feature request related to a problem? Please describe.
There is a lot of documentation already, but documentation specific to docker compose is missing.

Describe the solution you'd like
Add documentation (collecting the content of some emails that I sent in the comments).

Describe alternatives you've considered
N/A

Additional context
N/A

holtgrewe commented 3 years ago

It's good to hear that you were successful in using our docker compose file to install VarFish. Could you provide an estimate of how long it took you to get everything set up (besides the probably relatively long download)?

The answer below is pretty verbose. I hope that it is helpful and answers your questions.

#1

Docker (compose) creates a private network on the internal host. This is identified as "network varfish" in the compose file and shows as varfish-docker-compose_varfish in "docker network ls".

# docker network ls
NETWORK ID     NAME                             DRIVER    SCOPE
039f7e3430df   bridge                           bridge    local
a2ae838af4a0   host                             host      local
7b4c1ce4b8d8   none                             null      local
3c65d44f3ee3   varfish-docker-compose_varfish   bridge    local

As a consequence, you will find that by default only ports 80 and 443 of the containers are exposed on the host machine (the sshd entries in the listing below belong to the host itself).

# lsof -i -sTCP:LISTEN
COMMAND      PID USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
sshd         275 root    3u  IPv4 822142035      0t0  TCP *:ssh (LISTEN)
sshd         275 root    4u  IPv6 822142037      0t0  TCP *:ssh (LISTEN)
docker-pr 205513 root    4u  IPv4 854059572      0t0  TCP *:https (LISTEN)
docker-pr 205527 root    4u  IPv4 854049119      0t0  TCP *:http (LISTEN)

What exactly are you trying to achieve here?

Changing ports does not make things any more secure IMO, but it is simple enough. Change the lines here: https://github.com/bihealth/varfish-docker-compose/blob/main/docker-compose.yml#L7. The syntax is "<host port>:<container port>", so this should do what you want:

    ports:
      - "8080:80"
      - "8443:443"

If you only want to allow access from certain IP ranges or subnetworks, you can do this by configuring your server firewall appropriately (firewalld on CentOS and ufw on Debian/Ubuntu are both not that complicated once you have invested some time in learning them). Alternatively, you can achieve something similar by configuring the "traefik..." labels on the varfish-web container appropriately.

https://github.com/bihealth/varfish-docker-compose/blob/main/docker-compose.yml#L40

I've never used that traefik feature, but here it is:

https://doc.traefik.io/traefik/middlewares/ipwhitelist/

You should also be able to add HTTP Auth Basic fairly easily as a "first line of defense":

https://doc.traefik.io/traefik/middlewares/basicauth/

That being said, traefik configuration is a bit frustrating in my experience but the traefik community is large (and documentation is ... complete - there are many features) so you should be able to find your way there as well.
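For the Basic Auth option, one detail is easy to miss: Traefik expects "user:hash" entries as produced by htpasswd, and every "$" in the hash has to be doubled when the entry is written into a docker compose file, because compose would otherwise try to interpolate it as a variable. The following is a minimal sketch of that escaping step only. The user name and hash are made up, and the label name in the comment reflects the Traefik v2 documentation linked above as far as I can tell, so please verify it there before relying on it.

```shell
# Double every "$" in an htpasswd line read from stdin so that
# docker compose does not treat it as variable interpolation.
escape_for_compose() {
    sed -e 's/\$/$$/g'
}

# Made-up example entry. A real one can be generated with
#   htpasswd -nb USER PASSWORD
# and the escaped result would go into a label along the lines of
#   traefik.http.middlewares.NAME.basicauth.users=<escaped entry>
# (NAME is a placeholder; the label name is an assumption to verify).
printf '%s\n' 'alice:$apr1$abcdefgh$0123456789abcdefghijkl' | escape_for_compose
```

The function only rewrites text, so it can be tried safely on any machine before touching the compose file.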

#2

What do you mean by project data? ;) The VarFish docker compose has two containers that store relevant amounts of data.

First of all, you can bind-mount any directory to the correct place by changing the "volumes/.../source" path, e.g., the following would mount /wher/ever on the host to /data in the minio container:

    volumes:
      - type: bind
        source: /wher/ever
        target: /data

You have to restart the docker compose site for such changes to take effect.

However, the single VarFish postgres database will host the majority of the data. The background database tables are only written during the initial import, so you might get away with placing them on storage that is accessed via the network. Insertion performance depends on the disk throughput and the latency of "fsync()", so inserting data will be slower over the network than locally.

Now, is it possible to separate the storage locations of different tables in one database? In principle: yes. You can define something called "tablespaces" in postgres and then explicitly specify where the data for a table is stored. You can also move tables into another tablespace after creation (I just learned this and was positively surprised by postgres -- once again!):

https://pgdash.io/blog/tablespaces-postgres.html

So, how would you go about doing this? Look at the following lines of the postgres container:

      - type: bind
        source: ./volumes/postgres/data
        target: /var/lib/postgresql/data

This is the main $PGDATA and we have to keep it. However, we can simply add a second bind mount:

      - type: bind
        source: /location/of/fcoe-array
        target: /data/fcoe-array

After restarting the postgres container, /data/fcoe-array exists inside it and points to the given location. You can look around inside a container by attaching to it like so:

host # docker exec -it varfish-docker-compose_postgres_1 bash -i
root@#

If you want a "psql" shell to your database, I would recommend opening it through the varfish-web container: the database connection parameters are already set up there and the varfish user is configured as a superuser, so you can connect to the postgres database container easily.

host # docker exec -it varfish-docker-compose_varfish-web_1 bash -i
root@7a3fdb337fae # cd /usr/src/app
root@7a3fdb337fae # python manage.py dbshell
psql (11.10 (Debian 11.10-0+deb10u1), server 12.5 (Debian 12.5-1.pgdg100+1))
WARNING: psql major version 11, server major version 12.
         Some psql features might not work.
Type "help" for help.

varfish=# \db+
                                 List of tablespaces
    Name    |  Owner  | Location | Access privileges | Options |  Size  | Description
------------+---------+----------+-------------------+---------+--------+-------------
 pg_default | varfish |          |                   |         | 397 GB |
 pg_global  | varfish |          |                   |         | 623 kB |
(2 rows)

On the psql shell you can now issue the commands to create the tablespace etc. Note that psql will connect you to the postgres server running in the postgres container so the bind mount has to be in the postgres container and not the varfish-web container. The varfish-web container will not be aware of the bind mount!
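As a rough sketch of what "create the tablespace" could look like, the snippet below only prints the SQL statement so that it can be reviewed and then pasted into the psql shell. The tablespace name "fcoe" is made up for illustration; the path is the bind-mount target from the example above. PostgreSQL expects the directory to exist, to be empty, and to be owned by the operating system user that runs the server inside the postgres container.

```shell
# Assumptions: "fcoe" is a hypothetical tablespace name; the directory is
# the bind-mount target inside the postgres container from the text above.
TABLESPACE_NAME="fcoe"
TABLESPACE_DIR="/data/fcoe-array"

# Print the SQL that registers the directory as a tablespace.
# Nothing is executed against the database here.
create_tablespace_sql() {
    printf "CREATE TABLESPACE %s LOCATION '%s';\n" "$TABLESPACE_NAME" "$TABLESPACE_DIR"
}

create_tablespace_sql
```

After running the statement in psql, "\db+" should list the new tablespace with the location given.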

You can use "\dt" on the psql shell to list all tables. You will see that the variant tables are split into partitions of 1024 entries each to improve query performance. You probably want to keep these on the NVME. The bulk of the data should be in the frequencies_* tables. Judging from the raw import, the following datasets are largest.

  890684  GRCh37/ExAC
 1064437  GRCh37/knowngeneaa
 2452525  GRCh37/thousand_genomes
 2699901  GRCh37/gnomAD_exomes
14901824  GRCh37/dbSNP
15258705  GRCh37/extra-annos
35719247  GRCh37/gnomAD_genomes

You probably want to move the following worst offenders to the FCOE array following the instructions from above.

conservation_knowngeneaa
dbsnp_dbsnp
frequencies_*
extra_annos_*
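Moving the tables listed above could be scripted roughly as follows. This is a sketch under assumptions, not a tested migration: the tablespace name "fcoe" is hypothetical and must already exist, and the tables are assumed to live in the "public" schema. The first function prints a query that expands the wildcard patterns into actual table names; the second turns a list of table names into ALTER TABLE statements. Both only print SQL for review.

```shell
# Hypothetical tablespace name, assumed to have been created already.
TABLESPACE_NAME="fcoe"

# Print a query that lists the tables matching the names and patterns above.
# Assumption: the tables are in the "public" schema.
list_tables_sql() {
    cat <<'EOF'
SELECT tablename FROM pg_tables
WHERE schemaname = 'public'
  AND (tablename IN ('conservation_knowngeneaa', 'dbsnp_dbsnp')
       OR tablename LIKE 'frequencies\_%'
       OR tablename LIKE 'extra\_annos\_%');
EOF
}

# Read table names from stdin (one per line) and print one
# ALTER TABLE ... SET TABLESPACE statement per non-empty line.
move_tables_sql() {
    while IFS= read -r table; do
        [ -n "$table" ] || continue
        printf 'ALTER TABLE "%s" SET TABLESPACE %s;\n' "$table" "$TABLESPACE_NAME"
    done
}

# Example with the two fully named tables from the list above.
printf '%s\n' conservation_knowngeneaa dbsnp_dbsnp | move_tables_sql
```

Note that ALTER TABLE ... SET TABLESPACE moves only the table itself; indexes stay where they are unless they are moved separately with ALTER INDEX ... SET TABLESPACE. The table is locked while its data is being copied.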

This requires copying of a lot of data. I'd be interested to learn how long it took in your environment.

Please note that there is no 1:1 correspondence between tables and files in postgres (you might be used to this from MySQL/MariaDB). I would expect that data that is moved from the default tablespace to your FCOE one will not be freed immediately on the NVME one. See the postgres documentation on VACUUM for more information.

https://www.postgresql.org/docs/12/sql-vacuum.html

I'm pretty certain that you want to do "VACUUM FULL".
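To see whether space on the NVME was actually freed, it may help to compare tablespace sizes before and after the move (and again after any VACUUM FULL). The sketch below prints a query for that, using the standard pg_tablespace catalog and the pg_tablespace_size() and pg_size_pretty() functions; it is the scriptable equivalent of the "\db+" output shown earlier. Keep in mind that VACUUM FULL rewrites tables, holds an exclusive lock on each table while doing so, and temporarily needs extra disk space.

```shell
# Print a query that reports the on-disk size of every tablespace.
# Nothing is executed here; paste the output into the psql shell.
tablespace_sizes_sql() {
    cat <<'EOF'
SELECT spcname, pg_size_pretty(pg_tablespace_size(oid)) AS size
FROM pg_tablespace
ORDER BY spcname;
EOF
}

tablespace_sizes_sql
```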