Closed holtgrewe closed 3 years ago
It's good to hear that you were successful in using our docker compose file to install VarFish. Could you provide an estimate of how long it took you to get everything set up (besides the probably relatively long download)?
The answer below is pretty verbose. I hope that it is helpful and answers your questions.
#1
Docker (compose) creates a private network on the internal host. This is identified as "network varfish" in the compose file and shows as varfish-docker-compose_varfish in "docker network ls".
# docker network ls
NETWORK ID NAME DRIVER SCOPE
039f7e3430df bridge bridge local
a2ae838af4a0 host host local
7b4c1ce4b8d8 none null local
3c65d44f3ee3 varfish-docker-compose_varfish bridge local
As a result, you will find that by default, apart from SSH, only ports 80 and 443 are open on the host machine.
# lsof -i -sTCP:LISTEN
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sshd 275 root 3u IPv4 822142035 0t0 TCP *:ssh (LISTEN)
sshd 275 root 4u IPv6 822142037 0t0 TCP *:ssh (LISTEN)
docker-pr 205513 root 4u IPv4 854059572 0t0 TCP *:https (LISTEN)
docker-pr 205527 root 4u IPv4 854049119 0t0 TCP *:http (LISTEN)
What exactly are you trying to achieve here?
Changing ports does not make things any more secure IMO but is simple enough. Change the lines here: https://github.com/bihealth/varfish-docker-compose/blob/main/docker-compose.yml#L7. The syntax is "host port:container port", e.g.:
ports:
  - "8080:80"
  - "8443:443"
If you only want to allow access from certain IP ranges or subnetworks, you can do this by configuring your server firewall appropriately (firewalld on CentOS and ufw on Debian/Ubuntu are both not that complicated once you have invested some time in learning them). You can also achieve something similar by configuring the "traefik..." labels on the varfish-web container appropriately.
https://github.com/bihealth/varfish-docker-compose/blob/main/docker-compose.yml#L40
I've never used that traefik feature, but here it is:
https://doc.traefik.io/traefik/middlewares/ipwhitelist/
You should also be able to add HTTP Auth Basic fairly easily as a "first line of defense":
https://doc.traefik.io/traefik/middlewares/basicauth/
That being said, traefik configuration is a bit frustrating in my experience but the traefik community is large (and documentation is ... complete - there are many features) so you should be able to find your way there as well.
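As a rough sketch of what such labels could look like (untested on my side, and assuming Traefik v2 as in the linked documentation): the small helper below only prints two label lines that you could paste under the "labels:" key of the varfish-web service. The function name, the middleware name "varfish-ipwhitelist", and the router name you pass in are all made up for illustration; the router name has to match the one already used in the existing "traefik.http.routers.<name>..." labels.

```shell
#!/usr/bin/env bash
# Print Traefik v2 labels that restrict one router to the given source ranges.
# $1: router name (must match the existing traefik.http.routers.<name> labels)
# $2: comma-separated list of CIDR ranges
# The middleware name below is a hypothetical placeholder.
traefik_ipwhitelist_labels() {
  local router="$1" ranges="$2" middleware="varfish-ipwhitelist"
  # Define the middleware and its allowed source ranges.
  printf '      - "traefik.http.middlewares.%s.ipwhitelist.sourcerange=%s"\n' "$middleware" "$ranges"
  # Attach the middleware to the router.
  printf '      - "traefik.http.routers.%s.middlewares=%s"\n' "$router" "$middleware"
}
```

For example, traefik_ipwhitelist_labels my-router "192.168.0.0/16,10.0.0.0/8" prints the two lines for a router called "my-router". The containers have to be recreated afterwards for the labels to be picked up.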
#2
What do you mean by project data? ;) The VarFish docker compose has two containers that store relevant amounts of data.
First of all, you can bind-mount any directory to the correct place by changing the "volumes/.../source" path, e.g., the following would mount /wher/ever on the host to /data in the minio container:
volumes:
  - type: bind
    source: /wher/ever
    target: /data
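You have to recreate the containers of the docker compose site for such changes to take effect (e.g., with "docker-compose up -d"; a plain "docker-compose restart" does not pick up changes to the compose file).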
However, the single VarFish postgres database will host the majority of the data. The background database tables are only written on the initial import, so you might get away with placing them on storage accessible via the network. The performance of insertion depends on the disk throughput and the latency of "fsync()", so inserting data will be slower over the network than locally.
Now, is it possible to separate the storage locations of different tables in one database? In principle: yes. You can define something called "tablespaces" in postgres and then explicitly specify where the data for a table is stored. You can also move tables into another tablespace after creation (I just learned this and was positively surprised by postgres once again!):
https://pgdash.io/blog/tablespaces-postgres.html
So, how would you go about doing this? Look at the following lines of the postgres container:
  - type: bind
    source: ./volumes/postgres/data
    target: /var/lib/postgresql/data
This is the main $PGDATA and we have to keep it. However, we can simply add a second bind mount:
  - type: bind
    source: /location/of/fcoe-array
    target: /data/fcoe-array
After restarting the postgres container you will have /data/fcoe-array inside it, and it will access the given location. You can look around a container by attaching to it like so:
host # docker exec -it varfish-docker-compose_postgres_1 bash -i
root@
If you want a "psql" shell to your database, I would recommend opening it through the varfish-web container, as the database connection parameters are already set up there and the varfish user is configured as a superuser, so you can connect to the postgres database container easily.
host # docker exec -it varfish-docker-compose_varfish-web_1 bash -i
root@7a3fdb337fae # cd /usr/src/app
root@7a3fdb337fae # python manage.py dbshell
psql (11.10 (Debian 11.10-0+deb10u1), server 12.5 (Debian 12.5-1.pgdg100+1))
WARNING: psql major version 11, server major version 12.
Some psql features might not work.
Type "help" for help.
varfish=# \db+
List of tablespaces
Name | Owner | Location | Access privileges | Options | Size | Description
------------+---------+----------+-------------------+---------+--------+-------------
pg_default | varfish | | | | 397 GB |
pg_global | varfish | | | | 623 kB |
(2 rows)
On the psql shell you can now issue the commands to create the tablespace etc. Note that psql will connect you to the postgres server running in the postgres container so the bind mount has to be in the postgres container and not the varfish-web container. The varfish-web container will not be aware of the bind mount!
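A minimal sketch of those commands, wrapped in a small shell helper that only prints the SQL (nothing is sent to the database by the function itself). The helper name and the tablespace name "fcoe" are placeholders of mine; /data/fcoe-array is the bind mount target from above. Note that postgres requires the location to be an existing, empty directory owned by the operating system user that runs the server (presumably "postgres" inside the container, but please check).

```shell
#!/usr/bin/env bash
# Print SQL that creates a tablespace and moves a single table into it.
# $1: tablespace name (hypothetical, e.g. "fcoe")
# $2: directory as seen from INSIDE the postgres container
# $3: name of the table to move
create_tablespace_sql() {
  local name="$1" location="$2" table="$3"
  cat <<SQL
CREATE TABLESPACE ${name} LOCATION '${location}';
ALTER TABLE ${table} SET TABLESPACE ${name};
SQL
}
```

You can paste the output of, e.g., create_tablespace_sql fcoe /data/fcoe-array dbsnp_dbsnp into the psql shell from above. Moving a table takes an exclusive lock on it and copies all of its data, so do this while nobody is using the instance. Indexes are not moved along with the table; they need a separate "ALTER INDEX ... SET TABLESPACE ..." if you want them on the other storage as well.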
You can use "\dt" on the psql shell to list all tables. You will see that the variant tables are split into partitions of 1024 entries each to improve query performance. You probably want to keep these on the NVMe. The bulk of the data should be in the frequencies_* tables. Judging from the raw import, the following datasets are largest.
890684 GRCh37/ExAC
1064437 GRCh37/knowngeneaa
2452525 GRCh37/thousand_genomes
2699901 GRCh37/gnomAD_exomes
14901824 GRCh37/dbSNP
15258705 GRCh37/extra-annos
35719247 GRCh37/gnomAD_genomes
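The numbers above come from the raw import files, not from the database. To check what actually takes up space in postgres, something like the following query should work; the helper only prints the SQL so that you can paste it into the psql shell. The helper name is mine, and I am assuming the tables live in the default "public" schema.

```shell
#!/usr/bin/env bash
# Print SQL listing the N largest tables (including their indexes and TOAST
# data) in the "public" schema. $1: number of rows to show (default: 20).
largest_tables_sql() {
  local limit="${1:-20}"
  cat <<SQL
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r' AND n.nspname = 'public'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT ${limit};
SQL
}
```

Partitions show up as individual tables in this listing, so the partitioned variant tables appear as many smaller entries rather than one large one.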
You probably want to move the following worst offenders to the FCOE array following the instructions from above.
conservation_knowngeneaa
dbsnp_dbsnp
frequencies_*
extra_annos_*
This requires copying of a lot of data. I'd be interested to learn how long it took in your environment.
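Since frequencies_* and extra_annos_* stand for several tables each, you may not want to type every ALTER TABLE by hand. Here is a sketch that lets psql generate and run the statements via its "\gexec" command. It assumes that a tablespace (here called "fcoe", a made-up name) already exists and that the tables are in the "public" schema; the helper only prints the text to paste into psql.

```shell
#!/usr/bin/env bash
# Print a psql snippet that moves every table whose name matches a regular
# expression into the given tablespace.
# $1: name of an EXISTING tablespace (hypothetical, e.g. "fcoe")
# $2: POSIX regular expression matched against the table names
move_tables_sql() {
  local tablespace="$1" pattern="$2"
  # "\\gexec" becomes "\gexec" in the output; psql then executes each
  # generated row as its own statement.
  cat <<SQL
SELECT format('ALTER TABLE %I.%I SET TABLESPACE %I', schemaname, tablename, '${tablespace}')
FROM pg_tables
WHERE schemaname = 'public' AND tablename ~ '${pattern}'
\\gexec
SQL
}
```

For example, move_tables_sql fcoe '^(frequencies_|extra_annos_)' prints a snippet covering both groups. It is probably a good idea to run the SELECT without the "\gexec" line first to see which statements would be executed.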
Please note that there is no 1:1 correspondence between tables and files in postgres (you might be used to this from MySQL/MariaDB). I would expect that data that is moved from the default tablespace to your FCOE one will not be freed immediately on the NVME one. See the postgres documentation on VACUUM for more information.
https://www.postgresql.org/docs/12/sql-vacuum.html
I'm pretty certain that you want to do "VACUUM FULL".
Is your feature request related to a problem? Please describe. There's a lot of documentation already, but documentation specific to docker compose is missing.
Describe the solution you'd like Add documentation (collecting content of some emails that I sent in the comments).
Describe alternatives you've considered N/A
Additional context N/A