Closed: forus closed this issue 2 years ago
I have started fixing some of the issues in this PR: https://github.com/opentargets/genetics-backend/pull/10
@mkarmona Hi Miguel! Do you have any internal knowledge about this topic that you can share?
@forus, yes I do. They changed some configuration that constrains how far the number of partitions for a table can grow. To fix it you might want to change these lines:
If you look at it carefully you will see that the partition field set has been reduced to just the chromosome. Having the source_id
column in the PARTITION BY clause was not one of my better ideas. I hope this helps, @forus.
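For illustration, here is a minimal sketch of the kind of DDL change described above, wrapped in clickhouse-client; the table and column names are hypothetical, not the actual genetics-backend schema:

clickhouse-client --query "CREATE DATABASE IF NOT EXISTS ot"
clickhouse-client --query "
CREATE TABLE IF NOT EXISTS ot.v2g_example (
    source_id String,
    chrom     String,
    position  UInt32,
    score     Float64
) ENGINE = MergeTree()
-- the old scheme was PARTITION BY (source_id, chrom), which let one bulk INSERT touch more than 100 partitions
PARTITION BY (chrom)
ORDER BY (chrom, position)"

With the chromosome as the only partition key, a bulk INSERT can create at most ~25 partitions (22 autosomes plus X, Y and MT), safely below the 100-partition limit (max_partitions_per_insert_block) that produces the error reported below.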
@mkarmona Thanks! Ah, I see. That would solve the Too many partitions for single INSERT block (more than 100)
issue. What about the rest? Have you run into the per-query memory limit? How much RAM do you allocate to an instance for loading the data?
@forus I use an internal script to spin up a new GCloud VM, so what you are including there is not really necessary for us. I normally use a 16-core highmem machine (> 100 GB of memory) for the image building and baking, but I would use an 8-core highmem GCloud machine type for this.
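(As a sketch of that machine shape in gcloud terms; the instance name, zone, and script filename are placeholders, and this assumes the template below is passed as a GCE startup script:)

gcloud compute instances create genetics-backend-vm \
    --zone europe-west1-d \
    --machine-type n1-highmem-8 \
    --image-family debian-9 --image-project debian-cloud \
    --metadata-from-file startup-script=machine-template.sh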
Here is my script template for it (this is for the next release we are building):
#!/bin/bash
OT_RCFILE=/etc/opentargetsrc
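# $OT_RCFILE is a run-once marker: if it already exists, this startup script has run before and exits early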
if [ -f "$OT_RCFILE" ]; then
echo "$OT_RCFILE exist so machine is already configured"
exit 0
fi
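# bring the base image up to date, then install the toolchain; openjdk-8 from stretch-backports is required by Elasticsearch 5.x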
apt-get update && DEBIAN_FRONTEND=noninteractive \
apt-get \
-o Dpkg::Options::="--force-confnew" \
--force-yes \
-fuy \
dist-upgrade && \
DEBIAN_FRONTEND=noninteractive \
apt-get \
-o Dpkg::Options::="--force-confnew" \
--force-yes \
-fuy \
-t stretch-backports install openjdk-8-jdk-headless net-tools wget uuid-runtime python-pip python-dev libyaml-dev httpie jq gawk zip unzip tmux git build-essential less silversearcher-ag dirmngr psmisc
# pip install -U pip
# apt-get remove --force-yes -fuy python-pip
pip install elasticsearch-loader
elastic_version=5.6.15
elastic_deb=https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-${elastic_version}.deb
kibana_deb=https://artifacts.elastic.co/downloads/kibana/kibana-${elastic_version}-amd64.deb
# instance_name=$(http --ignore-stdin --check-status 'http://metadata.google.internal/computeMetadata/v1/instance/name' "Metadata-Flavor:Google" -p b --pretty none)
cluster_id=$(uuidgen -r)
# cluster_id=$instance_name
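# use a random UUID as the cluster name so this node cannot accidentally join another Elasticsearch cluster on the network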
(cd /tmp; \
wget --no-check-certificate $elastic_deb; \
dpkg -i elasticsearch-${elastic_version}.deb; \
rm -f elasticsearch-${elastic_version}.deb)
cat <<EOF > /etc/elasticsearch/elasticsearch.yml
cluster.name: ${cluster_id}
network.host: 0.0.0.0
http.port: 9200
bootstrap.memory_lock: true
repositories.url.allowed_urls: ["https://s3-eu-west-1.amazonaws.com/*","https://storage.googleapis.com/*"]
reindex.remote.whitelist: 10.*.*.*:*, _local_:*
EOF
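# fixed 16 GB heap for Elasticsearch; the usual guidance is to keep the heap at no more than half of system RAM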
cat <<EOF > /etc/elasticsearch/jvm.options
-Xms16g
-Xmx16g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
EOF
cat <<EOF > /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536
* soft memlock unlimited
* hard memlock unlimited
EOF
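# unlimited memlock is required for bootstrap.memory_lock=true (set in elasticsearch.yml above) to pin the heap in RAM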
cat <<EOF > /etc/sysctl.conf
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.ip_forward = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
kernel.randomize_va_space = 1
fs.file-max = 65535
kernel.pid_max = 65536
net.ipv4.ip_local_port_range = 2000 65000
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_max_syn_backlog = 3240000
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_tw_buckets = 1440000
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_congestion_control = cubic
vm.swappiness = 1
net.ipv4.tcp_tw_reuse = 1
EOF
# set all sysctl configurations
sysctl -p
echo disable swap, disable transparent hugepages, set noop I/O scheduler
swapoff -a
echo 'never' | tee /sys/kernel/mm/transparent_hugepage/enabled
echo 'noop' | tee /sys/block/sda/queue/scheduler
echo "block/sda/queue/scheduler = noop" >> /etc/sysfs.conf
echo "kernel/mm/transparent_hugepage/enabled = never" >> /etc/sysfs.conf
sed -i 's/\#LimitMEMLOCK=infinity/LimitMEMLOCK=infinity/g' /usr/lib/systemd/system/elasticsearch.service
sed -i '46iLimitMEMLOCK=infinity' /usr/lib/systemd/system/elasticsearch.service
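# the first sed uncomments LimitMEMLOCK=infinity if the template line is present; the second unconditionally inserts it at line 46 of the unit file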
systemctl daemon-reload
echo install elasticsearch plugins repository-gcs, analysis-icu and discovery-gce
(cd /usr/share/elasticsearch/bin; \
./elasticsearch-plugin install repository-gcs; \
./elasticsearch-plugin install analysis-icu; \
./elasticsearch-plugin install discovery-gce)
echo start elasticsearch
systemctl enable elasticsearch
# /etc/init.d/elasticsearch start
systemctl start elasticsearch
echo install kibana
(cd /tmp; \
wget --no-check-certificate $kibana_deb; \
dpkg -i kibana-${elastic_version}-amd64.deb; \
rm -f kibana-${elastic_version}-amd64.deb)
cat <<EOF > /etc/kibana/kibana.yml
server.port: 5601
server.host: "localhost"
elasticsearch.url: "http://127.0.0.1:9200"
EOF
echo start kibana
systemctl enable kibana
sleep 3 && systemctl start kibana
echo install clickhouse
echo "deb http://repo.yandex.ru/clickhouse/deb/stable/ main/" > /etc/apt/sources.list.d/clickhouse.list
apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4
apt-get update && DEBIAN_FRONTEND=noninteractive \
apt-get \
-o Dpkg::Options::="--force-confnew" \
--force-yes \
-fuy \
dist-upgrade && \
DEBIAN_FRONTEND=noninteractive \
apt-get \
-o Dpkg::Options::="--force-confnew" \
--force-yes \
-fuy \
install clickhouse-client clickhouse-server
service clickhouse-server stop
/etc/init.d/clickhouse-server stop
killall clickhouse-server && sleep 5
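# make sure no clickhouse-server process survives before its configuration files are replaced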
cat <<EOF > /etc/clickhouse-server/config.xml
<?xml version="1.0"?>
<yandex>
<logger>
<level>warning</level>
<log>/var/log/clickhouse-server/clickhouse-server.log</log>
<errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
<size>1000M</size>
<count>7</count>
</logger>
<display_name>ot-genetics</display_name>
<http_port>8123</http_port>
<tcp_port>9000</tcp_port>
<interserver_http_port>9009</interserver_http_port>
<listen_host>::</listen_host>
<listen_host>0.0.0.0</listen_host>
<listen_try>0</listen_try>
<listen_reuse_port>1</listen_reuse_port>
<listen_backlog>256</listen_backlog>
<max_connections>2048</max_connections>
<keep_alive_timeout>5</keep_alive_timeout>
<max_concurrent_queries>256</max_concurrent_queries>
<max_open_files>262144</max_open_files>
<uncompressed_cache_size>17179869184</uncompressed_cache_size>
<mark_cache_size>17179869184</mark_cache_size>
<!-- Path to data directory, with trailing slash. -->
<path>/var/lib/clickhouse/</path>
<tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
<user_files_path>/var/lib/clickhouse/user_files/</user_files_path>
<users_config>users.xml</users_config>
<default_profile>default</default_profile>
<!-- <system_profile>default</system_profile> -->
<default_database>default</default_database>
<umask>022</umask>
<zookeeper incl="zookeeper-servers" optional="true" />
<macros incl="macros" optional="true" />
<dictionaries_config>*_dictionary.xml</dictionaries_config>
<builtin_dictionaries_reload_interval>3600</builtin_dictionaries_reload_interval>
<max_session_timeout>3600</max_session_timeout>
<default_session_timeout>60</default_session_timeout>
<distributed_ddl>
<path>/clickhouse/task_queue/ddl</path>
</distributed_ddl>
<format_schema_path>/var/lib/clickhouse/format_schemas/</format_schema_path>
</yandex>
EOF
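# the users.xml below is what lifts the load-time limits: ~100 GB of memory per query, with sorts and GROUP BYs spilling to disk above ~80 GB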
cat <<EOF > /etc/clickhouse-server/users.xml
<?xml version="1.0"?>
<yandex>
<profiles>
<default>
<max_memory_usage>100000000000</max_memory_usage>
<max_bytes_before_external_sort>80000000000</max_bytes_before_external_sort>
<max_bytes_before_external_group_by>80000000000</max_bytes_before_external_group_by>
<use_uncompressed_cache>1</use_uncompressed_cache>
<load_balancing>random</load_balancing>
</default>
<readonly>
<readonly>128</readonly>
</readonly>
</profiles>
<users>
<default>
<password></password>
<networks incl="networks" replace="replace">
<ip>::/0</ip>
</networks>
<profile>default</profile>
<quota>default</quota>
</default>
<readonly>
<password></password>
<networks incl="networks" replace="replace">
<ip>::1</ip>
<ip>127.0.0.1</ip>
</networks>
<profile>readonly</profile>
<quota>default</quota>
</readonly>
</users>
<quotas>
<default>
<interval>
<duration>3600</duration>
<queries>0</queries>
<errors>0</errors>
<result_rows>0</result_rows>
<read_rows>0</read_rows>
<execution_time>0</execution_time>
</interval>
</default>
</quotas>
</yandex>
EOF
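# external dictionary 'v2gw': per-source weights for V2G scoring, keyed on source_id and read from a JSON file fetched below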
cat <<EOF > /etc/clickhouse-server/v2g_weights_dictionary.xml
<dictionaries>
<dictionary>
<name>v2gw</name>
<source>
<file>
<path>/etc/clickhouse-server/dictionaries/v2g_scoring_source_weights.json</path>
<format>JSONEachRow</format>
</file>
</source>
<layout>
<complex_key_hashed />
</layout>
<lifetime>600</lifetime>
<structure>
<key>
<attribute>
<name>source_id</name>
<type>String</type>
</attribute>
</key>
<attribute>
<name>weight</name>
<type>Float64</type>
<null_value>0.0</null_value>
</attribute>
</structure>
</dictionary>
</dictionaries>
EOF
mkdir /etc/clickhouse-server/dictionaries
echo copying dictionaries from gcloud to clickhouse dictionaries folder
gsutil cp gs://genetics-portal-data/lut/v2g_scoring_source_weights.190521.json \
/etc/clickhouse-server/dictionaries/v2g_scoring_source_weights.json
chown -R clickhouse:clickhouse /etc/clickhouse-server/dictionaries/
systemctl enable clickhouse-server
systemctl start clickhouse-server
echo "starting clickhouse... done."
echo install stackdriver
(cd /root && \
curl -O https://repo.stackdriver.com/stack-install.sh && \
bash stack-install.sh --write-gcm)
(cd /root && \
curl -sSO https://dl.google.com/cloudagents/install-logging-agent.sh && \
sudo bash install-logging-agent.sh)
service stackdriver-agent restart
echo configure stackdriver logging for elasticsearch and clickhouse
cat <<EOF > /etc/google-fluentd/config.d/elasticsearch.conf
<source>
@type tail
format none
path /var/log/elasticsearch/*.log
pos_file /var/lib/google-fluentd/pos/elasticsearch.pos
read_from_head true
tag elasticsearch
</source>
EOF
cat <<EOF > /etc/google-fluentd/config.d/clickhouse.conf
<source>
@type tail
format none
path /var/log/clickhouse-server/clickhouse-server.*.log
pos_file /var/lib/google-fluentd/pos/clickhouse.pos
read_from_head true
tag clickhouse
</source>
EOF
cat <<EOF > /etc/tmux.conf
set -g status "on"
unbind C-b
set-option -g prefix M-x
bind-key M-x send-prefix
bind | split-window -h
bind - split-window -v
bind x killp
unbind '"'
unbind %
bind r source-file ~/.tmux.conf
bind -n M-S-Left select-window -p
bind -n M-S-Right select-window -n
bind -n M-Left select-pane -L
bind -n M-Right select-pane -R
bind -n M-Up select-pane -U
bind -n M-Down select-pane -D
set-option -g allow-rename off
set -g mouse on
set -g pane-border-fg black
set -g pane-active-border-fg brightred
set -g status-justify left
set -g status-interval 2
setw -g window-status-format " #F#I:#W#F "
setw -g window-status-current-format " #F#I:#W#F "
# setw -g window-status-format "#[fg=magenta]#[bg=black] #I #[bg=cyan]#[fg=colour8] #W "
# setw -g window-status-current-format "#[bg=brightmagenta]#[fg=colour8] #I #[fg=colour8]#[bg=colour14] #W "
setw -g window-status-current-attr dim
setw -g window-status-attr reverse
set -g status-left ''
set-option -g visual-activity off
set-option -g visual-bell off
set-option -g visual-silence off
set-window-option -g monitor-activity off
set-option -g bell-action none
setw -g mode-attr bold
set -g status-position bottom
set -g status-attr dim
set -g status-left ''
# set -g status-right '#[fg=colour233,bg=colour245,bold] %d/%m #[fg=colour233,bg=colour245,bold] %H:%M:%S '
set -g status-right-length 50
set -g status-left-length 20
setw -g window-status-current-attr bold
# setw -g window-status-current-format ' #I#[fg=colour250]:#[fg=colour255]#W#[fg=colour50]#F '
setw -g window-status-attr none
# setw -g window-status-format ' #I#[fg=colour237]:#[fg=colour250]#W#[fg=colour244]#F '
setw -g window-status-bell-attr bold
setw -g window-status-bell-fg colour255
setw -g window-status-bell-bg colour1
set -g message-attr bold
set -g default-terminal "screen-256color" # Setting the correct term
set -g status-bg default # transparent
set -g status-fg magenta
set -g status-attr default
setw -g window-status-fg blue
setw -g window-status-bg default
setw -g window-status-attr dim
setw -g window-status-current-fg brightred
setw -g window-status-current-bg default
setw -g window-status-current-attr bright
setw -g window-status-bell-bg red
setw -g window-status-bell-fg white
setw -g window-status-bell-attr bright
setw -g window-status-activity-bg blue
setw -g window-status-activity-fg white
setw -g window-status-activity-attr bright
set -g pane-border-fg white
set -g pane-border-bg default
set -g pane-active-border-fg brightblack
set -g pane-active-border-bg default
set -g message-fg default
set -g message-bg default
set -g message-attr bright
EOF
service google-fluentd restart
echo "do not forget to create the tables and load the data in"
backend_version=19.05.05
wget "https://github.com/opentargets/genetics-backend/archive/${backend_version}.zip"
unzip "${backend_version}.zip"
cd "genetics-backend-${backend_version}/loaders/clickhouse"
bash create_and_load_everything_from_scratch.sh
echo touching $OT_RCFILE
echo "backend_version=${backend_version}" > $OT_RCFILE
date >> $OT_RCFILE
We're revamping the loading of genetics and this issue appears to have been resolved.
I get all kinds of errors while loading data into the ot database with the create_and_load_everything_from_scratch.sh script when I use the default settings of ClickHouse (v19.5.3.8). Issues discovered so far:
Too many partitions for single INSERT block (more than 100). Reported in this PR
Memory limit (for query) exceeded: would use 9.31 GiB
It would help a great deal to document the CH requirements (e.g. how much RAM is required on the server) and the custom settings that have to be raised above the defaults to be able to load the data.
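For reference, the second error is governed by ClickHouse's per-query max_memory_usage setting; its 10000000000-byte default is exactly the 9.31 GiB reported above, and the users.xml written by the script above raises it permanently to ~100 GB. A minimal sketch of inspecting it and overriding it for a single invocation (the 20 GB value is illustrative; settings can be passed as clickhouse-client flags):

# show the current per-query memory limit, in bytes
clickhouse-client --query "SELECT name, value FROM system.settings WHERE name = 'max_memory_usage'"
# override it for one invocation while loading (replace SELECT 1 with the offending query)
clickhouse-client --max_memory_usage=20000000000 --query "SELECT 1"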