Closed: awwong1 closed this issue 5 years ago.
$ mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 44
Server version: 5.7.24-27 Percona Server (GPL), Release '27', Revision 'bd42700'
Copyright (c) 2009-2018 Percona LLC and/or its affiliates
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> show processlist;
+----+-----------+-----------+----------------+---------+-------+-----------+---------------------------------------------------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+-----------+-----------+----------------+---------+-------+-----------+---------------------------------------------------------------------------------+-----------+---------------+
| 3 | alexander | localhost | sotorrent18_12 | Query | 20209 | executing | LOAD XML LOCAL INFILE 'Posts.xml'
INTO TABLE `Posts`
ROWS IDENTIFIED BY '<row>' | 0 | 0 |
| 44 | alexander | localhost | NULL | Query | 0 | starting | show processlist | 0 | 0 |
+----+-----------+-----------+----------------+---------+-------+-----------+---------------------------------------------------------------------------------+-----------+---------------+
2 rows in set (0.00 sec)
I always test the SOTorrent releases on a regular Ubuntu 16.04 LTS desktop PC running MySQL Server 5.7. Importing the dataset takes about two days in total, and loading Posts.xml alone definitely takes less than 5 hours. However, I have to say that our test system has two separate SSDs: one that runs the operating system and holds the database files, and one that is only used to read the XML and CSV files from.
@sbaltes Are you willing to share your system's specs and mysql.cnf file? My MySQL data directory is on two HDDs running in RAID 1. I don't think I'm bottlenecked by hardware yet: the average write speed was ~15 MB/s over the last 6 hours, and even while MySQL is loading the XML file, I can get well over that in disk reads/writes.
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 1.8M 13G 1% /run
/dev/mapper/patience--vg-root 457G 20G 414G 5% /
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sdb1 110G 61M 104G 1% /media/taxi
/dev/sda2 721M 145M 540M 22% /boot
/dev/md0 3.6T 423G 3.0T 13% /media/freight
/dev/sda1 511M 6.1M 505M 2% /boot/efi
/dev/sde1 1.8T 77M 1.7T 1% /media/tram
tmpfs 13G 0 13G 0% /run/user/1000
$ hdparm -tT /dev/md0
/dev/md0:
Timing cached reads: 19466 MB in 2.00 seconds = 9744.07 MB/sec
Timing buffered disk reads: 196 MB in 3.06 seconds = 64.09 MB/sec
$ dd count=1k bs=1M if=/dev/zero of=/media/freight/test.img
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.24641 s, 172 MB/s
$ dd count=1k bs=1M if=/dev/zero of=/media/freight/test2.img
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.469443 s, 2.3 GB/s
$ dd count=1k bs=1M if=/dev/zero of=/media/freight/test3.img
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.23511 s, 869 MB/s
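For what it's worth, the 2.3 GB/s and 869 MB/s figures above mostly measure the Linux page cache rather than the RAID 1 array, because dd reports completion as soon as the data is buffered in memory. A sketch of how to get numbers closer to sustained write throughput (reusing the same scratch path as above):

```shell
# conv=fdatasync makes dd flush the file to disk before reporting,
# so the figure reflects the array rather than the page cache.
dd count=1k bs=1M if=/dev/zero of=/media/freight/test.img conv=fdatasync

# oflag=direct bypasses the page cache entirely (O_DIRECT writes).
dd count=1k bs=1M if=/dev/zero of=/media/freight/test.img oflag=direct
```

Both variants should land much closer to the drives' real sustained write speed than the cached runs.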
It's a FUJITSU Desktop ESPRIMO P957/E85+ (Intel Core i7-7700, 64 GB RAM).
The only file I modified is /etc/mysql/mysql.conf.d/mysqld.cnf:
#
# The MySQL database server configuration file.
#
# You can copy this to one of:
# - "/etc/mysql/my.cnf" to set global options,
# - "~/.my.cnf" to set user-specific options.
#
# One can use all long options that the program supports.
# Run program with --help to get a list of available options and with
# --print-defaults to see which it would actually understand and use.
#
# For explanations see
# http://dev.mysql.com/doc/mysql/en/server-system-variables.html
# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# especially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.
# Here are entries for some specific programs
# The following values assume you have at least 32M ram
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
# allow file export to that folder
secure-file-priv = /ssd/tmp/
#
# * Basic Settings
#
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /ssd/mysql
tmpdir = /ssd/tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address = 0.0.0.0
#
# * Fine Tuning
#
key_buffer_size = 10G
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam-recover-options = BACKUP
#max_connections = 100
#table_cache = 64
#thread_concurrency = 10
#
# * Query Cache Configuration
#
query_cache_limit = 1M
query_cache_size = 16M
#
# * Logging and Replication
#
# Both locations get rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file = /var/log/mysql/mysql.log
#general_log = 1
#
# Error log - should be very few entries.
#
log_error = /var/log/mysql/error.log
#
# Here you can see queries with especially long duration
#log_slow_queries = /var/log/mysql/mysql-slow.log
#long_query_time = 2
#log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
# other settings you may need to change.
#server-id = 1
#log_bin = /var/log/mysql/mysql-bin.log
expire_logs_days = 10
max_binlog_size = 100M
#binlog_do_db = include_database_name
#binlog_ignore_db = include_database_name
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
innodb_buffer_pool_size = 48G
innodb_buffer_pool_instances = 8
#
# * Security Features
#
# Read the manual, too, if you want chroot!
# chroot = /var/lib/mysql/
#
# For generating SSL certificates I recommend the OpenSSL GUI "tinyca".
#
# ssl-ca=/etc/mysql/cacert.pem
# ssl-cert=/etc/mysql/server-cert.pem
# ssl-key=/etc/mysql/server-key.pem
Okay, this is my configuration file:
$ cat /etc/mysql/percona-server.conf.d/mysqld.cnf
[mysqld]
# GENERAL #
user = mysql
port = 3306
default-storage-engine = InnoDB
socket = /var/run/mysqld/mysqld.sock
pid-file = /var/run/mysqld/mysqld.pid
# MyISAM #
key-buffer-size = 32M
myisam-recover-options = FORCE,BACKUP
# SAFETY #
max-allowed-packet = 16M
max-connect-errors = 1000000
# DATA STORAGE #
datadir = /media/freight/mysql
# CACHES AND LIMITS #
tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 500
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 1024
table-open-cache = 2048
# INNODB #
innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table = 1
innodb-buffer-pool-size = 100G
innodb-buffer-pool-instances = 8
# LOGGING #
log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 1
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log
# MISC #
basedir = /usr
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
explicit_defaults_for_timestamp
# Recommended in standard MySQL setup
# sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_ALL_TABLES
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
I'll let this run for another day and then see what happens.
One more silly question: can you run the following query and share the results?
SELECT table_schema AS `Database`,
       table_name AS `Table`,
       ROUND(((data_length + index_length) / 1024 / 1024), 2) AS `Size in MB`,
       table_rows AS `Num Rows`
FROM information_schema.TABLES
WHERE table_schema = 'sotorrent18_12'
ORDER BY (data_length + index_length) DESC;
Sure, here you go:
+----------------+------------------------+------------+-----------+
| Database | Table | Size in MB | Num Rows |
+----------------+------------------------+------------+-----------+
| sotorrent18_12 | PostBlockVersion | 131258.86 | 202582014 |
| sotorrent18_12 | PostHistory | 123914.00 | 111996687 |
| sotorrent18_12 | PostBlockDiff | 77879.92 | 137823683 |
| sotorrent18_12 | Posts | 64010.97 | 40938286 |
| sotorrent18_12 | Comments | 14023.00 | 67795178 |
| sotorrent18_12 | PostVersionUrl | 13157.89 | 33994431 |
| sotorrent18_12 | Votes | 10854.95 | 161224192 |
| sotorrent18_12 | PostVersion | 9296.88 | 67185215 |
| sotorrent18_12 | GHMatches | 5793.64 | 599713 |
| sotorrent18_12 | TitleVersion | 3522.86 | 19128135 |
| sotorrent18_12 | CommentUrl | 2499.63 | 7186461 |
| sotorrent18_12 | PostReferenceGH | 2056.27 | 6200977 |
| sotorrent18_12 | Badges | 1906.00 | 29041921 |
| sotorrent18_12 | Users | 1523.00 | 9292235 |
| sotorrent18_12 | PostLinks | 439.08 | 5484805 |
| sotorrent18_12 | Tags | 3.52 | 52317 |
| sotorrent18_12 | PostType | 0.02 | 9 |
| sotorrent18_12 | PostHistoryType | 0.02 | 31 |
| sotorrent18_12 | PostBlockType | 0.02 | 2 |
| sotorrent18_12 | PostBlockDiffOperation | 0.02 | 3 |
+----------------+------------------------+------------+-----------+
I gave up importing the data into mysql. It was faster for me to load everything into sqlite3 (this took around 2 full days).
> I gave up importing the data into mysql.

I'm sorry to hear that.

> It was faster for me to load everything into sqlite3 (this took around 2 full days).
I will update our system to Ubuntu 18.04 before importing the next SOTorrent release. Maybe I can reproduce the issues you faced. Until then, I'm happy to point others to the script you wrote. Thanks!
Hi,
Sorry to intervene; out of curiosity and pure interest, I have a few suggestions that may speed up the whole loading process:
The default MySQL storage engine, InnoDB, is well suited for transactional workloads (lots of small concurrent reads and writes). For analytic workloads, such as running different queries to find a specific subset of the data (usually many reads and almost no writes), other storage engines may perform better. I suggest Aria: it supports full-text indexing and search (the SOTorrent dataset contains a lot of text), and since it does not support foreign keys yet, I guess it takes less time to load the data into tables. (Note that Aria ships with MariaDB, not with stock MySQL or Percona Server.)
I assume the integrity of the SOTorrent data is thoroughly checked and verified before publication, so you could try to load the data into tables that have no primary or foreign keys: you know the relations exist in the data, but you don't tell the database engine to check them every time you read or write a record. Again, InnoDB is not a good option here, because it needs at least one attribute as a (pre-sorted) primary key and will implicitly create one itself if you don't provide a primary key in your table definition. You can try adding the constraints after loading the data.
If you need more advanced features for searching in post texts, I guess you could try Elasticsearch. It obviously has more advanced text processing and search capabilities than an RDBMS like MySQL, and you could, for example, run advanced queries directly on the posts themselves.
Notice: I haven't tried any of these myself; I've just had similar experiences loading huge tables into other databases.
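To make the load-without-keys suggestion concrete, here is a rough SQL sketch (column names abbreviated and purely illustrative): create the table without constraints, bulk-load it, then add the keys in one pass at the end, so each index is sorted and built once instead of being maintained row by row during the load.

```sql
-- Illustrative sketch: the real SOTorrent tables have many more columns.
CREATE TABLE Posts (
    Id INT NOT NULL,
    PostTypeId TINYINT NOT NULL
    -- ...
);

LOAD XML LOCAL INFILE 'Posts.xml'
INTO TABLE `Posts`
ROWS IDENTIFIED BY '<row>';

-- Add the keys after the data is in place.
ALTER TABLE Posts
    ADD PRIMARY KEY (Id),
    ADD FOREIGN KEY (PostTypeId) REFERENCES PostType (Id);
```

One caveat with InnoDB: loading into a table with no declared primary key puts the rows into a hidden clustered index, and adding the primary key afterwards rebuilds the whole table, so this trick pays off mainly for secondary indexes and foreign keys.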
Thanks for your feedback. One small comment regarding:
> Removing the constraints from the table definitions
The foreign key checks are being disabled before loading the data, see, e.g.:
@sbaltes thanks for mentioning that. Actually, I was not quite sure how repeatedly disabling and re-enabling the foreign key checks for each load operation would affect the total time needed to load all the tables: in the helper script (6_load_sotorrent.sql), foreign key checks are disabled right before each load operation and re-enabled right after the table has been loaded. I think it might be faster to disable the foreign key checks once at the start of the script and re-enable them only at the end; the inside of 6_load_sotorrent.sql could then look something like this:
SET FOREIGN_KEY_CHECKS=0;
-- CONTENTS OF CURRENT 6_load_sotorrent.sql WITHOUT TWEAKING THE FOREIGN KEY CHECKS SEPARATELY FOR EACH TABLE
SET FOREIGN_KEY_CHECKS=1;
I guess the main drawback of this scenario is that the foreign key constraints are not verified until all the tables are loaded, which might surface data inconsistency errors only late in the process.
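As an aside, MySQL's documented bulk-loading tips for InnoDB suggest relaxing a couple of other session settings alongside the foreign key checks. A sketch of how the frame of the script could look (note that setting FOREIGN_KEY_CHECKS back to 1 does not re-validate rows that were already loaded, so the late-inconsistency caveat above still applies):

```sql
SET autocommit = 0;            -- batch the load into one transaction
SET unique_checks = 0;         -- skip uniqueness checks during the load
SET FOREIGN_KEY_CHECKS = 0;    -- skip FK checks during the load

-- CONTENTS OF 6_load_sotorrent.sql ...

COMMIT;
SET FOREIGN_KEY_CHECKS = 1;    -- does NOT re-check already-loaded rows
SET unique_checks = 1;
SET autocommit = 1;
```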
> I will update our system to Ubuntu 18.04 before importing the next SOTorrent release. Maybe I can reproduce the issues you faced.
I'm currently uploading the new release 2019-03-17 to Zenodo. I created and tested this release on Windows 10 and Ubuntu 18.04 running MySQL Server 8.0. The import takes less than 24 hours on the above-mentioned hardware. I added some tips to the README file.
Hello everyone, maybe someone more familiar with databases can provide some insight? I'm trying to load the SOTorrent dataset into mysql (the only thing supported by the provided scripts). I've tried using the default Ubuntu 18.04 MySQL package (5.7) with innodb_buffer_pool_size set to 100G (80% of my machine's RAM) with little success: it hung on loading posts from XML regardless of configuration. I then uninstalled MySQL and installed Percona Server 5.7, which no longer hangs on loading the data (even without changing the provided mysql.cnf), but it is still taking a long time (over 5 hours to load Posts.xml).
Is this normal? If anyone else has done this, can they share how long this step took for them?
Full mysql logs can be found at this github gist.