sotorrent / db-scripts

SQL and Bash scripts to import the official Stack Overflow data dump and the SOTorrent data set, to retrieve Stack Overflow references from the BigQuery GitHub data set, and to retrieve data from the SOTorrent data set for analysis.
Apache License 2.0

Loading Data into MySQL Performance? #12

Closed: awwong1 closed this issue 5 years ago

awwong1 commented 5 years ago

Hello everyone, maybe someone more familiar with databases can provide some insight. I'm trying to load the SOTorrent dataset into MySQL (the only database supported by the provided scripts).

I've tried the default Ubuntu 18.04 MySQL package (5.7) with innodb_buffer_pool_size set to 100G (80% of my machine's RAM) with little success: it hung while loading the posts from XML, no matter which configuration I tried.
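(A quick way to confirm that the buffer pool setting actually took effect after restarting the server, if that helps anyone reproducing this:)

-- Should report the configured value in bytes, not the 134217728-byte (128 MB) default.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';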

I then uninstalled MySQL and installed Percona Server 5.7, which no longer hangs while loading the data (even without changing the provided mysql.cnf) but is still taking a long time (over 5 hours to load Posts.xml).

Is this normal? If anyone else has done this, can they share how long this step took for them?
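For reference, the monitor output below is presumably the result of the standard status command; note that it reports "Buffer pool size, bytes 134201344", i.e. roughly the 128 MB default rather than the intended 100G:

SHOW ENGINE INNODB STATUS\G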

=====================================
2019-01-10 07:36:11 0x7fd42f0c6700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 20 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 9545 srv_active, 0 srv_shutdown, 661 srv_idle
srv_master_thread log flush and writes: 10206
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 417814
OS WAIT ARRAY INFO: signal count 626586
RW-shared spins 0, rounds 418854, OS waits 133090
RW-excl spins 0, rounds 1026711, OS waits 32607
RW-sx spins 169087, rounds 1143868, OS waits 2139
Spin rounds per wait: 418854.00 RW-shared, 1026711.00 RW-excl, 6.76 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 3469
Purge done for trx's n:o < 3469 undo n:o < 0 state: running but idle
History list length 55
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 3205, ACTIVE 18729 sec inserting
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 17241741
MySQL thread id 3, OS thread handle 140549651064576, query id 91 localhost alexander executing
LOAD XML LOCAL INFILE 'Posts.xml'
INTO TABLE `Posts`
ROWS IDENTIFIED BY '<row>'
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
 ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 2
657868 OS file reads, 3017833 OS file writes, 128523 OS fsyncs
21.90 reads/s, 16384 avg bytes/read, 33.80 writes/s, 1.80 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 141, free list len 2952, seg size 3094, 548832 merges
merged operations:
 insert 12061737, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 0 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
Hash table size 34673, node heap has 1 buffer(s)
434.03 hash searches/s, 680.27 non-hash searches/s
---
LOG
---
Log sequence number 31020000156
Log flushed up to   31018341389
Pages flushed up to 30944996237
Last checkpoint at  30943362254
Max checkpoint age    80826164
Checkpoint age target 78300347
Modified age          75003919
Checkpoint age        76637902
0 pending log flushes, 0 pending chkp writes
45849 log i/o's done, 0.65 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 139722752
Dictionary memory allocated 275767
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 2367424     (2219072 + 148352)
    Page hash           139112 (buffer pool 0 only)
    Dictionary cache    830535  (554768 + 275767)
    File system         837616  (812272 + 25344)
    Lock system         334008  (332872 + 1136)
    Recovery system     0   (0 + 0)
Buffer pool size   8191
Buffer pool size, bytes 134201344
Free buffers       1024
Database pages     7160
Old database pages 2623
Modified db pages  5975
Pending reads      0
Pending writes: LRU 0, flush list 2, single page 0
Pages made young 4054069, not young 7260092
1.05 youngs/s, 0.20 non-youngs/s
Pages read 657197, created 1610035, written 2911836
21.90 reads/s, 22.30 creates/s, 32.15 writes/s
Buffer pool hit rate 995 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 7160, unzip_LRU len: 0
I/O sum[4678]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
1 RW transactions active inside InnoDB
Process ID=40131, Main thread ID=140549265565440, state: sleeping
Number of rows inserted 56286918, updated 0, deleted 0, read 62
246.64 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

Full MySQL logs can be found in this GitHub gist.

$ uname -a
Linux patience 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ mysql --version
mysql  Ver 14.14 Distrib 5.7.24-27, for debian-linux-gnu (x86_64) using  7.0

$ free
              total        used        free      shared  buff/cache   available
Mem:      131945828     4268888    63092116        1768    64584824   126421664
Swap:       1003516           0     1003516

$ cat /proc/cpuinfo | grep model\ name | head -1
model name  : AMD Ryzen Threadripper 2920X 12-Core Processor
awwong1 commented 5 years ago
$ mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 44
Server version: 5.7.24-27 Percona Server (GPL), Release '27', Revision 'bd42700'

Copyright (c) 2009-2018 Percona LLC and/or its affiliates
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show processlist;
+----+-----------+-----------+----------------+---------+-------+-----------+---------------------------------------------------------------------------------+-----------+---------------+
| Id | User      | Host      | db             | Command | Time  | State     | Info                                                                            | Rows_sent | Rows_examined |
+----+-----------+-----------+----------------+---------+-------+-----------+---------------------------------------------------------------------------------+-----------+---------------+
|  3 | alexander | localhost | sotorrent18_12 | Query   | 20209 | executing | LOAD XML LOCAL INFILE 'Posts.xml'
INTO TABLE `Posts`
ROWS IDENTIFIED BY '<row>' |         0 |             0 |
| 44 | alexander | localhost | NULL           | Query   |     0 | starting  | show processlist                                                                |         0 |             0 |
+----+-----------+-----------+----------------+---------+-------+-----------+---------------------------------------------------------------------------------+-----------+---------------+
2 rows in set (0.00 sec)
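For anyone watching a similarly long-running LOAD from another session, one option (assuming MySQL/Percona 5.7 or later) is to poll the running transaction's row counter and compute an insert rate from two samples:

-- trx_rows_modified grows as LOAD XML inserts rows, comparable to the
-- "undo log entries" figure in the InnoDB monitor output above.
SELECT trx_mysql_thread_id, trx_started, trx_rows_modified
FROM information_schema.INNODB_TRX;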
sbaltes commented 5 years ago

I always test the SOTorrent releases on a regular Ubuntu 16.04 LTS desktop PC running MySQL Server 5.7. Importing the dataset takes about two days in total, and loading Posts.xml alone definitely takes less than 5 hours. However, I have to say that our test system has two separate SSDs: one that runs the operating system and holds the database files, and one that is only used to read the XML and CSV files from.

awwong1 commented 5 years ago

@sbaltes Are you willing to share your system's specs and mysql.cnf file? My MySQL data directory is on two HDDs running in RAID 1. I don't think I'm bottlenecked by hardware yet; the average write speed was ~15 MB/s over the last 6 hours.

Even while MySQL is loading the XML file, I can get well above that in raw disk read/write throughput.
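One way to cross-check what the server itself is writing (as opposed to raw disk throughput) would be to sample InnoDB's cumulative write counter twice and divide the difference by the interval; a rough sketch:

-- Cumulative bytes written by InnoDB since server start; sample it twice,
-- e.g. 60 seconds apart, to estimate the current write rate.
SHOW GLOBAL STATUS LIKE 'Innodb_data_written';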

$ df -h
Filesystem                     Size  Used Avail Use% Mounted on
udev                            63G     0   63G   0% /dev
tmpfs                           13G  1.8M   13G   1% /run
/dev/mapper/patience--vg-root  457G   20G  414G   5% /
tmpfs                           63G     0   63G   0% /dev/shm
tmpfs                          5.0M     0  5.0M   0% /run/lock
tmpfs                           63G     0   63G   0% /sys/fs/cgroup
/dev/sdb1                      110G   61M  104G   1% /media/taxi
/dev/sda2                      721M  145M  540M  22% /boot
/dev/md0                       3.6T  423G  3.0T  13% /media/freight
/dev/sda1                      511M  6.1M  505M   2% /boot/efi
/dev/sde1                      1.8T   77M  1.7T   1% /media/tram
tmpfs                           13G     0   13G   0% /run/user/1000

$ hdparm -tT /dev/md0

/dev/md0:
 Timing cached reads:   19466 MB in  2.00 seconds = 9744.07 MB/sec
 Timing buffered disk reads: 196 MB in  3.06 seconds =  64.09 MB/sec

$ dd count=1k bs=1M if=/dev/zero of=/media/freight/test.img
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.24641 s, 172 MB/s

$ dd count=1k bs=1M if=/dev/zero of=/media/freight/test2.img
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.469443 s, 2.3 GB/s

$ dd count=1k bs=1M if=/dev/zero of=/media/freight/test3.img
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.23511 s, 869 MB/s
sbaltes commented 5 years ago

It's a FUJITSU Desktop ESPRIMO P957/E85+ (Intel Core i7-7700, 64GB RAM).

The only file I modified is /etc/mysql/mysql.conf.d/mysqld.cnf:

#
# The MySQL database server configuration file.
#
# You can copy this to one of:
# - "/etc/mysql/my.cnf" to set global options,
# - "~/.my.cnf" to set user-specific options.
# 
# One can use all long options that the program supports.
# Run program with --help to get a list of available options and with
# --print-defaults to see which it would actually understand and use.
#
# For explanations see
# http://dev.mysql.com/doc/mysql/en/server-system-variables.html

# This will be passed to all mysql clients
# It has been reported that passwords should be enclosed with ticks/quotes
# especially if they contain "#" chars...
# Remember to edit /etc/mysql/debian.cnf when changing the socket location.

# Here are entries for some specific programs
# The following values assume you have at least 32M ram

[mysqld_safe]
socket      = /var/run/mysqld/mysqld.sock
nice        = 0

[mysqld]
# allow file export to that folder
secure-file-priv = /ssd/tmp/
#
# * Basic Settings
#
user        = mysql
pid-file    = /var/run/mysqld/mysqld.pid
socket      = /var/run/mysqld/mysqld.sock
port        = 3306
basedir     = /usr
datadir     = /ssd/mysql
tmpdir      = /ssd/tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
#
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
bind-address        = 0.0.0.0
#
# * Fine Tuning
#
key_buffer_size     = 10G
max_allowed_packet  = 16M
thread_stack        = 192K
thread_cache_size       = 8
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam-recover-options  = BACKUP
#max_connections        = 100
#table_cache            = 64
#thread_concurrency     = 10
#
# * Query Cache Configuration
#
query_cache_limit   = 1M
query_cache_size        = 16M
#
# * Logging and Replication
#
# Both locations get rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file        = /var/log/mysql/mysql.log
#general_log             = 1
#
# Error log - should be very few entries.
#
log_error = /var/log/mysql/error.log
#
# Here you can see queries with especially long duration
#log_slow_queries   = /var/log/mysql/mysql-slow.log
#long_query_time = 2
#log-queries-not-using-indexes
#
# The following can be used as easy to replay backup logs or for replication.
# note: if you are setting up a replication slave, see README.Debian about
#       other settings you may need to change.
#server-id      = 1
#log_bin            = /var/log/mysql/mysql-bin.log
expire_logs_days    = 10
max_binlog_size   = 100M
#binlog_do_db       = include_database_name
#binlog_ignore_db   = include_database_name
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
innodb_buffer_pool_size = 48G  
innodb_buffer_pool_instances = 8
#
# * Security Features
#
# Read the manual, too, if you want chroot!
# chroot = /var/lib/mysql/
#
# For generating SSL certificates I recommend the OpenSSL GUI "tinyca".
#
# ssl-ca=/etc/mysql/cacert.pem
# ssl-cert=/etc/mysql/server-cert.pem
# ssl-key=/etc/mysql/server-key.pem
awwong1 commented 5 years ago

Okay, this is my configuration file:

$ cat /etc/mysql/percona-server.conf.d/mysqld.cnf
[mysqld]

# GENERAL #
user    = mysql
port    = 3306
default-storage-engine  = InnoDB
socket  = /var/run/mysqld/mysqld.sock
pid-file    = /var/run/mysqld/mysqld.pid

# MyISAM #
key-buffer-size = 32M
myisam-recover-options  = FORCE,BACKUP

# SAFETY #
max-allowed-packet  = 16M
max-connect-errors  = 1000000

# DATA STORAGE #
datadir    = /media/freight/mysql

# CACHES AND LIMITS #
tmp-table-size  = 32M
max-heap-table-size = 32M
query-cache-type    = 0
query-cache-size    = 0
max-connections = 500
thread-cache-size   = 50
open-files-limit    = 65535
table-definition-cache  = 1024
table-open-cache    = 2048

# INNODB #
innodb-flush-method = O_DIRECT
innodb-log-files-in-group   = 2
innodb-log-file-size    = 512M
innodb-flush-log-at-trx-commit  = 1
innodb-file-per-table   = 1
innodb-buffer-pool-size = 100G
innodb-buffer-pool-instances    = 8

# LOGGING #
log-error   = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes   = 1
slow-query-log  = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log

# MISC #
basedir = /usr
tmpdir  = /tmp
lc-messages-dir = /usr/share/mysql
explicit_defaults_for_timestamp

# Recommended in standard MySQL setup
# sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_ALL_TABLES

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

I'll let this run for another day and then see what happens.

One more silly question: can you run the following query and share the results?

SELECT table_schema AS `Database`,
       table_name AS `Table`,
       ROUND(((data_length + index_length) / 1024 / 1024), 2) AS `Size in MB`,
       table_rows AS `Num Rows`
FROM information_schema.TABLES
WHERE table_schema = 'sotorrent18_12'
ORDER BY (data_length + index_length) DESC;
sbaltes commented 5 years ago

Sure, here you go:

+----------------+------------------------+------------+-----------+
| Database       | Table                  | Size in MB | Num Rows  |
+----------------+------------------------+------------+-----------+
| sotorrent18_12 | PostBlockVersion       |  131258.86 | 202582014 |
| sotorrent18_12 | PostHistory            |  123914.00 | 111996687 |
| sotorrent18_12 | PostBlockDiff          |   77879.92 | 137823683 |
| sotorrent18_12 | Posts                  |   64010.97 |  40938286 |
| sotorrent18_12 | Comments               |   14023.00 |  67795178 |
| sotorrent18_12 | PostVersionUrl         |   13157.89 |  33994431 |
| sotorrent18_12 | Votes                  |   10854.95 | 161224192 |
| sotorrent18_12 | PostVersion            |    9296.88 |  67185215 |
| sotorrent18_12 | GHMatches              |    5793.64 |    599713 |
| sotorrent18_12 | TitleVersion           |    3522.86 |  19128135 |
| sotorrent18_12 | CommentUrl             |    2499.63 |   7186461 |
| sotorrent18_12 | PostReferenceGH        |    2056.27 |   6200977 |
| sotorrent18_12 | Badges                 |    1906.00 |  29041921 |
| sotorrent18_12 | Users                  |    1523.00 |   9292235 |
| sotorrent18_12 | PostLinks              |     439.08 |   5484805 |
| sotorrent18_12 | Tags                   |       3.52 |     52317 |
| sotorrent18_12 | PostType               |       0.02 |         9 |
| sotorrent18_12 | PostHistoryType        |       0.02 |        31 |
| sotorrent18_12 | PostBlockType          |       0.02 |         2 |
| sotorrent18_12 | PostBlockDiffOperation |       0.02 |         3 |
+----------------+------------------------+------------+-----------+
awwong1 commented 5 years ago

I gave up importing the data into mysql. It was faster for me to load everything into sqlite3 (this took around 2 full days).

https://github.com/awwong1/sotorrent-sqlite3

sbaltes commented 5 years ago

> I gave up importing the data into mysql.

I'm sorry to hear that.

> It was faster for me to load everything into sqlite3 (this took around 2 full days).
> https://github.com/awwong1/sotorrent-sqlite3

I will update our system to Ubuntu 18.04 before importing the next SOTorrent release. Maybe I can reproduce the issues you faced. Until then, I'm happy to point others to the script you wrote. Thanks!

habedi commented 5 years ago

Hi,

I am sorry to intervene, but out of curiosity and pure interest I have a few suggestions that may speed up the whole loading process:

Trying other MySQL Storage Engines:

The default MySQL storage engine, InnoDB, is best suited for transactional workloads (e.g., lots of small concurrent reads and writes). For analytic workloads, such as running different queries to extract a specific subset of the data (usually many reads and few or no writes), other storage engines may perform better. I would suggest Aria: it supports full-text indexing and search (the SOTorrent dataset contains a lot of text), and since it does not support foreign keys, I guess it takes less time to load the data into the tables.
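A minimal sketch of what switching the engine for the bulk load might look like (note that Aria ships with MariaDB; on stock MySQL or Percona Server, MyISAM would be the closest engine without foreign key support, and any existing foreign keys on the table would have to be dropped before converting):

-- Illustrative only: converting a large, already populated table is itself costly.
ALTER TABLE `Posts` ENGINE=MyISAM;
-- ... run the LOAD XML statements ...
-- Optionally convert back afterwards:
ALTER TABLE `Posts` ENGINE=InnoDB;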

Removing the constraints from the table definitions:

I am assuming the integrity of the SOTorrent data is already checked and verified before publication, so you could try loading the data into tables that have no primary or foreign keys (you know the relations exist in the data, but you don't make the database engine check them every time you read or write a record). Again, InnoDB is not a good fit here because it needs at least one attribute as a (pre-sorted) primary key and will create one implicitly if you don't define a primary key in your table. You can try adding the constraints back after loading the data.
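A sketch of that idea; the constraint and column names below are only illustrative and not taken from the actual SOTorrent schema:

-- Drop a foreign key before the bulk load ...
ALTER TABLE `Posts` DROP FOREIGN KEY `fk_posts_posttype`;
-- ... load the data, then re-add the constraint afterwards:
ALTER TABLE `Posts`
  ADD CONSTRAINT `fk_posts_posttype`
  FOREIGN KEY (`PostTypeId`) REFERENCES `PostType` (`Id`);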

Off-topic: moving the data to Elasticsearch:

If you need more advanced features for searching the post texts, I guess you could try Elasticsearch. It has more advanced text processing and search capabilities than an RDBMS like MySQL, and you could, for example, run advanced full-text queries directly on the posts themselves.

Note: I haven't tried any of these myself; I've just had similar experiences loading huge tables into other databases.

sbaltes commented 5 years ago

Thanks for your feedback. One small comment regarding:

> Removing the constraints from the table definitions

The foreign key checks are already disabled before the data is loaded; see, e.g.:

https://github.com/sotorrent/db-scripts/blob/bd72831de0fdef4f2ade7b86505daa06dcd063ea/sotorrent/2_load_so_from_xml.sql#L3

habedi commented 5 years ago

@sbaltes thanks for mentioning that. Actually, I was not quite sure how repeatedly disabling and then re-enabling the foreign key checks for each load operation affects the total time needed to load all the tables: in the helper script (6_load_sotorrent.sql), foreign key checks are disabled right before each load operation and re-enabled right after the table has been loaded. I think it might be faster to disable the foreign key checks once at the start of the script and re-enable them only at the end; the inside of 6_load_sotorrent.sql could then look something like this, I guess:

SET FOREIGN_KEY_CHECKS=0;

-- CONTENTS OF CURRENT 6_load_sotorrent.sql WITHOUT TWEAKING THE FOREIGN KEY CHECKS SEPARATELY FOR EACH TABLE

SET FOREIGN_KEY_CHECKS=1;

I guess the main drawback of this approach is that the foreign key constraints on the tables are not verified until all the tables have been loaded, which might lead to data inconsistency errors surfacing late.
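Since re-enabling FOREIGN_KEY_CHECKS does not retroactively validate rows that were inserted while it was off, such inconsistencies could be detected afterwards with an anti-join per relationship; a hypothetical example (column names assumed, not checked against the schema):

-- Find posts whose PostTypeId does not resolve to a row in PostType.
SELECT p.Id
FROM Posts p
LEFT JOIN PostType pt ON p.PostTypeId = pt.Id
WHERE pt.Id IS NULL;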

sbaltes commented 5 years ago

> I will update our system to Ubuntu 18.04 before importing the next SOTorrent release. Maybe I can reproduce the issues you faced.

I'm currently uploading the new release 2019-03-17 to Zenodo. I created and tested this release on Windows 10 and Ubuntu 18.04 running MySQL Server 8.0. The import takes less than 24 hours on the above-mentioned hardware. I added some tips to the README file.