ncbo / umls2rdf

These python scripts connect to the Unified Medical Language System (UMLS) database and translate the ontologies into RDF/OWL files. This is part of the BioPortal project.
http://bioportal.bioontology.org/
Other
84 stars 37 forks source link

Error MySQL dropping connection / running out of memory #6

Closed angellr closed 11 years ago

angellr commented 11 years ago

System: Ubuntu 12.04/6GB Ram; MySQL 5.5; UMLS 2013AA and have the entire UMLS database loaded. I had gone into the code to see if I could tweak and could not see anything easy to fix (yet). I did change the page size to 250000 to see if this helped with the "out of memory" errors, but they persisted. Not sure if connect -> fetch smaller chunks data -> disconnect would help this. Am at a loss to figure out how to proceed. I have changed as many MySQL parameters as I can think of ... also, I my system resources only exhaust about 4GB of memory when this runs. I am providing the output to the error message when I try the program against RXNORM and my MySQL my.cnf file located in /etc/mysql on the box. Any/all help in this process would be greatly appreciated. Thank you in advance for your help.

-Bob-

Here is output when trying to extract RXNORM ONLY as follows:

angellr@dDenali:~/sandbox/ontology/umls2rdf-master$ ./umls2rdf.py Generating RXNORM (virtual_id: 1423, using 'load_on_codes') [UMLS-Query] SELECT * FROM MRCONSO WHERE SAB = 'RXNORM' AND lat = 'ENG' LIMIT 500000 OFFSET 0 [UMLS-Query] SELECT * FROM MRCONSO WHERE SAB = 'RXNORM' AND lat = 'ENG' LIMIT 500000 OFFSET 500000 [UMLS-Query] SELECT * FROM MRCONSO WHERE SAB = 'RXNORM' AND lat = 'ENG' LIMIT 500000 OFFSET 1000000 [UMLS-Query] SELECT * FROM MRCONSO WHERE SAB = 'SRC' AND CODE = 'V-RXNORM' LIMIT 500000 OFFSET 0 [UMLS-Query] SELECT * FROM MRREL WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 0 [UMLS-Query] SELECT * FROM MRREL WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 500000 [UMLS-Query] SELECT * FROM MRREL WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 1000000 [UMLS-Query] SELECT * FROM MRREL WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 1500000 [UMLS-Query] SELECT * FROM MRREL WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 2000000 [UMLS-Query] SELECT * FROM MRREL WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 2500000 [UMLS-Query] SELECT * FROM MRDEF WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 0 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 0 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 500000 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 1000000 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 1500000 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 2000000 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 2500000 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 3000000 [UMLS-Query] SELECT * FROM MRSAT WHERE SAB = 'RXNORM' AND CODE IS NOT NULL LIMIT 500000 OFFSET 3500000 Out of memory (Needed 571480 bytes) [UMLS-Query] SELECT * FROM MRRANK WHERE SAB = 'RXNORM' LIMIT 500000 OFFSET 0 Traceback (most recent call last): File "./umls2rdf.py", line 587, in ont.load_tables() File "./umls2rdf.py", line 456, in load_tables for rank in mrrank.scan(filt=mrrank_filt): File "./umls2rdf.py", line 187, in scan cursor.execute(q) File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute self.errorhandler(self, exc, value) File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue _mysql_exceptions.OperationalError: (2013, 'Lost connection to MySQL server during query') angellr@dDenali:~/sandbox/ontology/umls2rdf-master$

My MySQL config file (with comments on some of the parameters that I have changed) as follows:

The MySQL database server configuration file.

You can copy this to one of:

- "/etc/mysql/my.cnf" to set global options,

- "~/.my.cnf" to set user-specific options.

One can use all long options that the program supports.

Run program with --help to get a list of available options and with

--print-defaults to see which it would actually understand and use.

For explanations see

http://dev.mysql.com/doc/mysql/en/server-system-variables.html

This will be passed to all mysql clients

It has been reported that passwords should be enclosed with ticks/quotes

escpecially if they contain "#" chars...

Remember to edit /etc/mysql/debian.cnf when changing the socket location.

[client] port = 3306 socket = /var/run/mysqld/mysqld.sock loose-local-infile = 1 user = XXXXXXXX

password = XXXXXXXX

database = umls

Here is entries for some specific programs

The following values assume you have at least 32M ram

This was formally known as [safe_mysqld]. Both versions are currently parsed.

[mysqld_safe] socket = /var/run/mysqld/mysqld.sock nice = 0

[mysqld]

* Basic Settings

user = mysql pid-file = /var/run/mysqld/mysqld.pid socket = /var/run/mysqld/mysqld.sock port = 3306 basedir = /usr datadir = /var/lib/mysql tmpdir = /tmp lc-messages-dir = /usr/share/mysql skip-external-locking

Instead of skip-networking the default is now to listen only on

localhost which is more compatible and is not less secure.

bind-address = 127.0.0.1

* Fine Tuning

key_buffer = 600M

table_cache = 300

table_cache = 900 (increased this from 600 -> 900)

sort_buffer_cache = 20M

read_buffer_cache = 200M

query_cache_limit = 3M

query_cache_size = 100M

query_cache_limit = 5M (increased from 3 -> 5M) query_cache_size = 300M (increased from 100 -> 300) myisam_sort_buffer_size = 200M bulk_insert_buffer_size = 100M join_buffer_size = 100M

key_buffer = 16M

max_allowed_packet = 64M (was 16M ... increased to 32, then 64) thread_stack = 192K thread_cache_size = 8

This replaces the startup script and checks MyISAM tables if needed

the first time they are touched

myisam-recover = BACKUP

max_connections = 100

table_cache = 64

thread_concurrency = 10

* Query Cache Configuration

query_cache_limit = 1M

query_cache_size = 16M

* Logging and Replication

Both location gets rotated by the cronjob.

Be aware that this log type is a performance killer.

As of 5.1 you can enable the log at runtime!

general_log_file = /var/log/mysql/mysql.log

general_log = 1

Error log - should be very few entries.

log_error = /var/log/mysql/error.log

Here you can see queries with especially long duration

log_slow_queries = /var/log/mysql/mysql-slow.log

long_query_time = 2

log-queries-not-using-indexes

The following can be used as easy to replay backup logs or for replication.

note: if you are setting up a replication slave, see README.Debian about

other settings you may need to change.

server-id = 1

log_bin = /var/log/mysql/mysql-bin.log

expire_logs_days = 10 max_binlog_size = 100M

binlog_do_db = include_database_name

binlog_ignore_db = include_database_name

* InnoDB

InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.

Read the manual for more InnoDB related options. There are many!

* Security Features

Read the manual, too, if you want chroot!

chroot = /var/lib/mysql/

For generating SSL certificates I recommend the OpenSSL GUI "tinyca".

ssl-ca=/etc/mysql/cacert.pem

ssl-cert=/etc/mysql/server-cert.pem

ssl-key=/etc/mysql/server-key.pem

[mysqldump] quick quote-names max_allowed_packet = 16M

[mysql]

no-auto-rehash # faster start of mysql but no tab completition

[isamchk] key_buffer = 16M

* IMPORTANT: Additional settings that can override those from this file!

The files must end with '.cnf', otherwise they'll be ignored.

!includedir /etc/mysql/conf.d/

msalvadores commented 11 years ago

UMLS2RDF is meant to run in a server setting. UMLS2RDF trades memory for speed loading each ontologies at once in memory. We run the mysql server and UMLS2RDF in separate boxes both with 32G RAM.

6G RAM for both mysql and umls2rdf seems not sufficient.

angellr commented 11 years ago

Mr. Salvadores:

Is there a work around to the memory issue? As this is a one-time data dump, I would rather not have to improve my hardware to make this work. Without digging through the code, could the ontology be built in stages? I just need RXNorm and NDFRT from the UMLS. Appreciate all your help in this matter. Thank you.

-Bob-

Bob Angell 1238 Fenway Avenue Salt Lake City, Utah 84102-3212 SLC (801) 583-8544 / NYC (347) 788-8544 cell: (801) 706-2520 regular email: bangell_AT_gmail.com

LinkedIn: http://linkd.in/bmS7G8

In an effort to reduce email volume, no need to reply to this message.

LinkedIn: http://linkd.in/bmS7G8

“Those who would trade essential liberty for some temporary security; deserve neither.” AND "Democracy is two wolves and a lamb voting on what to have for lunch. Liberty is a well-armed lamb contesting the vote." -- Benjamin Franklin "Malo Periculosam Libertatem Quam Quietum Servitium" "Non Illegitimus Carborundum Est"; "Had Momma Cass and Karen Carpenter shared that ham sandwich, they might both be with us today!"; "Failure is only an event" - Zig Ziglar

On Oct 5, 2013, at 11:41 AM, Manuel Salvadores notifications@github.com wrote:

UMLS2RDF is meant to run in a server setting. UMLS2RDF trades memory for speed loading each ontologies at once in memory. We run the mysql server and UMLS2RDF in separate boxes both with 32G RAM.

6G RAM for both mysql and umls2rdf seems not sufficient.

— Reply to this email directly or view it on GitHub.

msalvadores commented 11 years ago

In BioPortal we use UMLS2RDF internally to load ontologies in our backend.

Most ontologies are avalaible for download as well. Some of them aren't due to licensing issues. You can contact the BioPortal support list if you run into issues downloading the ontologies you need.

http://bioportal.bioontology.org/ontologies/ http://bioportal.bioontology.org/ontologies/RXNORM http://bioportal.bioontology.org/ontologies/NDFRT

angellr commented 11 years ago

This is the error one receives when visiting these links ….

Thank you again for all your help.

-Bob-

Bob Angell 1238 Fenway Avenue Salt Lake City, Utah 84102-3212 SLC (801) 583-8544 / NYC (347) 788-8544 cell: (801) 706-2520 regular email: bangell_AT_gmail.com

LinkedIn: http://linkd.in/bmS7G8

In an effort to reduce email volume, no need to reply to this message.

LinkedIn: http://linkd.in/bmS7G8

“Those who would trade essential liberty for some temporary security; deserve neither.” AND "Democracy is two wolves and a lamb voting on what to have for lunch. Liberty is a well-armed lamb contesting the vote." -- Benjamin Franklin "Malo Periculosam Libertatem Quam Quietum Servitium" "Non Illegitimus Carborundum Est"; "Had Momma Cass and Karen Carpenter shared that ham sandwich, they might both be with us today!"; "Failure is only an event" - Zig Ziglar

On Oct 5, 2013, at 12:22 PM, Manuel Salvadores notifications@github.com wrote:

In BioPortal we use UMLS2RDF internally to load ontologies in our backend.

Most ontologies are avalaible for download as well. Some of them aren't due to licensing issues. You can contact the BioPortal support list if you run into issues downloading the ontologies you need.

http://bioportal.bioontology.org/ontologies/ http://bioportal.bioontology.org/ontologies/RXNORM http://bioportal.bioontology.org/ontologies/NDFRT

— Reply to this email directly or view it on GitHub.

msalvadores commented 11 years ago

I cannot reproduce those errors. Try again please.

angellr commented 11 years ago

Manuel,

The purl finally worked … must have been the server upgrade or something on their end.

-Bob-

Bob Angell 1238 Fenway Avenue Salt Lake City, Utah 84102-3212 SLC (801) 583-8544 / NYC (347) 788-8544 cell: (801) 706-2520 regular email: bangell_AT_gmail.com

LinkedIn: http://linkd.in/bmS7G8

In an effort to reduce email volume, no need to reply to this message.

LinkedIn: http://linkd.in/bmS7G8

“Those who would trade essential liberty for some temporary security; deserve neither.” AND "Democracy is two wolves and a lamb voting on what to have for lunch. Liberty is a well-armed lamb contesting the vote." -- Benjamin Franklin "Malo Periculosam Libertatem Quam Quietum Servitium" "Non Illegitimus Carborundum Est"; "Had Momma Cass and Karen Carpenter shared that ham sandwich, they might both be with us today!"; "Failure is only an event" - Zig Ziglar

On Oct 5, 2013, at 1:55 PM, Manuel Salvadores notifications@github.com wrote:

I cannot reproduce those errors. Try again please.

— Reply to this email directly or view it on GitHub.