sg-wbi / belb

Biomedical Entity Linking Benchmark

build_kbs.py error #3

Open droidlyx opened 1 week ago

droidlyx commented 1 week ago

Hello, I encountered an SQL error when running python -m belb.scripts.build_kbs --dir . --cores 20 --umls ../2017AA-full/2017AA/META --db ./db.yaml:

2024-11-19 15:16:44.868 | INFO     | belb.kbs.kb:to_belb:211 - Start converting ctd_diseases to BELB format...
2024-11-19 15:16:44.868 | INFO     | belb.kbs.kb:write_table:144 - Start writing "kb" table file...
2024-11-19 15:16:45.049 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 10000 entries...
2024-11-19 15:16:45.253 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 20000 entries...
2024-11-19 15:16:45.452 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 30000 entries...
2024-11-19 15:16:45.645 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 40000 entries...
2024-11-19 15:16:45.837 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 50000 entries...
2024-11-19 15:16:46.032 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 60000 entries...
2024-11-19 15:16:46.228 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 70000 entries...
2024-11-19 15:16:46.422 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 80000 entries...
2024-11-19 15:16:46.614 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 90000 entries...
2024-11-19 15:16:46.629 | INFO     | belb.kbs.kb:write_table:183 - Complted writing "kb" table: 90757 total entries.
2024-11-19 15:16:46.629 | INFO     | belb.kbs.kb:write_table:144 - Start writing "identifier_mapping" table file...
2024-11-19 15:16:46.716 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 10000 entries...
2024-11-19 15:16:46.808 | INFO     | belb.kbs.kb:write_table:176 - #PROGRESS: written 20000 entries...
2024-11-19 15:16:46.821 | INFO     | belb.kbs.kb:write_table:183 - Complted writing "identifier_mapping" table: 21295 total entries.
2024-11-19 15:16:46.822 | INFO     | belb.kbs.kb:__init__:300 - Database was not initialized. Call "init_database" before anything else...
2024-11-19 15:16:46.831 | INFO     | belb.kbs.kb:init_database:371 - Initilializing knowledge base database...
Traceback (most recent call last):
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1933, in _exec_single_context
    self.dialect.do_executemany(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 740, in do_executemany
    cursor.executemany(statement, parameters)
sqlite3.IntegrityError: UNIQUE constraint failed: ctd_diseases_kb.identifier, ctd_diseases_kb.name

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bje/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/bje/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/data3/lyx_NER/data/Knowledge_Bases/belb/belb/scripts/build_kbs.py", line 113, in <module>
    main()
  File "/mnt/data3/lyx_NER/data/Knowledge_Bases/belb/belb/scripts/build_kbs.py", line 109, in main
    NAME_TO_KB_MODULE[kb.name].main(args)
  File "/mnt/data3/lyx_NER/data/Knowledge_Bases/belb/belb/kbs/ctd_diseases/ctd_diseases.py", line 154, in main
    handle.init_database()
  File "/mnt/data3/lyx_NER/data/Knowledge_Bases/belb/belb/kbs/kb.py", line 384, in init_database
    self.to_database(
  File "/mnt/data3/lyx_NER/data/Knowledge_Bases/belb/belb/kbs/kb.py", line 358, in to_database
    self.populate_table(table=table, df=df)
  File "/mnt/data3/lyx_NER/data/Knowledge_Bases/belb/belb/kbs/db.py", line 240, in populate_table
    self.connection.execute(table.insert(), df.to_dict("records"))
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1414, in execute
    return meth(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 487, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1638, in _execute_clauseelement
    ret = self._execute_context(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
    return self._exec_single_context(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1983, in _exec_single_context
    self._handle_dbapi_exception(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2325, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1933, in _exec_single_context
    self.dialect.do_executemany(
  File "/mnt/data3/lyx_NER/envs/lyx_torch/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 740, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: ctd_diseases_kb.identifier, ctd_diseases_kb.name
[SQL: INSERT INTO ctd_diseases_kb (uid, identifier, description, name) VALUES (?, ?, ?, ?)]
[parameters: [(0, 0, 0, '10p Deletion Syndrome (Partial)'), (1, 0, 1, 'Chromosome 10, 10p- Partial'), (2, 0, 1, 'Chromosome 10, monosomy 10p'), (3, 0, 1, 'Chromosome 10, Partial Deletion (short arm)'), (4, 0, 1, 'Monosomy 10p'), (5, 1, 0, '13q deletion syndrome'), (6, 1, 1, 'Chromosome 13q deletion'), (7, 1, 1, 'Chromosome 13q deletion syndrome')  ... displaying 10 of 90757 total bound parameter sets ...  (90755, 13297, 1, 'Phycomycosis'), (90756, 13297, 1, 'Zygomycoses')]]
(Background on this error at: https://sqlalche.me/e/20/gkpj)

sg-wbi commented 5 days ago

Thank you for trying out BELB.

From the error it looks like the problem is ctd_diseases_kb:

sqlite3.IntegrityError: UNIQUE constraint failed: ctd_diseases_kb.identifier, ctd_diseases_kb.name

Is it possible that this is not the first time you have run the script?

IIRC there's no mechanism to skip creating a KB if it's already there, which would explain the error: the script is trying to add duplicate data.

Can you try pointing --dir at a different directory instead of "."?

TODO: Add a check for existing KB here
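
As an illustration of the kind of check the TODO has in mind, here is a minimal sketch using only the standard sqlite3 module. It is not BELB code: the helper name kb_table_is_populated is hypothetical, and the table layout is only inferred from the INSERT statement and the UNIQUE constraint named in the error message above.

```python
import sqlite3


def kb_table_is_populated(conn: sqlite3.Connection, table: str) -> bool:
    """Return True if `table` exists and already holds at least one row.

    Hypothetical helper, not part of BELB.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if exists is None:
        return False
    # Table names cannot be bound as parameters, so quote the identifier by hand.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT 1 FROM {quoted} LIMIT 1").fetchone() is not None


conn = sqlite3.connect(":memory:")
before = kb_table_is_populated(conn, "ctd_diseases_kb")  # table does not exist yet

# Assumed schema, reconstructed from the error message in this thread.
conn.execute(
    "CREATE TABLE ctd_diseases_kb ("
    "uid INTEGER, identifier INTEGER, description INTEGER, name TEXT, "
    "UNIQUE (identifier, name))"
)
conn.execute("INSERT INTO ctd_diseases_kb VALUES (0, 0, 0, 'Monosomy 10p')")
after = kb_table_is_populated(conn, "ctd_diseases_kb")  # table now has a row
```

A build script could call such a helper before populating a table and skip (or drop and rebuild) the KB when it returns True. Which of the two is right for BELB is a design decision this sketch does not settle.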

droidlyx commented 5 days ago

But even after I change the folder, or delete and reinstall belb, the error still persists. It seems that the error is in the populate_table step.

droidlyx commented 5 days ago

The schema sets a unique constraint on the combination of identifier and name, but the rows passed to the populate_table function include entries that share both the same name and the same identifier. I see there is a drop_duplicates function in kb.py, but it is never executed; maybe it should be? Yes, the build runs successfully after I set the dedup parameter of the to_database function in kb.py to True. I don't know whether this is intended, but it is set to False by default.
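
The failure mode described here can be reproduced outside BELB with a few lines of standard-library Python. This is only a sketch under assumptions: the table kb, the sample rows, and the dedup helper are made up for illustration, and the helper merely mimics what dropping duplicates on (identifier, name) would do. It is not BELB's own drop_duplicates.

```python
import sqlite3

# Made-up rows: uid 0 and uid 1 share the same (identifier, name) pair.
rows = [
    {"uid": 0, "identifier": 0, "description": 0, "name": "Monosomy 10p"},
    {"uid": 1, "identifier": 0, "description": 1, "name": "Monosomy 10p"},
    {"uid": 2, "identifier": 1, "description": 0, "name": "13q deletion syndrome"},
]

INSERT = (
    "INSERT INTO kb (uid, identifier, description, name) "
    "VALUES (:uid, :identifier, :description, :name)"
)


def dedup(records, keys=("identifier", "name")):
    """Keep the first record for each combination of `keys` (hypothetical helper)."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kb (uid INTEGER PRIMARY KEY, identifier INTEGER, "
    "description INTEGER, name TEXT, UNIQUE (identifier, name))"
)
conn.commit()

try:
    conn.executemany(INSERT, rows)  # the second row violates the constraint
    failed = False
except sqlite3.IntegrityError:
    conn.rollback()  # discard the partially inserted batch
    failed = True

conn.executemany(INSERT, dedup(rows))  # succeeds once duplicates are dropped
conn.commit()
inserted = conn.execute("SELECT COUNT(*) FROM kb").fetchone()[0]
```

Note that keeping the first record silently discards the later one, including its description value, so whether this is acceptable depends on what the duplicated entries mean in the source data.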

droidlyx commented 5 days ago

Wait, for NCBI Gene the code says it cannot perform deduplication when reading data in chunks (i.e. chunksize > 0), so the duplicates remain and the UNIQUE constraint failed error is raised again.
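
One possible way around the chunking limitation, sketched here under assumptions rather than taken from BELB, is to let SQLite enforce uniqueness itself via its INSERT OR IGNORE conflict clause. A duplicate that arrives in a later chunk is then skipped instead of aborting the whole batch. The table kb, the chunked helper, and the sample rows are hypothetical, and the clause is SQLite-specific, so it would not carry over unchanged to other database backends.

```python
import sqlite3


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kb (uid INTEGER PRIMARY KEY, identifier INTEGER, "
    "name TEXT, UNIQUE (identifier, name))"
)

# Made-up rows: uid 2 repeats uid 0 and uid 4 repeats uid 1,
# each time in a different chunk than the original.
rows = [(0, 0, "A"), (1, 0, "B"), (2, 0, "A"), (3, 1, "A"), (4, 0, "B")]

for chunk in chunked(rows, 2):
    # Rows that would violate the UNIQUE constraint are skipped silently.
    conn.executemany(
        "INSERT OR IGNORE INTO kb (uid, identifier, name) VALUES (?, ?, ?)",
        chunk,
    )
conn.commit()

kept = [row[0] for row in conn.execute("SELECT uid FROM kb ORDER BY uid")]
```

This keeps memory use independent of the number of chunks, at the cost of dropping duplicates without any report. Whether that is acceptable for NCBI Gene is a question for the maintainers.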