roskakori / pimdb

build a database from IMDb datasets
https://pypi.org/project/pimdb/
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

Fix error: Column length too big #39

Closed RealTehreal closed 6 months ago

RealTehreal commented 1 year ago

Hey folks,

I tried to transfer the IMDB TSV files to a fresh MariaDB database and ran into the following error. Any help with resolving the error is very appreciated. Thank you very much in advance.

$ pimdb transfer --database "mariadb://root@localhost/imdb?unix_socket=/opt/lampp/var/mysql/mysql.sock" all

Output:

INFO:pimdb:connecting to database mariadb://root@localhost/imdb?unix_socket=/opt/lampp/var/mysql/mysql.sock (other)
INFO:pimdb:creating imdb dataset tables
Traceback (most recent call last):
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/MySQLdb/cursors.py", line 209, in execute
    res = self._query(query)
  File "/usr/lib/python3/dist-packages/MySQLdb/cursors.py", line 315, in _query
    db.query(q)
  File "/usr/lib/python3/dist-packages/MySQLdb/connections.py", line 239, in query
    _mysql.connection.query(self, query)
MySQLdb._exceptions.OperationalError: (1074, "Column length too big for column 'directors' (max = 21844); use BLOB or TEXT instead")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/realtehreal/.local/bin/pimdb", line 8, in <module>
    sys.exit(main())
  File "/home/realtehreal/.local/lib/python3.10/site-packages/pimdb/command.py", line 295, in main
    sys.exit(exit_code_for())
  File "/home/realtehreal/.local/lib/python3.10/site-packages/pimdb/command.py", line 281, in exit_code_for
    command_class(parser, args).run()
  File "/home/realtehreal/.local/lib/python3.10/site-packages/pimdb/command.py", line 186, in __init__
    self._database.create_imdb_dataset_tables()
  File "/home/realtehreal/.local/lib/python3.10/site-packages/pimdb/database.py", line 539, in create_imdb_dataset_tables
    self.metadata.create_all()
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/sql/schema.py", line 4741, in create_all
    bind._run_ddl_visitor(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2979, in _run_ddl_visitor
    conn._run_ddl_visitor(visitorcallable, element, **kwargs)
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2016, in _run_ddl_visitor
    visitorcallable(self.dialect, self, **kwargs).traverse_single(element)
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/sql/visitors.py", line 483, in traverse_single
    return meth(obj, **kw)
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/sql/ddl.py", line 846, in visit_metadata
    self.traverse_single(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/sql/visitors.py", line 483, in traverse_single
    return meth(obj, **kw)
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/sql/ddl.py", line 890, in visit_table
    self.connection.execute(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1200, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/sql/ddl.py", line 77, in _execute_on_connection
    return connection._execute_ddl(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1290, in _execute_ddl
    ret = self._execute_context(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1748, in _execute_context
    self._handle_dbapi_exception(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 198, in raise_
    raise exception
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1705, in _execute_context
    self.dialect.do_execute(
  File "/home/realtehreal/.local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 681, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/MySQLdb/cursors.py", line 209, in execute
    res = self._query(query)
  File "/usr/lib/python3/dist-packages/MySQLdb/cursors.py", line 315, in _query
    db.query(q)
  File "/usr/lib/python3/dist-packages/MySQLdb/connections.py", line 239, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1074, "Column length too big for column 'directors' (max = 21844); use BLOB or TEXT instead")
[SQL: 
CREATE TABLE `TitleCrew` (
    tconst VARCHAR(12) NOT NULL, 
    directors VARCHAR(26623), 
    writers VARCHAR(26623), 
    PRIMARY KEY (tconst)
)COMMENT='IMDb dataset title.crew.tsv.gz'

]
(Background on this error at: http://sqlalche.me/e/14/e3q8)

The link at the end brings me nowhere. I took a look at pimdb's source code as well, but I couldn't find where those "26623"-values are defined.

Kind regards

RealTehreal commented 1 year ago

After diving through pimdb's code, I have to admit, that it's not optimal for use with MariaDB. I had to change some things to make it work. I'll see if it's possible to make the changes in a way, that other DBMS are not affected, and in this case, create a pull request.

roskakori commented 1 year ago

@RealTehreal Took me a while to take a look at this. The culprit seems to be the following limit:

Column length too big for column 'directors' (max = 21844); use BLOB or TEXT instead

One way to fix this would be to increase the limits in database.py, especially the one for the directors field.

(
            ImdbDataset.TITLE_CREW,
            [
                Column("tconst", String(_TCONST_LENGTH), nullable=False, primary_key=True),
                Column("directors", String((_NCONST_LENGTH + 1) * _CREW_COUNT - 1)),  # <-- FIXME: Increase _NCONST_LENGTH
                Column("writers", String((_NCONST_LENGTH + 1) * _CREW_COUNT - 1)),
            ],
        ),

But this might just move the error to another field.

SQLAlchemy does provide a Text type that in practice has no limit or a very high one. Replacing all the String(<SOME_LIMIT>) with simply Text() should solve this for good. Though in my experience some databases still have an implicit limit of about 4000 or 32000 characters.

Feel free to try this out with Text and if it work submit a pull request.

RealTehreal commented 1 year ago

I already tried to increase the limits, and just as mentioned, the issue moved from field to field. After hours of trial and error, I finally gave up on it.

Greets

roskakori commented 1 year ago

@RealTehreal Did you try replacing the String(...) with Text() too?

RealTehreal commented 1 year ago

@RealTehreal Did you try replacing the String(...) with Text() too?

I cannot tell anymore.

roskakori commented 1 year ago

@RealTehreal Did you try replacing the String(...) with Text() too?

I cannot tell anymore.

Alright, thanks. I guess I just change it to Text, test it with the databases that are easily accessible to me and then hope for the best concerning MariaDB.

roskakori commented 6 months ago

The char fields are now of type text, so there length is only limited by physical limits of the database.