roskakori / pimdb

build a database from IMDb datasets
https://pypi.org/project/pimdb/
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

TitleAkas - value too long #18

Closed chassing closed 4 years ago

chassing commented 4 years ago

similar issue to #13 but this time on TitleAkas:

...
sqlalchemy.exc.DataError: (psycopg2.DataError) value too long for type character varying(512)

[SQL: INSERT INTO "TitleAkas" ("titleId", ordering, title, region, language, types, attributes, "isOriginalTitle") VALUES (%(titleId_m0)s, %(ordering_m0)s, %(title_m0)s, %(region_m0)s, %(language
_m0)s, %(types_m0)s, %(attributes_m0)s, %(isOriginalTitle_m0)s), (%(titleId_m1)s, %(ordering_m1)s, %(title_m1)s, %(region_m1)s, %(language_m1)s, %(types_m1)s, %(attributes_m1)s, %(isOriginalTitle
_m1)s), (%(titleId_m2)s,
....
roskakori commented 4 years ago

Oh well, didn't want to do it on my laptop but I think I got all of the issues now. Running a postgres test import of the current datasets at the moment. This will take a while though...

Apparently some porn producer decided to paste the description as alternative title, resulting in a length of 831 (tconst=tt9109230).

roskakori commented 4 years ago

The good news: transfer with postgres works now. The fixed version is on PyPI.

The bad news is the performance, especially for title.principals.tsv.gz. On my oldish 2016 MacBook Pro it eventually says:

INFO:pimdb: added 38756173 rows in 493:50.602 (1307.98 rows per second)

That's more than 8 hours... 😞

If you really want to go through this, remember: unlike with SQLite with Postgres you can run multiple "pimdb transfer" in parallel, each time specifying a different dataset. Specifying "pimdb transfer all" transfers all with on command but does so sequentially, which takes a few hours more for the whole process to finish.