philipmat / discogs-xml2db

Imports the discogs.com monthly XML dumps into databases
Apache License 2.0
207 stars 76 forks source link

Add MySQL support to speedup #98

Closed berz closed 4 years ago

berz commented 4 years ago

This PR adds support for importing the discogs dumps into MySQL.

To do this I first had to refactor the code a bit to decouple csv exporting code from postgres importing code. PostgreSQL specific code was moved into a /postgresql directory. This makes it easier to add support for other databases in the future.

The new code for importing the csv's into mysql can be found in /mysql.

I also updated the README to reflect these changes.

berz commented 4 years ago

I tested the import on both PostgreSQL and MySQL with the current code and both work.

Postgres does throw some errors while applying constraints and building indices to the full dump for may 2020, but this seems to be unrelated to my changes.

$ python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
ERROR:  insert or update on table "release_artist" violates foreign key constraint "release_artist_fk_artist"
    DETAIL:  Key (artist_id)=(0) is not present in table "artist".
ERROR:  insert or update on table "release_track_artist" violates foreign key constraint "release_track_artist_fk_artist"
    DETAIL:  Key (artist_id)=(118760) is not present in table "artist".
ERROR:  insert or update on table "release_company" violates foreign key constraint "release_company_fk_company"
    DETAIL:  Key (company_id)=(1363089) is not present in table "label".

$ python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql
ERROR:  index row size 5376 exceeds maximum 2712 for index "release_track_idx_title"
    HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
    Consider a function index of an MD5 hash of the value, or use full text indexing.
berz commented 4 years ago

Sorry for the noise, seems I should not be rebasing my commits after sending a PR.

ijabz commented 4 years ago

Just trying it now

Just running: nohup python3 speedup/exporter.py --bz2 --debug dumps csv-dir --export artist --export release --export label --export master &

I did have a few issues trying to run exporter.py, my first very similar attempts failed either giving error or just doing nothing. I wonder if it could default to exporting all four files so you don't have to specify them ?

ijabz commented 4 years ago

Now running import.csv - it is working but three points

  1. You put the instructions for installing pv after starting the importer, so I had to stop the importer, install pv, drop tables, create tables and try again, would be better if instructions came first.
  2. But I am not seeing the pv output anyway (im using a version of Ubuntu provided by AWS)
  3. More importantly if I run from the command line I see a line for each table imported, but if I run using nohup there is nothing in the nohup file why would this be. This is important because I always run this long running process with nohup incase my connection to the remote machine I run this on drops (in contrast there is no such problem with no output during the exporter stage)
berz commented 4 years ago

Thanks for having a look. Could you tell me which errors you are getting?

I too dislike the --export arguments for speedup/exporter.py. I kept it the way it was because it was not essential for adding mysql support. I would prefer to just pass paths, for example

# to export only the artist file
python3 speedup/exporter.py --bz2 dumps/artist.xml.gz csv-dir
# to export all dump files
python3 speedup/exporter.py --bz2 dumps/*.xml.gz csv-dir

If that's ok for you I should be able to make it work that way.

Is the output shown correctly if pv is not installed on the machine? pv is intended to be used to provide realtime progress info so it could cause issues when the output is written to a log file instead of shown on screen in real time. We might want a switch to turn it on instead of using it if it is installed. I don't have time right now to look into this but the problem could be caused by pv writing to STDERR and nohup only logging STDOUT (or the inverse) or pv using escape characters.

I will try to find some time in the following days to make these changes, fix the readme file, and look into the pv output issue.

ijabz commented 4 years ago

Done as works and is a good improvement, the minor issues I have pointed out can be done in another pull request.