philipmat / discogs-xml2db

Imports the discogs.com monthly XML dumps into databases
Apache License 2.0

(WIP) Even faster .NET Parser #127

Open philipmat opened 4 years ago

philipmat commented 4 years ago
| File | Record Count | Python | C# v1 | C# v2 | Speedup of C# v2 (vs. Python / vs. C# v1) |
| --- | --- | --- | --- | --- | --- |
| discogs_20200806_artists.xml.gz | 7,046,615 | 6:22 | 2:35 | 0:28 | 13x / 5x |
| discogs_20200806_labels.xml.gz | 1,571,873 | 1:15 | 0:22 | 0:05 | 15x / 4x |
| discogs_20200806_masters.xml.gz | 1,734,371 | 3:56 | 1:57 | 0:31 | 7x / 4x |
| discogs_20200806_releases.xml.gz | 12,867,980 | 1:45:16 | 42:38 | 15:17 | 7x / 3x |
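
For readers who want to check the speedup figures, here is a minimal sketch that recomputes them from the timings listed above. The helper names `to_seconds` and `speedup` are made up for this illustration and are not part of the project. Since the durations are only given to the second, the ratios come out close to the quoted figures rather than exactly equal.

```python
# Timings copied from the table above: (Python, C# v1, C# v2).
TIMINGS = {
    "artists": ("6:22", "2:35", "0:28"),
    "labels": ("1:15", "0:22", "0:05"),
    "masters": ("3:56", "1:57", "0:31"),
    "releases": ("1:45:16", "42:38", "15:17"),
}


def to_seconds(duration):
    """Convert 'm:ss' or 'h:mm:ss' into a number of seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(slow, fast):
    """Return how many times faster `fast` is than `slow`."""
    return to_seconds(slow) / to_seconds(fast)


if __name__ == "__main__":
    for name, (python_time, v1_time, v2_time) in TIMINGS.items():
        vs_python = speedup(python_time, v2_time)
        vs_v1 = speedup(v1_time, v2_time)
        print(f"{name}: {vs_python:.1f}x vs. Python, {vs_v1:.1f}x vs. C# v1")
```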

That's just a part of the story. Parallel processing is the other.

[Screenshot: Screen Shot 2020-09-13 at 11 36 31 AM]

As the screenshot above shows, the .NET version can process multiple files in parallel, achieving a 20x speedup over the Python version in this case.
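
To illustrate the general idea of handling several dump files at once, here is a rough Python sketch. It is not the .NET parser's implementation and not code from this repository: the functions `count_records` and `count_all` are hypothetical, and the sketch assumes each dump is a gzipped XML file whose records are the direct children of the root element. It only counts records, with one worker per file.

```python
import gzip
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor


def count_records(path):
    """Count the direct children of the root element in a gzipped XML file."""
    count = 0
    depth = 0
    with gzip.open(path, "rb") as stream:
        for event, elem in ET.iterparse(stream, events=("start", "end")):
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1:
                    # A top-level record just closed: count it and
                    # drop its contents to keep memory use low.
                    count += 1
                    elem.clear()
    return count


def count_all(paths, executor_cls=ProcessPoolExecutor):
    """Count records in every file, handing each file to its own worker."""
    paths = list(paths)
    with executor_cls() as pool:
        return dict(zip(paths, pool.map(count_records, paths)))
```

When using the default process pool, call `count_all` from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. The total run time is then bounded by the largest file (here, releases) rather than by the sum of all four.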

MuleaneEve commented 4 years ago

Very cool to go from a process that used to take almost 2 hours to barely 15 minutes :)