mycognitive / ads_features

Provides common features for ads distribution.
http://drupal.org/project/ads_features

taxonomy_csv_import is too slow #30

Open kenorb opened 10 years ago

kenorb commented 10 years ago

After installation, change project.drupal.configure.locations.countries to GB in your build.properties and run: phing post-install-drupal. It's very slow, adding only a few items per second.
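
For reference, the relevant build.properties line (property name and value as given above) looks like:

  # build.properties: limit the location import to a single country
  project.drupal.configure.locations.countries = GB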

[exec] There were no extensions that could be enabled.                      [ok]
[exec] Command dispatch complete                                               [notice]

We need to check the performance (with a profiler) of ads_locations_ads_configure(). Also, the script doesn't print any message indicating that it's importing anything. We also need a simple command for importing via drush without first creating a JSON file, which makes things more complicated than they need to be when you just want to import a specific country on demand (without running the whole script).
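
A rough sketch of such a command, assuming a hypothetical ads_locations.drush.inc (the command name, callback and import helper are illustrative, not existing code):

  /**
   * Implements hook_drush_command().
   */
  function ads_locations_drush_command() {
    $items['ads-import-country'] = array(
      'description' => 'Import locations for one country, without the JSON file.',
      'arguments' => array(
        'country' => 'Two-letter country code, e.g. GB.',
      ),
    );
    return $items;
  }

  /**
   * Callback for the ads-import-country command.
   */
  function drush_ads_locations_ads_import_country($country = 'GB') {
    drush_print(dt('Importing locations for @country ...', array('@country' => $country)));
    // Hypothetical helper that reads the country CSV and creates the terms,
    // printing progress per line.
    // _ads_locations_import_country_csv($country);
  }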

kenorb commented 10 years ago

It seems that for each item, the whole Drupal core is re-initialized:

 [exec] Initialized Drupal 7.31 root directory at /home/travis/ads/src          [notice]
 [exec] Initialized Drupal site default at sites/default                        [notice]
 [exec] Line 85 of 92 processed: "AD", "Ordino", "Segudet"                   [ok]
 [exec] Command dispatch complete                                               [notice]
 [exec] /usr/bin/php  -d magic_quotes_gpc=Off -d magic_quotes_runtime=Off -d    [notice]
 [exec] magic_quotes_sybase=Off /usr/share/php/drush/drush.php
 [exec] --php=/usr/bin/php --php-options=' -d magic_quotes_gpc=Off -d
 [exec] magic_quotes_runtime=Off -d magic_quotes_sybase=Off'  --backend=2
 [exec] --verbose --yes --root=/home/travis/ads/src --uri=http://default
 [exec] batch-process 24 24 2>&1
 [exec] Initialized Drupal 7.31 root directory at /home/travis/ads/src          [notice]
 [exec] Initialized Drupal site default at sites/default                        [notice]
 [exec] Line 86 of 92 processed: "AD", "Sant Julia De Loria", "Sertes"       [ok]
 [exec] Command dispatch complete                                               [notice]
 [exec] /usr/bin/php  -d magic_quotes_gpc=Off -d magic_quotes_runtime=Off -d    [notice]
 [exec] magic_quotes_sybase=Off /usr/share/php/drush/drush.php
 [exec] --php=/usr/bin/php --php-options=' -d magic_quotes_gpc=Off -d
 [exec] magic_quotes_runtime=Off -d magic_quotes_sybase=Off'  --backend=2
 [exec] --verbose --yes --root=/home/travis/ads/src --uri=http://default
 [exec] batch-process 24 24 2>&1
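
One way around the per-line re-initialization is to run the whole import inside a single bootstrapped process, e.g. a script executed via drush scr, so core is initialized once. A minimal sketch (the file name, vocabulary id and CSV path are placeholders, not existing code):

  // import_locations.php - run once inside a bootstrapped site:
  //   drush scr import_locations.php
  $vid = 1;                                  // assumed locations vocabulary id
  $handle = fopen('locations-GB.csv', 'r');  // assumed CSV path
  $parents = array();                        // "/GB/County/Town" path => tid cache
  while (($row = fgetcsv($handle)) !== FALSE) {
    $parent = 0;  // 0 = top level of the vocabulary
    $path = '';
    foreach ($row as $name) {
      $path .= '/' . $name;
      if (!isset($parents[$path])) {
        // Plain function call; no fresh drush batch-process sub-process per line.
        $term = (object) array('vid' => $vid, 'name' => $name, 'parent' => $parent);
        taxonomy_term_save($term);
        $parents[$path] = $term->tid;
      }
      $parent = $parents[$path];
    }
  }
  fclose($handle);
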
nseam commented 10 years ago

Accepted

kenorb commented 10 years ago

Import can be performed manually by:

$ phing configure
$ drush --verbose --debug --yes ads-configure `drush eval "print sys_get_temp_dir() . '/ads-configure.properties.json'"`

Line 200 of 17705 processed: "GB", "Buckinghamshire", "Adstock" [79.07 sec, 9.5 MB]    [ok]
Line 1000 of 17705 processed: "GB", "Strabane", "Ballymagorry" [353.21 sec, 13.64 MB]  [ok]

Import runs at ~2.83 lines/sec on OS X, 2.3 GHz Intel Core i7 (PHP 5.5.3).

All of GB (17705 lines) can take around 14h; all files (3034125 lines) around 14 days.

nseam commented 10 years ago

Profile output:

[profiler output attached as screenshots; not recoverable as text]

  1. Possible solution:
    • Load the CSV file into a temporary table.
    • Custom SQL code with a single INSERT for every column of each CSV row, performing a tree-uniqueness check (there shouldn't be duplicates within a tree of terms, i.e. for GB locations we need just a single "GB" term. For GB -> A -> B and GB -> A -> C we need just one A, because both are inside GB. For GB -> A -> B and AU -> A -> B we need two A's and two B's, because they have at least one different parent). The problem is that we need a non-cached SELECT subquery against the same taxonomy_term_data table to check tree uniqueness for every inserted record. (See the sketch after this list.)
  2. Possible solution:
    • Re-export the CSV files as MySQL INSERTs with very large tids, e.g. 1000000+. Then, when importing:
    • Save the current MAX tid.
    • After the import, UPDATE all terms with tid >= 1000000, setting tid to: MAX tid + current tid - 1000000. (Also sketched below.)
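
Rough sketches of both ideas against the stock Drupal 7 taxonomy schema (the helper names and the 1000000 offset are illustrative; a real run would also have to fix any other tables that reference tids):

  // Solution 1: reuse a term only when both its name and its parent match
  // (the non-cached SELECT mentioned above).
  function _ads_locations_find_child_term($vid, $name, $parent_tid) {
    return db_query(
      "SELECT td.tid
       FROM {taxonomy_term_data} td
       INNER JOIN {taxonomy_term_hierarchy} th ON th.tid = td.tid
       WHERE td.vid = :vid AND td.name = :name AND th.parent = :parent",
      array(':vid' => $vid, ':name' => $name, ':parent' => $parent_tid)
    )->fetchField();
  }

  // Solution 2: remap the artificially large tids after the bulk INSERT.
  // Uses MAX+1 so the first remapped tid lands just above the old range.
  function _ads_locations_remap_tids($offset = 1000000) {
    $base = 1 + (int) db_query(
      'SELECT MAX(tid) FROM {taxonomy_term_data} WHERE tid < :offset',
      array(':offset' => $offset)
    )->fetchField();
    $targets = array(
      array('taxonomy_term_data', 'tid'),
      array('taxonomy_term_hierarchy', 'tid'),
      array('taxonomy_term_hierarchy', 'parent'),
    );
    foreach ($targets as $target) {
      list($table, $field) = $target;
      db_update($table)
        ->expression($field, $field . ' - :offset + :base',
          array(':offset' => $offset, ':base' => $base))
        ->condition($field, $offset, '>=')
        ->execute();
    }
  }
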
kenorb commented 10 years ago

Possible modules slowing it down: shs, pathauto, token. After disabling them, the import is sped up by ~16x. We need to disable them before the import; I'm not sure which one slows it down the most.
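
A minimal sketch of wrapping the import that way, using core's module_disable()/module_enable() (the import call itself is a placeholder):

  // Temporarily disable the modules that slow the import down.
  $offenders = array('shs', 'pathauto', 'token');
  module_disable($offenders, FALSE);  // FALSE: leave dependent modules alone.

  // ... run the CSV import here ...

  module_enable($offenders, FALSE);   // FALSE: dependencies are already met.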