piskvorky / sim-shootout

Code for "Performance shootout between nearest-neighbour libraries": http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro
MIT License

Add pageid to process_article #3

lukaselmer closed 9 years ago

lukaselmer commented 9 years ago

https://github.com/piskvorky/gensim/commit/6783b813408acc4e04ebe0603192c0d76508b048#diff-eece52d95c280dabe57c803c95d6bb96 introduced an additional return value, pageid, which breaks this code ("Too many values to unpack").
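To illustrate the failure mode, here is a minimal sketch. It does not call gensim; `old_process_article`, `new_process_article` and `unpack_result` are hypothetical stand-ins that only mimic a function whose return value grows from two elements to three. It is written for Python 3, whereas the logs in this thread come from Python 2.

```python
def old_process_article(args):
    """Hypothetical stand-in for the older behaviour: returns 2 values."""
    text, title = args
    return text.lower().split(), title


def new_process_article(args):
    """Hypothetical stand-in for the newer behaviour: returns 3 values."""
    text, title, pageid = args
    return text.lower().split(), title, pageid


def unpack_result(result):
    """Accept both (tokens, title) and (tokens, title, pageid)."""
    tokens, title = result[0], result[1]
    pageid = result[2] if len(result) > 2 else None
    return tokens, title, pageid


# Unpacking three values into two names reproduces the reported error.
failure = None
try:
    tokens, title = new_process_article(("Some Text", "Example", "12"))
except ValueError as err:
    failure = str(err)

# The tolerant helper works with either shape.
old_style = unpack_result(old_process_article(("Some Text", "Example")))
new_style = unpack_result(new_process_article(("Some Text", "Example", "12")))
```

Indexing instead of fixed-arity unpacking is only one option; simply adding a third name on the left-hand side is enough if only the newer return shape needs to be supported.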

piskvorky commented 9 years ago

How have you tested this? Is everything working? Sorry, this repo doesn't have tests; it was a one-off project, so I have to rely on your thoroughness :)

lukaselmer commented 9 years ago

> How have you tested this?

I installed the latest version of gensim and ran the contents of https://github.com/piskvorky/sim-shootout/blob/master/run_all.sh. Then pool.imap didn't work ("Too many values to unpack"), and I noticed that gensim had changed the number of returned values from 2 to 3.

> Is everything working?

I don't know yet, sorry. The script is still running. There seem to be many invalid lines though, e.g. "INFO : invalid line at title List of municipalities in Espírito Santo".

piskvorky commented 9 years ago

Hmm, I don't think that should be happening, something must be wrong... Could it be some non-UTF-8 characters raising exceptions?
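One way to check that hypothesis is sketched below. This is not the fix that ended up in the PR (the thread does not show it); `decode_or_report` is a hypothetical helper, and the sample title is simply the one from the log message, encoded two different ways. Python 3 syntax is assumed.

```python
def decode_or_report(raw):
    """Return (text, error); error is None when raw is valid UTF-8."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as err:
        # Fall back to a lossy decode so the caller still gets a string.
        return raw.decode("utf-8", errors="replace"), err


# The same title as valid UTF-8 bytes and as Latin-1 bytes.
title_utf8 = "Espírito Santo".encode("utf-8")
title_latin1 = "Espírito Santo".encode("latin-1")

good_text, good_err = decode_or_report(title_utf8)
bad_text, bad_err = decode_or_report(title_latin1)
```

If titles like this one decode cleanly as UTF-8, the exception is probably coming from somewhere else, such as the changed return value.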

lukaselmer commented 9 years ago

Hmm, ok. Yes, looks like it. I'll look into it, thanks.

piskvorky commented 9 years ago

Thanks! I assume this fixes the "invalid line" INFO lines?

lukaselmer commented 9 years ago

Exactly. I will notify you as soon as the script finishes.

lukaselmer commented 9 years ago

I finally got the exception described here: https://groups.google.com/forum/#!msg/gensim/IJVpAtshWEA/jnMCsqx_sb8J

I will try to solve it and include it in the PR.

piskvorky commented 9 years ago

That will be great. Thanks for taking the time to clean this up, Lukas!

lukaselmer commented 9 years ago

Couldn't have done it if you hadn't provided the code in the first place. Thanks! :+1:

lukaselmer commented 9 years ago

It's looking good at the moment, but it'll take some time until it finishes. I'll keep you posted.

./prepare_shootout.py ./data/enwiki-latest-pages-articles.xml.bz2 ./data
2015-06-12 13:45:03,377 : INFO : running ./prepare_shootout.py ./data/enwiki-latest-pages-articles.xml.bz2 ./data
2015-06-12 13:45:03,381 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-06-12 13:46:23,876 : INFO : adding document #10000 to Dictionary(432846 unique tokens: [u'biennials', u'tripolitan', u'oblocutor', u'woode', u'maderista']...)
2015-06-12 13:47:34,274 : INFO : adding document #20000 to Dictionary(613459 unique tokens: [u'biennials', u'tripolitan', u'oblocutor', u'shatzky', u'woode']...)
2015-06-12 13:48:29,965 : INFO : adding document #30000 to Dictionary(745897 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:49:17,699 : INFO : adding document #40000 to Dictionary(863174 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:49:50,833 : INFO : adding document #50000 to Dictionary(940710 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:50:12,595 : INFO : adding document #60000 to Dictionary(958082 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:50:30,817 : INFO : adding document #70000 to Dictionary(975116 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:50:47,549 : INFO : adding document #80000 to Dictionary(989726 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:51:26,818 : INFO : adding document #90000 to Dictionary(1069301 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:52:11,558 : INFO : adding document #100000 to Dictionary(1164317 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:52:51,811 : INFO : adding document #110000 to Dictionary(1251317 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:53:29,382 : INFO : adding document #120000 to Dictionary(1328898 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'cyclophophamide']...)
2015-06-12 13:54:04,843 : INFO : adding document #130000 to Dictionary(1395108 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'cyclophophamide']...)
2015-06-12 13:54:43,592 : INFO : adding document #140000 to Dictionary(1470445 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:55:20,452 : INFO : adding document #150000 to Dictionary(1553531 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:55:55,882 : INFO : adding document #160000 to Dictionary(1625392 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:56:26,887 : INFO : adding document #170000 to Dictionary(1684152 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:56:58,508 : INFO : adding document #180000 to Dictionary(1734059 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:57:28,090 : INFO : adding document #190000 to Dictionary(1788162 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:57:56,621 : INFO : adding document #200000 to Dictionary(1842811 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:58:26,796 : INFO : adding document #210000 to Dictionary(1891262 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:58:56,738 : INFO : adding document #220000 to Dictionary(1934119 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:59:24,603 : INFO : adding document #230000 to Dictionary(1982979 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:59:55,508 : INFO : discarding 31792 tokens: [(u'drywells', 1), (u'ziegelheim', 1), (u'barakjay', 1), (u'marangell', 1), (u'fuentelencina', 1), (u'psarakia', 1), (u'nambaudus', 1), (u'plumosity', 1), (u'barij', 1), (u'hiccou', 1)]...
2015-06-12 13:59:55,508 : INFO : keeping 2000000 tokens which were in no less than 0 and no more than 240000 (=100.0%) documents
2015-06-12 14:00:00,236 : INFO : resulting dictionary: Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 14:00:00,282 : INFO : adding document #240000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 14:00:34,517 : INFO : discarding 40486 tokens: [(u'beatsuite', 1), (u'fu\u017ei', 1), (u'gyeongi', 1), (u'keser\xfc', 1), (u'tetyukhinsky', 1), (u't\xf6teberg', 1), (u'mccalligog', 1), (u'drammakins', 1), (u'dreadnort', 1), (u'cdci', 1)]...
2015-06-12 14:00:34,517 : INFO : keeping 2000000 tokens which were in no less than 0 and no more than 250000 (=100.0%) documents
2015-06-12 14:00:38,684 : INFO : resulting dictionary: Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 14:00:38,734 : INFO : adding document #250000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)

lukaselmer commented 9 years ago

Seems like it worked!

2015-06-13 01:27:05,819 : INFO : PROGRESS: saving document #3830000
2015-06-13 01:27:09,248 : INFO : PROGRESS: saving document #3831000
2015-06-13 01:27:12,572 : INFO : PROGRESS: saving document #3832000
2015-06-13 01:27:15,919 : INFO : PROGRESS: saving document #3833000
2015-06-13 01:27:16,531 : INFO : saved 3833255x500 matrix, density=100.000% (1916624364/1916627500)
2015-06-13 01:27:16,539 : INFO : saving MmCorpus index to ./data/lsi_vectors.mm.index
2015-06-13 01:27:17,709 : INFO : finished running prepare_shootout.py

piskvorky commented 9 years ago

Much obliged, merging :+1: