lukaselmer closed 9 years ago
How have you tested this, is everything working? Sorry this repo doesn't have tests, it was a one-off project, so I have to rely on your thoroughness :)
How have you tested this?
I installed the latest version of gensim and ran the contents of https://github.com/piskvorky/sim-shootout/blob/master/run_all.sh. Then pool.imap didn't work ("Too many values to unpack"), and I noticed that gensim changed the number of return values from 2 to 3.
Is everything working?
I don't know yet, sorry. The script is still running. There seem to be many invalid lines though, e.g. "INFO : invalid line at title List of municipalities in Espírito Santo".
Hmm, I don't think that should be happening, something must be wrong... Could it be some non-utf8 characters raising exceptions?
Hmm, ok. Yes, looks like it. I'll look into it, thanks.
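For anyone reading along, here is a minimal sketch of the kind of defensive decoding that would avoid exceptions from non-utf8 bytes. It is only an illustration of the suspicion above: `to_text` is a hypothetical helper, not something from this repo or from gensim, and the thread does not confirm that this is what the eventual fix looked like.

```python
def to_text(value, encoding="utf8"):
    """Return `value` as text; undecodable bytes become U+FFFD instead of raising.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value


# Valid utf8 bytes decode normally.
good = to_text(b"Esp\xc3\xadrito Santo")

# A lone Latin-1 byte (0xED) is not valid utf8; it is replaced rather than raising.
bad = to_text(b"Esp\xedrito Santo")
```

With `errors="replace"` a bad title still gets through as readable text, so a single odd byte would not turn a whole article into an "invalid line".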
Thanks! I assume this fixes the "invalid line" INFO lines?
Exactly. I will notify you as soon as the script finishes.
I finally got the exception described here: https://groups.google.com/forum/#!msg/gensim/IJVpAtshWEA/jnMCsqx_sb8J
I will try to solve it and include it in the PR.
That will be great. Thanks for taking the time to clean this up, Lukas!
Couldn't do it if you didn't provide the code in the first place. Thanks! :+1:
It's looking good at the moment, but it'll take some time until it finishes. I'll keep you posted.
./prepare_shootout.py ./data/enwiki-latest-pages-articles.xml.bz2 ./data
2015-06-12 13:45:03,377 : INFO : running ./prepare_shootout.py ./data/enwiki-latest-pages-articles.xml.bz2 ./data
2015-06-12 13:45:03,381 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-06-12 13:46:23,876 : INFO : adding document #10000 to Dictionary(432846 unique tokens: [u'biennials', u'tripolitan', u'oblocutor', u'woode', u'maderista']...)
2015-06-12 13:47:34,274 : INFO : adding document #20000 to Dictionary(613459 unique tokens: [u'biennials', u'tripolitan', u'oblocutor', u'shatzky', u'woode']...)
2015-06-12 13:48:29,965 : INFO : adding document #30000 to Dictionary(745897 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:49:17,699 : INFO : adding document #40000 to Dictionary(863174 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:49:50,833 : INFO : adding document #50000 to Dictionary(940710 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:50:12,595 : INFO : adding document #60000 to Dictionary(958082 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:50:30,817 : INFO : adding document #70000 to Dictionary(975116 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:50:47,549 : INFO : adding document #80000 to Dictionary(989726 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:51:26,818 : INFO : adding document #90000 to Dictionary(1069301 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:52:11,558 : INFO : adding document #100000 to Dictionary(1164317 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:52:51,811 : INFO : adding document #110000 to Dictionary(1251317 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'dulcitone']...)
2015-06-12 13:53:29,382 : INFO : adding document #120000 to Dictionary(1328898 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'cyclophophamide']...)
2015-06-12 13:54:04,843 : INFO : adding document #130000 to Dictionary(1395108 unique tokens: [u'tripolitan', u'verplank', u'oblocutor', u'shatzky', u'cyclophophamide']...)
2015-06-12 13:54:43,592 : INFO : adding document #140000 to Dictionary(1470445 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:55:20,452 : INFO : adding document #150000 to Dictionary(1553531 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:55:55,882 : INFO : adding document #160000 to Dictionary(1625392 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:56:26,887 : INFO : adding document #170000 to Dictionary(1684152 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:56:58,508 : INFO : adding document #180000 to Dictionary(1734059 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:57:28,090 : INFO : adding document #190000 to Dictionary(1788162 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:57:56,621 : INFO : adding document #200000 to Dictionary(1842811 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:58:26,796 : INFO : adding document #210000 to Dictionary(1891262 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:58:56,738 : INFO : adding document #220000 to Dictionary(1934119 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:59:24,603 : INFO : adding document #230000 to Dictionary(1982979 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 13:59:55,508 : INFO : discarding 31792 tokens: [(u'drywells', 1), (u'ziegelheim', 1), (u'barakjay', 1), (u'marangell', 1), (u'fuentelencina', 1), (u'psarakia', 1), (u'nambaudus', 1), (u'plumosity', 1), (u'barij', 1), (u'hiccou', 1)]...
2015-06-12 13:59:55,508 : INFO : keeping 2000000 tokens which were in no less than 0 and no more than 240000 (=100.0%) documents
2015-06-12 14:00:00,236 : INFO : resulting dictionary: Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 14:00:00,282 : INFO : adding document #240000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 14:00:34,517 : INFO : discarding 40486 tokens: [(u'beatsuite', 1), (u'fu\u017ei', 1), (u'gyeongi', 1), (u'keser\xfc', 1), (u'tetyukhinsky', 1), (u't\xf6teberg', 1), (u'mccalligog', 1), (u'drammakins', 1), (u'dreadnort', 1), (u'cdci', 1)]...
2015-06-12 14:00:34,517 : INFO : keeping 2000000 tokens which were in no less than 0 and no more than 250000 (=100.0%) documents
2015-06-12 14:00:38,684 : INFO : resulting dictionary: Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
2015-06-12 14:00:38,734 : INFO : adding document #250000 to Dictionary(2000000 unique tokens: [u'tripolitan', u'ftdna', u'soestdijk', u'phintella', u'billycorgan']...)
Seems like it worked!
2015-06-13 01:27:05,819 : INFO : PROGRESS: saving document #3830000
2015-06-13 01:27:09,248 : INFO : PROGRESS: saving document #3831000
2015-06-13 01:27:12,572 : INFO : PROGRESS: saving document #3832000
2015-06-13 01:27:15,919 : INFO : PROGRESS: saving document #3833000
2015-06-13 01:27:16,531 : INFO : saved 3833255x500 matrix, density=100.000% (1916624364/1916627500)
2015-06-13 01:27:16,539 : INFO : saving MmCorpus index to ./data/lsi_vectors.mm.index
2015-06-13 01:27:17,709 : INFO : finished running prepare_shootout.py
Much obliged, merging :+1:
https://github.com/piskvorky/gensim/commit/6783b813408acc4e04ebe0603192c0d76508b048#diff-eece52d95c280dabe57c803c95d6bb96 introduced an additional return value, pageid, which breaks this code ("Too many values to unpack").
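To illustrate the failure mode, here is a small sketch of unpacking that tolerates both the old two-value and the new three-value results. `process_article_stub` and `iter_tokens_and_titles` are hypothetical stand-ins written for this example; they are not the actual gensim worker or the code changed in this PR.

```python
def process_article_stub(args):
    """Hypothetical stand-in for the per-article worker mapped via pool.imap.

    It mimics the new behaviour by returning three values instead of two.
    """
    text, title, pageid = args
    return text.lower().split(), title, pageid


def iter_tokens_and_titles(results):
    """Yield (tokens, title) whether each result carries 2 or 3 values."""
    for result in results:
        # Indexing avoids "too many values to unpack" when a pageid is present.
        tokens, title = result[0], result[1]
        yield tokens, title


articles = [("Anarchism is a political philosophy", "Anarchism", "12")]
pairs = list(iter_tokens_and_titles(map(process_article_stub, articles)))

# The same loop still accepts the old two-value shape.
legacy = list(iter_tokens_and_titles([(["old", "style"], "Legacy title")]))
```

Writing `tokens, title = result` against the three-value result is what raises the error quoted above; taking the first two items (or unpacking all three explicitly) avoids it.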