Problems of generating Corpus file

idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Problems of generating Corpus file #23

Open zhq2009 opened 8 years ago

zhq2009 commented 8 years ago

Hello,

We are using prepare.sh to generate the corpus file, but the file it produces is empty. Could you please suggest how to solve this problem?

Thank you very much

dav009 commented 8 years ago

what language are you trying? can you paste the command you are running?

zhq2009 commented 8 years ago

Hello,

We are trying English Wikipedia. The command we are running is sudo sh prepare.sh en_US /mnt/data/. prepare.sh runs everything, including downloading files and compiling programs, so we are wondering whether we could get the executable programs directly. We also ran into compatibility problems, and the generated corpus file is empty.

Thank you very much
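Before starting a long training run, it can help to confirm that the corpus file really has content. The following is a minimal sketch using only the Python standard library; the function name `describe_corpus` and the demo file name are hypothetical and not part of wiki2vec.

```python
import os
import tempfile


def describe_corpus(path, sample_lines=3):
    """Return (size in bytes, first few non-empty lines) of a text corpus."""
    size = os.path.getsize(path)
    sample = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped:
                sample.append(stripped)
            if len(sample) >= sample_lines:
                break
    return size, sample


# Demo on a throwaway file; point demo_path at the real corpus instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "corpus")  # hypothetical file name
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("")  # simulate the empty corpus described above
    size, sample = describe_corpus(demo_path)
    if size == 0 or not sample:
        print("corpus is empty: an earlier step probably failed")
    else:
        print("corpus has", size, "bytes; first lines:", sample)
```

If the file is empty, running the commands from prepare.sh one at a time should show which step fails.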

zhq2009 commented 8 years ago

Hello,

We ran the commands in prepare.sh manually and generated the corpus file successfully. We are now training the model on the corpus file, and this is the output we got from the command:

...
Requirement already satisfied (use --upgrade to upgrade): requests in /usr/lib/python2.7/dist-packages (from smart-open>=1.2.1->gensim)
Cleaning up...
pid 13182's current affinity mask: ff
pid 13182's new affinity mask: ff

The program has stayed at this point for several hours, with CPU usage at 100%.

Is the program running correctly, and should we wait until we get the results?

Thank you very much

dav009 commented 8 years ago

ZH, depending on the corpus size, the number of dimensions and the method (skip-gram, CBOW), it can take a long time. For the settings of the shared models it usually took around 4 to 5 hours. My advice is to let it run for a few hours (at least 6).

Be aware that if you installed gensim manually, it might not be using all the cores. The script provided in this repo installs it such that it uses as many cores as possible.

The first stage of word2vec (gathering the vocabulary) only uses a single core, though. The batches of matrix factorization that follow are done in parallel using as many cores as possible.
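The "affinity mask: ff" lines in the log earlier in this thread appear to come from `taskset`, which prints the set of CPUs a process is allowed to run on as a hexadecimal bit mask. A small sketch for decoding such a mask follows; the helper name `cpus_in_mask` is hypothetical. Note that the mask only shows which cores the process may use, not whether gensim actually uses them.

```python
import os


def cpus_in_mask(hex_mask):
    """Return the CPU indices enabled in a hexadecimal affinity mask."""
    value = int(hex_mask, 16)
    # Bit i of the mask is set when CPU i is allowed.
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


allowed = cpus_in_mask("ff")  # mask copied from the log in this thread
print("mask ff allows CPUs:", allowed)
print("logical CPUs on this machine:", os.cpu_count())
```

If the number of allowed CPUs matches the machine's core count but only one core stays busy for a long time, the run may still be in the single-core vocabulary stage.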

zhq2009 commented 8 years ago

Hello,

We used the command "wiki2vec.sh corpus output/model.w2c 50 500 10" to generate the model file. After the program had run for 20 hours, we got the error message "IOError: [Errno 2] No such file or directory: '/home/_/_/wiki2vec/wiki2vec-master/results/model.w2c.syn1neg.npy'".

Could you please give us some suggestions about how to solve the problem?

Thank you very much.
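The thread does not say what caused this error or how it was fixed. One possible cause of an "[Errno 2] No such file or directory" on a model path is that the target directory does not exist when the file is written or read, so it may be worth checking the output location before a 20-hour run. The sketch below assumes nothing about wiki2vec itself; `check_output_path` is a hypothetical helper.

```python
import os
import tempfile


def check_output_path(model_path):
    """Return a list of problems that would stop a file being written at model_path."""
    directory = os.path.dirname(os.path.abspath(model_path))
    problems = []
    if not os.path.isdir(directory):
        problems.append("missing directory: " + directory)
    elif not os.access(directory, os.W_OK):
        problems.append("directory not writable: " + directory)
    return problems


# Demo: one path inside an existing directory, one inside a missing one.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "model.w2c")
    bad = os.path.join(tmp, "results", "model.w2c")  # "results" is never created
    print(check_output_path(good))
    print(check_output_path(bad))
```

An empty list means the directory exists and is writable; anything else should be fixed before training starts.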

RishabGargeya commented 7 years ago

Hi, @zhq2009 was this issue ever resolved?

zhq2009 commented 7 years ago

Hello,

Yes, the problem was solved.

Thank you very much.

matthewdparker commented 7 years ago

Hi, I'm having the same problem when I try to generate the Corpus file - the file keeps coming up empty. I'm running the following command:

sudo sh prepare.sh en_US ~/data

Do you know why this might be?

Thank you!

Aditi138 commented 7 years ago

Hi, I am also facing the same issue.

When I ran the following snippet:

from gensim.models import Word2Vec
model = Word2Vec.load("path/to/word2vec/en.model")
model.similarity('woman', 'man')

I got the following error:

array.shape = shape
ValueError: cannot reshape array of size 108 into shape (1151090,1000)

Next, when I run "sudo sh prepare.sh en_US ~/data", the corpus file is empty. Could the two be related? If not, how can I solve these two issues?
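The reshape error says that an array file holds 108 values where 1151090 x 1000 were expected, which may mean one of the model's files is incomplete (for example, a partial download). This is a guess, not something the thread confirms. A rough check is to compare file sizes on disk with the expected payload. The sketch below assumes 4-byte float32 values, which is what gensim normally uses for its vectors; the helper names are hypothetical.

```python
import os
import tempfile


def expected_payload_bytes(rows, cols, bytes_per_value=4):
    """Bytes needed for a rows x cols array (float32 assumed by default)."""
    return rows * cols * bytes_per_value


def looks_truncated(path, rows, cols, bytes_per_value=4):
    """True when the file is smaller than the array it is supposed to hold."""
    return os.path.getsize(path) < expected_payload_bytes(rows, cols, bytes_per_value)


# Shape taken from the error message above: roughly 4.6 GB of float32 data.
print(expected_payload_bytes(1151090, 1000), "bytes expected")

# Demo with a tiny stand-in file; use the real array file path instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.npy")  # hypothetical file name
    with open(demo, "wb") as handle:
        handle.write(b"\0" * 10)
    print("truncated:", looks_truncated(demo, 1151090, 1000))
```

If any array file next to the model is far smaller than expected, fetching the model files again is a reasonable first step.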