microsoft / distributed_word_embedding

Distributed word embedding

Fetch enwiki train and vocab files #9

Open loretoparisi opened 8 years ago

loretoparisi commented 8 years ago

The "run.bat" script has the options text, read_vocab and train_file.

How can I generate or retrieve an enwiki vocabulary file?

Is it ok to process the dump and generate a vocabulary like the one made by the mkvocab.pl script, for example:

THE     84503449
AND     33700692
WAS     12911542
FOR     10342919
THAT    8318795
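
For reference, this is roughly how I would build such a vocabulary from the extracted dump; it is only a sketch, and the file names enwiki.txt and enwiki_vocab.txt are placeholders, not files from this repo:

# Minimal sketch: build an mkvocab.pl-style vocabulary (TOKEN, tab, count,
# sorted by descending count) from a plain-text dump.
# "enwiki.txt" is a placeholder path to the extracted wiki text.
import re
from collections import Counter

counts = Counter()
with open("enwiki.txt", encoding="utf-8") as f:
    for line in f:
        # Uppercase alphabetic tokens, matching the sample above.
        counts.update(re.findall(r"[A-Za-z]+", line.upper()))

with open("enwiki_vocab.txt", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word}\t{count}\n")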
loretoparisi commented 8 years ago

[UPDATE]

I have found another vocabulary dump, taken directly from Google's GoogleNews-vectors-negative300.bin file and split into 30 files of 100K words each (Google's feature vectors cover 3 million words). The files are in the vocabulary folder of this repo:

https://github.com/loretoparisi/inspect_word2vec

The vocabulary structure looks like this:

$ head -n 10 vocabulary_01.txt 
Allanah_Munson
WINDS_WILL
nab_sexual_predators
By_Alexandra_Barham
Mayor_Noramie_Jasmin
Chief_Executive_Glenn_Tilton
Neil_Kinnock
Makoto_Tamada_JPN_Konica
abductor_muscle
visit_www.availability.sungard.com

So we have commonly paired words as well as stopwords. Can I use this vocabulary as the read_vocab input then?

xuehui1991 commented 8 years ago

Hey loretoparisi, here is an example:

set size=300
set text=test_version
set read_vocab=%text%_vocab_data.txt
set train_file=%text%_training_data.txt
set binary=1
set cbow=1
set alpha=0.01
set epoch=20
set window=5
set sample=0
set hs=0
set negative=5
set threads=16
set mincount=5
set sw_file=stopwords_simple.txt
set stopwords=0
set data_block_size=1000
set max_preload_data_size=2000
set use_adagrad=0
set is_pipeline=0
set output=%text%_%size%.bin

distributed_word_embedding.exe -max_preload_data_size %max_preload_data_size% -is_pipeline %is_pipeline% -alpha %alpha% -data_block_size %data_block_size% -train_file %train_file% -output %output% -threads %threads% -size %size% -binary %binary% -cbow %cbow% -epoch %epoch% -negative %negative% -hs %hs% -sample %sample% -min_count %mincount% -window %window% -stopwords %stopwords% -sw_file %sw_file% -read_vocab %read_vocab% -use_adagrad %use_adagrad%

You can see that text is a parameter in the bat file, and train_file is the path where you put your training file.

As for the input format of train_file, it can be a raw English text file, or text in another language after word segmentation (it is better if you have removed some noise first).
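
As a rough illustration only (this is not the repo's preprocessing code, and the file names are placeholders), a simple cleaning pass could look like this:

# Rough illustration of light noise removal for train_file:
# lowercase, keep alphabetic tokens, collapse whitespace.
# "raw_corpus.txt" and "test_version_training_data.txt" are placeholder names.
import re

with open("raw_corpus.txt", encoding="utf-8") as src, \
     open("test_version_training_data.txt", "w", encoding="utf-8") as dst:
    for line in src:
        tokens = re.findall(r"[a-z]+", line.lower())
        if tokens:
            dst.write(" ".join(tokens) + "\n")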

As for the input format of vocab_file, you can check the Preprocess step in the new repo.

Or, if you don't want to use that code, make sure that your vocab_file looks like this (separated by spaces):

word_name_1 word_frequency_1
word_name_2 word_frequency_2
...
word_name_n word_frequency_n
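
If you build the vocab_file yourself, you could sanity-check the format with something like this sketch (not part of this repo; the file name is a placeholder):

# Sketch of a format check for vocab_file: each line should be
# "word_name word_frequency", separated by a space, with an integer frequency.
with open("test_version_vocab_data.txt", encoding="utf-8") as f:
    for n, line in enumerate(f, 1):
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            print(f"line {n} is malformed: {line.rstrip()}")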

Best wishes.

loretoparisi commented 8 years ago

@xuehui1991 thank you for your answer; one thing I would kindly ask you: is word_frequency_1 the term frequency expressed in tf-idf form ( http://en.wikipedia.org/wiki/Tf%E2%80%93idf )? Also, I can see a new WordEmbedding project in the Multiverso framework - https://github.com/Microsoft/Multiverso/tree/master/Applications/WordEmbedding - do we have to use that one as the DMTK word embedding now?

Thank you.

xuehui1991 commented 8 years ago

I think word_frequency_1 means the word count in the dataset; there is no need to use tf-idf. As for the repo, I think both of them are ok.