loretoparisi opened 8 years ago
[UPDATE]
I have found another vocabulary dump, taken directly from Google's GoogleNews-vectors-negative300.bin file and split into 30 files of 100K words each (Google's feature vectors cover 3 million words); they are in the vocabulary folder of this repo:
https://github.com/loretoparisi/inspect_word2vec
The vocabulary structure looks like this:
$ head -n 10 vocabulary_01.txt
Allanah_Munson
WINDS_WILL
nab_sexual_predators
By_Alexandra_Barham
Mayor_Noramie_Jasmin
Chief_Executive_Glenn_Tilton
Neil_Kinnock
Makoto_Tamada_JPN_Konica
abductor_muscle
visit_www.availability.sungard.com
so we have commonly paired words as well as stopwords.
Can I use this vocabulary as the read_vocab input, then?
Hey loretoparisi, here is an example:
set size=300
set text=test_version
set read_vocab=%text%_vocab_data.txt
set train_file=%text%_training_data.txt
set binary=1
set cbow=1
set alpha=0.01
set epoch=20
set window=5
set sample=0
set hs=0
set negative=5
set threads=16
set mincount=5
set sw_file=stopwords_simple.txt
set stopwords=0
set data_block_size=1000
set max_preload_data_size=2000
set use_adagrad=0
set is_pipeline=0
set output=%text%_%size%.bin
distributed_word_embedding.exe -max_preload_data_size %max_preload_data_size% -is_pipeline %is_pipeline% -alpha %alpha% -data_block_size %data_block_size% -train_file %train_file% -output %output% -threads %threads% -size %size% -binary %binary% -cbow %cbow% -epoch %epoch% -negative %negative% -hs %hs% -sample %sample% -min_count %mincount% -window %window% -stopwords %stopwords% -sw_file %sw_file% -read_vocab %read_vocab% -use_adagrad %use_adagrad%
You can see that text
is a parameter in the bat file, and train_file
is the path where you put your training file.
As for the input format of train_file
, it can be a raw English file, or a file in another language after word segmentation (it's better if you have removed some noise first).
As for the input format of vocab_file
, you can check Preprocess in the new repo.
If you don't want to use that code, make sure that your vocab_file
looks like this (separated by spaces):
word_name_1 word_frequency_1
word_name_2 word_frequency_2
...
word_name_n word_frequency_n
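To make that format concrete, here is a minimal Python sketch (not part of this repo; the function name and min_count default are my own) that builds such a vocab_file from a whitespace-tokenized training file. min_count mirrors the -min_count option in the script above:

```python
from collections import Counter

def build_vocab(train_path, vocab_path, min_count=5):
    """Count whitespace-separated tokens in train_path and write
    'word count' lines to vocab_path, most frequent first --
    the format described above."""
    counts = Counter()
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as out:
        for word, freq in counts.most_common():
            if freq >= min_count:
                out.write(f"{word} {freq}\n")
```

Words below min_count are dropped, matching what the trainer would discard anyway.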
Best wishes.
@xuehui1991 thank you for your answer; one thing I would kindly ask you: is word_frequency_1
the term frequency expressed in tf-idf
form ( http://en.wikipedia.org/wiki/Tf%E2%80%93idf )?
I can see a new WordEmbedding
project in the Multiverso
framework - https://github.com/Microsoft/Multiverso/tree/master/Applications/WordEmbedding
Do we have to use that one as the DMTK
word embedding now?
Thank you.
I think word_frequency_1 means the raw word count in the dataset; there is no need to use tf-idf. As for the repo, I think both of them are fine.
The "run.bat" script has the options text, read_vocab and train_file.
Is https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 ok as input?
Is text a specific file in the train_file folder?
How do I generate/retrieve an enwiki vocabulary file? Is it ok to process the dump and generate a vocabulary like the one made by the script mkvocab.pl?
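On the enwiki question: the dump is XML with wiki markup, so it should really go through an extractor (e.g. WikiExtractor, or whatever mkvocab.pl does) before counting. As a rough illustration only, here is a hedged Python sketch (the function name, token filter, and min_count threshold are my assumptions, not something from this repo) that streams the .bz2 file without decompressing it to disk and counts word-like tokens:

```python
import bz2
import re
from collections import Counter

# Crude word-shaped token filter; it does NOT strip XML/wiki markup,
# so tag and template names will also be counted -- a real pipeline
# should extract plain text first.
TOKEN = re.compile(r"[A-Za-z]+")

def count_dump(path, min_count=5):
    """Stream a .bz2 text dump line by line and return a
    {word: count} dict of lowercased tokens seen at least
    min_count times."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(t.lower() for t in TOKEN.findall(line))
    return {w: c for w, c in counts.items() if c >= min_count}
```

The resulting dict can then be written out as the "word count" lines described earlier in this thread.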