Model cannot be output if the raise the dimension of word vector

pommedeterresautee / fastrtext

R wrapper for fastText

https://pommedeterresautee.github.io/fastrtext/


Model cannot be output if the raise the dimension of word vector #17

Closed chuangys closed 5 years ago

chuangys commented 6 years ago

Everything works with the default parameter settings. But when I raise the word vector dimension to 200 or 300, model training is still fast but it hangs at model output. Could you help check it?

pommedeterresautee commented 6 years ago

Hi, can you provide some code? (one I can test, with some data)

I just tried

library(fastrtext)

data("train_sentences")
data("test_sentences")

# prepare data
tmp_file_model <- tempfile()

train_labels <- paste0("__label__", train_sentences[,"class.text"])
train_texts <- tolower(train_sentences[,"text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)

test_labels <- paste0("__label__", test_sentences[,"class.text"])
test_texts <- tolower(test_sentences[,"text"])
test_to_write <- paste(test_labels, test_texts)

# learn model
execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 200, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 2, "-verbose", 1))

model <- load_model(tmp_file_model)
predict(model, sentences = test_sentences[1, "text"])

And had no issue...

Can you try -verbose 1 in your command line?

chuangys commented 6 years ago

@pommedeterresautee Your code runs fine in my environment, so I have to correct my problem description. With the same example data plus a pre-trained vector file, I can reproduce the hang at model output.

Source code below:

library(fastrtext)

data("train_sentences")
data("test_sentences")

tmp_file_model <- tempfile()
print(tmp_file_model)

train_labels <- paste0("__label__", train_sentences[,"class.text"])
train_texts <- tolower(train_sentences[,"text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
print(train_tmp_file_txt)
writeLines(text = train_to_write, con = train_tmp_file_txt)

execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 300, "-lr", 1,
                     "-epoch", 300, "-wordNgrams", 2, "-verbose", 1,
                     "-pretrainedVectors", "e:/baproject/data/pretrainedword2vec/wiki-news-300d-1M.vec"))

The wiki-news-300d-1M.vec file was downloaded from the facebookresearch pre-trained vectors page: https://fasttext.cc/docs/en/english-vectors.html
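Before training with -pretrainedVectors, it can help to confirm that the vector file's dimension matches the -dim value. The sketch below assumes the .vec file begins with a header line of the form "<number of words> <dimension>", which is the layout of the files on the page linked above. read_vec_header is a hypothetical helper name, not part of fastrtext.

```r
# Hypothetical helper (not part of fastrtext): read only the first line of a
# .vec file, assuming it holds "<number of words> <dimension>".
read_vec_header <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1)
  parts <- as.integer(strsplit(trimws(header), "\\s+")[[1]])
  list(n_words = parts[1], dim = parts[2])
}

# Usage with a small stand-in file; replace the path with the real .vec file.
demo_vec <- tempfile(fileext = ".vec")
writeLines(c("2 3", "hello 0.1 0.2 0.3", "world 0.4 0.5 0.6"), con = demo_vec)
header <- read_vec_header(demo_vec)
print(header$dim)  # compare this value with the one passed to "-dim"
```

Only the first line is read, so the check stays cheap even for a file with a million words.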

pommedeterresautee commented 6 years ago

It may be related to a RAM issue. Did you fix it?

dockstreet commented 6 years ago

Hi - I'm having the same issue as @chuangys; it seems to hang on the larger vec file. I have 16 GB of RAM.

pommedeterresautee commented 6 years ago

Do you have some test code? Did you check the RAM usage? (Models trained by Facebook are quite big.)
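A rough back-of-the-envelope estimate shows why RAM and the size of the written model are worth checking. The sketch below rests on several assumptions, none of them confirmed in this thread: values are stored as 4-byte floats, the input matrix has one row per vocabulary word plus one per hash bucket, and the bucket count is fastText's default of 2,000,000 (used here because -wordNgrams is 2). estimate_matrix_gib is a hypothetical helper name.

```r
# Hypothetical helper: approximate size of the input matrix in GiB.
# Assumptions: 4-byte floats, (n_words + bucket) rows, `dim` columns,
# and a default bucket count of 2,000,000.
estimate_matrix_gib <- function(n_words, dim, bucket = 2e6, bytes_per_value = 4) {
  (n_words + bucket) * dim * bytes_per_value / 1024^3
}

# wiki-news-300d-1M.vec: about 1 million words, 300 dimensions
estimate_matrix_gib(n_words = 1e6, dim = 300)

# The bundled example without pre-trained vectors has a far smaller vocabulary
# (10,000 words is only an illustrative figure)
estimate_matrix_gib(n_words = 1e4, dim = 200)
```

Under these assumptions the first call comes out at a little over 3 GiB for the matrix alone, before any copies made while saving, so a long pause at model output would not be surprising on a 16 GB machine.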

dockstreet commented 6 years ago

I do.

execute(commands = c("supervised", "-input", "C:/Users/xxx/R/fasttext_test/train.txt",
                     "-output", "C:/Users/xxx/R/fasttext_test/train.bin",
                     "-lr", 1, "-epoch", 50, "-wordNgrams", 2, "-verbose", 1))

This worked (while the Facebook one would not). However, I'm using pre-trained vectors: https://github.com/jazzyarchitects/fasttext-node/raw/master/train.txt

Here is the RAM size:

memory.limit()
[1] 16204

Do you know of a larger example that works with fastrtext and a pre-trained vec file from an external source? Trying it may help clarify whether the problem is my environment or not.

datalee commented 6 years ago

Hi, I have a question: does the pretrainedVectors argument not support the vec files produced by gensim? Thanks.

pommedeterresautee commented 6 years ago

@datalee what is the feature you are referring to?

datalee commented 6 years ago

@pommedeterresautee classification.

pommedeterresautee commented 6 years ago

pretrainedVectors is the text file produced by fasttext when you learn a model, whatever it is. I don't know the gensim format, but it should not be hard to convert (word\tvector where each value is separated by a space).
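For vectors already loaded into R, writing such a text file could look like the sketch below. It uses the layout of the wiki-news .vec files mentioned earlier in this thread (a header line with word count and dimension, then one word per line followed by space-separated values) rather than a tab separator. Which separator fastText requires is an assumption to verify against your own files. write_vec_file is a hypothetical helper name, and the input is assumed to be a numeric matrix with the words as row names.

```r
# Hypothetical helper: write a numeric matrix (row names = words) as a text
# vector file. Assumed layout: "<n_words> <dim>" header, then
# "<word> <v1> <v2> ..." with single spaces between fields.
write_vec_file <- function(vectors, path) {
  stopifnot(is.matrix(vectors), is.numeric(vectors), !is.null(rownames(vectors)))
  header <- paste(nrow(vectors), ncol(vectors))
  # Fixed-point formatting with 6 decimals, values joined by single spaces
  values <- apply(vectors, 1, function(v) {
    paste(formatC(v, format = "f", digits = 6), collapse = " ")
  })
  writeLines(c(header, paste(rownames(vectors), values)), con = path)
}

# Usage with a toy 2 x 3 matrix
toy <- matrix(c(0.1, 0.2, 0.3,
                0.4, 0.5, 0.6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("hello", "world"), NULL))
out_file <- tempfile(fileext = ".vec")
write_vec_file(toy, out_file)
cat(readLines(out_file), sep = "\n")
```

The dimension written in the header then has to match the value passed to -dim.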