vinhkhuc / JFastText

Java interface for fastText

Why does JFastText only accept models trained with JFastText? #28

Open ali3assi opened 6 years ago

ali3assi commented 6 years ago

Hello,

How can I read a pretrained model? I tried to load the existing .vec and .bin files, but loading the model raises an exception. It looks like the format is incompatible and JFastText only accepts models trained with JFastText.

lidalei commented 6 years ago

You can upgrade the fastText sources in the cpp folder to the released version, then run mvn clean install. The compiled jar-with-dependencies package will be compatible with newer pre-trained models.

xikunlun001 commented 6 years ago

@lidalei I got the errors below when upgrading fastText. Could you take a look at your convenience? Thanks.

In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:83:18: warning: 'getVector' is deprecated: getVector is being deprecated and replaced by getWordVector. [-Wdeprecated-declarations]
        fastText.getVector(vec, word);
                 ^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/fasttext.h:63:3: note: 'getVector' has been explicitly marked deprecated here
  FASTTEXT_DEPRECATED(
  ^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/utils.h:15:50: note: expanded from macro 'FASTTEXT_DEPRECATED'
#define FASTTEXT_DEPRECATED(msg) __attribute__((deprecated(msg)))
                                                 ^
In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:84:38: error: 'data_' is a protected member of 'fasttext::Vector'
        return std::vector<real>(vec.data_, vec.data_ + vec.m_);
                                     ^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/vector.h:26:23: note: declared protected here
    std::vector<real> data_;
                      ^
In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:84:49: error: 'data_' is a protected member of 'fasttext::Vector'
        return std::vector<real>(vec.data_, vec.data_ + vec.m_);
                                                ^
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fastText/src/vector.h:26:23: note: declared protected here
    std::vector<real> data_;
                      ^
In file included from /Users/xichen/Desktop/NLP project/JFastText/target/classes/com/github/jfasttext/jniFastTextWrapper.cpp:102:
In file included from /Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper_javacpp.h:13:
/Users/xichen/Desktop/NLP project/JFastText/src/main/java/../cpp/fasttext_wrapper.cc:84:61: error: no member named 'm_' in 'fasttext::Vector'
        return std::vector<real>(vec.data_, vec.data_ + vec.m_);

lidalei commented 6 years ago

The warning says that getVector is deprecated and has been replaced by getWordVector. Besides, the Vector class was rewritten: you can no longer access the data_ or m_ members of a vector directly. Instead, you have to use vector.data() and vector.size(). I'd suggest having a look at my fork: https://github.com/lidalei/JFastText

xikunlun001 commented 6 years ago

@lidalei Thanks. I used the code in your fork but got the following error in loadModel. Could you have a look?

Exception in thread "main" java.lang.UnsatisfiedLinkError: com.github.jfasttext.FastTextWrapper$FastTextApi.checkModel(Ljava/lang/String;)Z
    at com.github.jfasttext.FastTextWrapper$FastTextApi.checkModel(Native Method)
    at com.github.jfasttext.JFastText.loadModel(JFastText.java:29)
    at com.github.jfasttext.JFastText.main(JFastText.java:203)

lidalei commented 6 years ago

Could you share your code?

ali3assi commented 6 years ago

Hello @lidalei, I just installed your code: https://github.com/lidalei/JFastText

I took the two generated jars from JFastText/target/ and added them to the build path in Eclipse.

In my TestDriver class I declared:

import com.github.jfasttext.JFastText;

public class TestDriver {

    public static void main(String[]args){
        JFastText jft = new JFastText();

    }
}

Running the code, I get the following exception:

Exception in thread "main" java.lang.UnsatisfiedLinkError: no jniFastTextWrapper in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1191)
    at org.bytedeco.javacpp.Loader.load(Loader.java:953)
    at org.bytedeco.javacpp.Loader.load(Loader.java:854)
    at com.github.jfasttext.FastTextWrapper.<clinit>(FastTextWrapper.java:11)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.bytedeco.javacpp.Loader.load(Loader.java:913)
    at org.bytedeco.javacpp.Loader.load(Loader.java:854)
    at com.github.jfasttext.FastTextWrapper$FastTextApi.<clinit>(FastTextWrapper.java:442)
    at com.github.jfasttext.JFastText.<init>(JFastText.java:23)
    at TestDriver.main(TestDriver.java:6)

Any idea how to solve this issue please?

lidalei commented 6 years ago

You should use only jfasttext-0.1.0-jar-with-dependencies.jar, which is generated by running mvn clean install.
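The stack trace above shows JavaCPP's Loader falling back to System.loadLibrary, which searches java.library.path. A minimal diagnostic sketch, using only standard JDK calls, prints what the JVM will look for on the current machine. The class name NativeLibDiagnostics is made up for illustration; the library name jniFastTextWrapper is taken from the exception message.

```java
final class NativeLibDiagnostics {

    /** Builds a short report of what the JVM looks for when loading a native library by name. */
    static String describe(String libName) {
        StringBuilder sb = new StringBuilder();
        sb.append("os.name = ").append(System.getProperty("os.name")).append('\n');
        sb.append("os.arch = ").append(System.getProperty("os.arch")).append('\n');
        // Platform-specific file name, e.g. a "lib" prefix and ".so" suffix on Linux.
        sb.append("expected file name = ").append(System.mapLibraryName(libName)).append('\n');
        // Directories searched by System.loadLibrary.
        sb.append("java.library.path = ").append(System.getProperty("java.library.path")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Library name as it appears in the UnsatisfiedLinkError message.
        System.out.print(describe("jniFastTextWrapper"));
    }
}
```

If the expected file is in none of the listed directories and is not bundled in the jar for this platform, the error above is the likely result.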

lidalei commented 6 years ago

Btw, you should clone https://github.com/lidalei/fastText into the subfolder 'src/main/cpp/fastText' to compile the native library. @TamouzeAssi @xikunlun001

ali3assi commented 6 years ago

Unfortunately, it does not work on Windows.

ali3assi commented 6 years ago

The problem still exists. We are trying to load a pre-trained model. When we read it with jft.loadModel(path/to/pretrained_model), we get the following exception:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Model file's format is not compatible with this JFastText version!

Note that we took the new fork of JFastText, deleted the files in src/cpp/fasttext, cloned fastText there again, and then ran mvn clean install.

Any idea how to solve this problem?

lidalei commented 6 years ago

@TamouzeAssi "Model file's format is not compatible with this JFastText version" means you should train your model with the corresponding version of fastText or JFastText. Don't install fasttext with pip; that package is not the official one. Follow these instructions to install the Python binding: https://github.com/facebookresearch/fastText/tree/master/python.

ali3assi commented 6 years ago

@lidalei First, thank you for your help. I will try to clone fastText from the link you mentioned into the cpp subfolder of JFastText. Please correct me if I'm wrong.

I want to use a model pretrained with the fastText library, such as wiki.en. That model was trained with fastText, which is different from JFastText. I don't want to train it again, for several reasons. Thank you.

lidalei commented 6 years ago

You can download word embeddings from https://fasttext.cc/docs/en/pretrained-vectors.html. I haven't tried them, but I believe they work. JFastText relies on fastText. If JFastText complains, it means the model was trained with a version of fastText that is not compatible with the fastText JFastText is using.
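One way to narrow down a "format is not compatible" error is to look at the first bytes of the file. The sketch below assumes that a fastText .bin model starts with two 32-bit integers, a magic number followed by a format version, written in little-endian order, and that the magic number is 793712314. Both are assumptions recalled from the fastText sources, not something stated in this thread, so verify them against the fastText version you build with. The class name ModelHeaderCheck is hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ModelHeaderCheck {

    // ASSUMPTION: magic number of fastText binary models; check fasttext.cc in your checkout.
    static final int ASSUMED_MAGIC = 793712314;

    /** Interprets the first 8 bytes as two little-endian int32 values: {magic, version}. */
    static int[] parseHeader(byte[] first8) {
        ByteBuffer bb = ByteBuffer.wrap(first8, 0, 8).order(ByteOrder.LITTLE_ENDIAN);
        return new int[] { bb.getInt(), bb.getInt() };
    }

    /** True if the header carries the assumed magic number of a binary model. */
    static boolean looksLikeBinaryModel(int[] header) {
        return header[0] == ASSUMED_MAGIC;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ModelHeaderCheck <model-file>");
            return;
        }
        byte[] first8 = new byte[8];
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new DataInputStream(in).readFully(first8); // throws EOFException on very short files
        }
        int[] header = parseHeader(first8);
        System.out.println("magic   = " + header[0]);
        System.out.println("version = " + header[1]);
        System.out.println("looks like a binary model: " + looksLikeBinaryModel(header));
    }
}
```

A text .vec file will not carry the magic number at all. A .bin file with the right magic number but a version newer than the bundled fastText understands would presumably also be rejected, which is the situation described above.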

ali3assi commented 6 years ago

@lidalei Sorry, but it is still not working. The same exception is raised when I try to load word embeddings from l.

I cloned your fork of JFastText, deleted the folder cpp/fastText, cloned it again from where you said, and then ran mvn clean install. The exception is still raised.

Can you please describe the steps, or try to load a pretrained model using JFastText yourself?

lidalei commented 6 years ago

@TamouzeAssi I guess you were trying to load a word embedding. That cannot be loaded! Try to load a model from https://fasttext.cc/docs/en/language-identification.html.

lidalei commented 6 years ago

@TamouzeAssi If that does not work, try using the Python interface of fastText to load your model.

ali3assi commented 6 years ago

@lidalei The model lid.176.bin from https://fasttext.cc/docs/en/language-identification.html can be loaded in JFastText without any error.

But with wiki.en from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, loading fails with the incompatible file format error.

By the way, do you have a good reference for a model trained on Wikipedia? I want to use the vector embeddings to cover OOV words.

lidalei commented 6 years ago

@TamouzeAssi There is no problem with this. Pretrained vectors are just a word embedding that represents each word as a vector. It does not perform any classification task by itself. A classifier is built on top of the word embedding. For example, you can represent a sentence as the mean of its words' vectors and train a classifier to classify an unknown sentence.

I don't have a model for you. It really depends on your task. What do you want to achieve?
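The mean-vector idea mentioned above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the word vectors are already available as a Map from word to float array, words missing from the map are skipped, and cosine similarity is used to compare two sentence vectors (the thread does not prescribe a similarity measure). The class name SentenceSimilarity and the toy vectors are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SentenceSimilarity {

    /** Averages the vectors of all tokens found in the map; unknown tokens are skipped. */
    static float[] meanVector(List<String> tokens, Map<String, float[]> vectors, int dim) {
        float[] mean = new float[dim];
        int found = 0;
        for (String token : tokens) {
            float[] v = vectors.get(token);
            if (v == null) continue; // out-of-vocabulary word
            for (int i = 0; i < dim; i++) mean[i] += v[i];
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) mean[i] /= found;
        }
        return mean; // all zeros if no token was known
    }

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0 : dot / denom;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional vectors, invented for the example.
        Map<String, float[]> vectors = new HashMap<>();
        vectors.put("good", new float[] {1.0f, 0.2f});
        vectors.put("great", new float[] {0.9f, 0.3f});
        vectors.put("movie", new float[] {0.1f, 1.0f});
        float[] s1 = meanVector(Arrays.asList("good", "movie"), vectors, 2);
        float[] s2 = meanVector(Arrays.asList("great", "movie"), vectors, 2);
        System.out.println("similarity = " + cosine(s1, s2));
    }
}
```

Skipping unknown words is exactly where plain word vectors lose information on noisy text, since a misspelled token contributes nothing to the mean.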

ali3assi commented 6 years ago

@lidalei For the moment I just want to compute the similarity between two sentences that contain some noise (typos). I used word2vec but got bad results due to OOV words, so I switched to fastText in order to use the subword information.
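To illustrate why subword information helps with typos, here is a minimal sketch of character n-gram extraction in the style fastText is generally described as using: the word is wrapped in "<" and ">" boundary markers and every substring of length minn to maxn is taken. It works on Java chars rather than UTF-8 bytes, and it leaves out the hashing of n-grams into buckets, so treat it as an approximation and not as fastText's actual implementation. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CharNgrams {

    /** Returns all character n-grams of length minn..maxn of the word wrapped in '<' and '>'. */
    static List<String> charNgrams(String word, int minn, int maxn) {
        String wrapped = "<" + word + ">";
        List<String> ngrams = new ArrayList<>();
        for (int start = 0; start < wrapped.length(); start++) {
            for (int n = minn; n <= maxn && start + n <= wrapped.length(); n++) {
                ngrams.add(wrapped.substring(start, start + n));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // A correctly spelled word and a typo share most of their n-grams.
        System.out.println(charNgrams("where", 3, 3));
        System.out.println(charNgrams("wher", 3, 3));
    }
}
```

Because a misspelled word shares most of its n-grams with the intended word, a vector built from n-gram vectors stays close to the correct word's vector even when the misspelling itself was never seen in training.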

lidalei commented 6 years ago

@TamouzeAssi Have a look at https://radimrehurek.com/gensim/models/word2vec.html#module-gensim.models.word2vec

ali3assi commented 6 years ago

@lidalei I used gensim to get the word2vec model, but I developed my algorithm in Java, and gensim cannot be used from Java even with Jython.

lidalei commented 6 years ago

@TamouzeAssi I will add the function to my JFastText repo and let you know as soon as it is complete.

lidalei commented 6 years ago

@TamouzeAssi It won't be available soon enough to help you. I'd suggest you check

void FastText::loadVectors(std::string filename) {
  std::ifstream in(filename);
  std::vector<std::string> words;
  std::shared_ptr<Matrix> mat; // temp. matrix for pretrained vectors
  int64_t n, dim;
  if (!in.is_open()) {
    throw std::invalid_argument(filename + " cannot be opened for loading!");
  }
  in >> n >> dim;
  if (dim != args_->dim) {
    throw std::invalid_argument(
        "Dimension of pretrained vectors (" + std::to_string(dim) +
        ") does not match dimension (" + std::to_string(args_->dim) + ")!");
  }
  mat = std::make_shared<Matrix>(n, dim);
  for (size_t i = 0; i < n; i++) {
    std::string word;
    in >> word;
    words.push_back(word);
    dict_->add(word);
    for (size_t j = 0; j < dim; j++) {
      in >> mat->at(i, j);
    }
  }
  in.close();

  dict_->threshold(1, 0);
  input_ = std::make_shared<Matrix>(dict_->nwords()+args_->bucket, args_->dim);
  input_->uniform(1.0 / args_->dim);

  for (size_t i = 0; i < n; i++) {
    int32_t idx = dict_->getId(words[i]);
    if (idx < 0 || idx >= dict_->nwords()) continue;
    for (size_t j = 0; j < dim; j++) {
      input_->at(idx, j) = mat->at(i, j);
    }
  }
}

and write some Java code to read pretrained vectors.
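A rough Java counterpart of the reading part of loadVectors might look like the sketch below. It assumes the text layout that the C++ code implies: a header line with the number of words and the dimension, followed by one line per word containing the word and its dim values separated by whitespace. Unlike the C++ function, it does not touch a dictionary or an input matrix; it only fills a Map. VecFileReader is a hypothetical name and not part of JFastText.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

final class VecFileReader {

    /** Reads "n dim" from the first line, then up to n lines of "word v1 ... vdim". */
    static Map<String, float[]> read(Iterator<String> lines) {
        if (!lines.hasNext()) throw new IllegalArgumentException("empty input");
        String[] header = lines.next().trim().split("\\s+");
        int n = Integer.parseInt(header[0]);
        int dim = Integer.parseInt(header[1]);
        Map<String, float[]> vectors = new LinkedHashMap<>();
        while (lines.hasNext() && vectors.size() < n) {
            String[] parts = lines.next().trim().split("\\s+");
            if (parts.length != dim + 1) continue; // skip malformed lines
            float[] v = new float[dim];
            for (int j = 0; j < dim; j++) {
                v[j] = Float.parseFloat(parts[j + 1]);
            }
            vectors.put(parts[0], v);
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java VecFileReader <file.vec>");
            return;
        }
        // Stream the file line by line; pretrained .vec files can be several gigabytes.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Map<String, float[]> vectors = read(lines.iterator());
            System.out.println("loaded " + vectors.size() + " vectors");
        }
    }
}
```

Holding every vector of a large pretrained file in a map needs a lot of heap, so in practice it may be worth stopping after the most frequent words. Note also that a .vec file contains only whole-word vectors; as far as I know, the subword information needed for OOV words is stored in the .bin model.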

ali3assi commented 6 years ago

@lidalei Thank you, I will try to write similar code. By the way, please let me know when you add the function to your JFastText repo.

renzherl commented 5 years ago

val fasttext = new JFastText()
fasttext.loadModel("/home/work/XX/model/model.bin")

java.lang.IllegalArgumentException: Model file doesn't exist!
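Before calling loadModel, it can help to check what the running process actually sees at that path. The following sketch uses only java.nio.file; the class name ModelPathCheck is made up, and the path to test is passed as an argument.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ModelPathCheck {

    /** Describes whether the given path is a readable regular file for this process. */
    static String describe(Path path) {
        Path abs = path.toAbsolutePath().normalize();
        if (!Files.exists(abs)) return abs + ": does not exist";
        if (!Files.isRegularFile(abs)) return abs + ": exists but is not a regular file";
        if (!Files.isReadable(abs)) return abs + ": exists but is not readable by this process";
        return abs + ": readable regular file";
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java ModelPathCheck <model-file>");
            return;
        }
        System.out.println(describe(Paths.get(args[0])));
    }
}
```

Running the check on the same machine and under the same user as the failing code shows whether the problem is the path itself or something specific to the loader.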