pommedeterresautee / fastrtext

R wrapper for fastText
https://pommedeterresautee.github.io/fastrtext/

Predict error #24

Closed gabrielwong1991 closed 5 years ago

gabrielwong1991 commented 6 years ago

Hi,

I am using fastrtext to label some Twitter messages. I want to train a supervised model and use it to predict labels for other Twitter messages. Luckily, some people helped me classify the training messages into two labels.

Since every Twitter message is different, the trained model cannot contain every word that appears in new messages.

When I try to get the predictions with: predictions <- predict(model, sentences = test_to_write)

I get this message which I believe is normal because I didn't have all the words in my model:

Error: Some sentences have no predictions. It may be caused by the fact that all their words are have not been seen during the training.

After getting this error, I would like to at least view the messages that were classified. Is there a way to do this? The "predictions" object does not exist because the error is raised before it is assigned.

Lastly, for this type of data, is there a way to clean the text with regular expressions? For example, in Twitter messages "yeah" can be written as "yh", "yea", etc., and the model obviously treats these as three completely different words.

Many thanks.

pommedeterresautee commented 6 years ago

Look at the documentation of the prediction method ?fastrtext::predict.Rcpp_fastrtext :

unlock_empty_predictions    
logical to avoid crash when some predictions are not provided for some sentences because all their words have not been seen during training. This parameter should only be set to TRUE to debug.
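
To see which messages were left without a prediction, the call can be repeated with unlock_empty_predictions = TRUE and the empty elements picked out afterwards. The following is a minimal sketch, not taken from the package documentation. It assumes that predict() returns a list with one element per sentence and that a sentence with no prediction yields a zero-length element. The helper name empty_prediction_mask and the mock values are made up for illustration.

```r
# Hypothetical helper: TRUE for every sentence that received no label.
# Assumes `predictions` is a list with one element per sentence and that
# a sentence without any prediction is a zero-length element.
empty_prediction_mask <- function(predictions) {
  lengths(predictions) == 0
}

# Mock data standing in for the result of
#   predict(model, sentences = test_to_write, unlock_empty_predictions = TRUE)
# The labels and probabilities below are invented for the example.
test_to_write <- c("good morning", "zzzz qqqq", "nice day")
predictions <- list(
  c(`__label__pos` = 0.91),
  numeric(0),
  c(`__label__pos` = 0.77)
)

is_empty <- empty_prediction_mask(predictions)
unpredicted <- test_to_write[is_empty]   # messages with no prediction
predicted   <- test_to_write[!is_empty]  # messages that were classified
```

A logical mask is used rather than negative indexing, because x[-which(...)] returns an empty vector when nothing is empty.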

Regarding the second question, fastText uses subwords if you set the character n-gram parameters, so it will understand that coooool and cooooooool are related in some way.
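
If explicit cleaning is still wanted before training, a regex pass in base R is one option. This is only a sketch under assumptions: the function name normalize_tweet and the variant table (yh, yea mapped to yeah) are hypothetical and would have to be adapted to the actual data. It is not part of fastrtext.

```r
# Hypothetical cleaning function using base R only.
# `variants` maps spelling variants (names) to a canonical form (values).
normalize_tweet <- function(x, variants = c(yh = "yeah", yea = "yeah")) {
  x <- tolower(x)
  # Collapse any character repeated three or more times down to two,
  # e.g. "coooool" and "cooooooool" both become "cool".
  x <- gsub("(.)\\1{2,}", "\\1\\1", x, perl = TRUE)
  # Replace whole-word spelling variants with their canonical form.
  for (from in names(variants)) {
    x <- gsub(paste0("\\b", from, "\\b"), variants[[from]], x, perl = TRUE)
  }
  x
}

cleaned <- normalize_tweet(c("Yh that was coooool", "yea cooooooool"))
# cleaned is c("yeah that was cool", "yeah cool")
```

The same function should be applied to both the training text and the text passed to predict(), otherwise the two vocabularies drift apart again.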

gabrielwong1991 commented 6 years ago

Thanks for the reply. I just installed Linux and used the gcc-compiled base fastText. Loading the pretrainedVectors file there takes about 10 seconds, whereas in fastrtext it takes 2 hours... In fact, the pretrainedVectors file was produced with fastrtext and I simply copied it over.

Also with the same params and supervised model I don't get:

Error: Some sentences have no predictions. It may be caused by the fact that all their words are have not been seen during the training.

Unfortunately, I cannot produce a reproducible example, so if anyone else runs into this problem, please raise it.

hack-r commented 5 years ago

This is helpful! Thanks for unlock_empty_predictions

I wish there were more diagnostics because I always get this error even with 90% training and 10% test. I guess I thought it would be less sensitive to some small variations especially with the out-of-vocab aspect of its toolkit/internals?

Or do most people just filter on the results of fastTextR::get_words(training_data) ?

pommedeterresautee commented 5 years ago

Myself, I just use the minn and maxn options. Using subwords gives you an embedding for everything.
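
To illustrate why subwords help, here is a small base R sketch that lists the character n-grams of a word between minn and maxn, with < and > as word boundary markers. It is an illustration only, not fastText's implementation (which also hashes the n-grams into buckets). The function name char_ngrams is hypothetical.

```r
# Hypothetical illustration: character n-grams of a word, lengths minn..maxn.
char_ngrams <- function(word, minn = 3, maxn = 6) {
  stopifnot(minn >= 1, minn <= maxn)
  # Add boundary markers, then split into single characters.
  chars <- strsplit(paste0("<", word, ">"), "")[[1]]
  n_chars <- length(chars)
  grams <- character(0)
  for (n in seq(minn, maxn)) {
    if (n > n_chars) break
    starts <- seq_len(n_chars - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(chars[i:(i + n - 1)], collapse = ""),
      character(1)
    ))
  }
  unique(grams)
}

# Subwords shared by two spellings that differ only in elongation.
shared <- intersect(char_ngrams("coooool"), char_ngrams("cooooooool"))
```

Both spellings share n-grams such as "<co", "ooo" and "ol>", so an unseen spelling can still be given a vector from pieces seen during training. In the fastText command line the corresponding options are -minn and -maxn.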

pommedeterresautee commented 5 years ago

New version pushed to CRAN. Closing.