pommedeterresautee / fastrtext

R wrapper for fastText
https://pommedeterresautee.github.io/fastrtext/

Debugging sentences with no predictions #19

Closed alanault closed 6 years ago

alanault commented 6 years ago

Hi there,

Thanks for a great package - really enjoying trying this out.

Quick question: I'm getting the "sentences have no predictions" error. To get around this I've set unlock_empty_predictions = TRUE so that I can see the output and debug, as per the help file.

How do I go about debugging this? Is there an easy way to see which input texts got no prediction? The output is just a list of classes and probabilities, so I wasn't sure of the best way to investigate further.

Any tips? I looked for short tweets (I'm working with tweets, so I wondered if the problem was posts containing just a single hashtag).

Many thanks

Alan

pommedeterresautee commented 6 years ago

The easiest thing to do is to tokenize the texts that get no prediction. Then you extract the dictionary from fastText and check that they really contain no known word (if some words are known, there is an issue). On Twitter the text is not normalized, so this can happen quite easily. One idea (not tested) would be to add ngrams.
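For reference, a minimal sketch (not from the thread) of isolating those inputs, using the unlock_empty_predictions argument mentioned above and the simplify argument discussed later; texts and model are placeholders, and empty predictions are assumed to come back as zero-length slots:

library(fastrtext)

# one slot per input sentence; slots with no prediction stay empty
preds <- predict(model, texts, simplify = FALSE, unlock_empty_predictions = TRUE)

# indices of the inputs that received no prediction, ready to tokenize and inspect
no_pred_idx <- which(lengths(preds) == 0)
texts[no_pred_idx]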

alanault commented 6 years ago

That makes sense - I've created a unique set of tokens for the test set and then compared them with the dictionary, like so:

# split each test text into word tokens, then keep the unique set
test_dic <- stringr::str_split(test_texts, stringr::boundary("word")) %>%
  unlist() %>%
  unique()

# tokens in the test set that are absent from the model's dictionary
missing_words <- test_dic[!test_dic %in% get_dictionary(model)]

Not the fastest approach, but it does surface the differences. As you say, with tweets there is always a high likelihood of unseen vocabulary, as the posts, sub-topics and hashtags are quite dynamic.

I wasn't quite clear on what you mean by ngrams?

How does the underlying fastText deal with this? Does it rely on every single post containing no new words? I understood that the unsupervised part "guessed" the appropriate vector from a token's individual components?

pommedeterresautee commented 6 years ago

fastText needs to compute a representation of the sentence to make a prediction. If all words are unknown, it won't be able to generate such a representation. I am wondering if using these options helps:

 -minn               min length of char ngram [3]
 -maxn               max length of char ngram [6]

My thinking is that even if a word is unknown, its character ngrams may be enough to get something. If it really doesn't work, then whenever there is no prediction you can "predict" the biggest class.
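As an illustration (not something tested here), a sketch of passing those options to a supervised run through fastrtext's execute(); train_file and model_file are placeholder paths and the other hyperparameters are arbitrary:

library(fastrtext)

# learn character ngrams of length 2 to 5 alongside the word vectors,
# so an unseen word can still get a representation from its subwords
execute(commands = c("supervised",
                     "-input", train_file,    # placeholder: labelled training file
                     "-output", model_file,   # placeholder: prefix for the saved model
                     "-minn", 2, "-maxn", 5,
                     "-dim", 50, "-epoch", 20))

model <- load_model(paste0(model_file, ".bin"))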

Another possible option is to initialise the supervised model with word vectors from an already learned unsupervised model.
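In the fastText command line that initialisation goes through the -pretrainedVectors option; a rough sketch, with all paths as placeholders and the dimension kept identical between the two runs (which -pretrainedVectors requires):

# 1. learn unsupervised vectors on a large unlabelled corpus
execute(commands = c("skipgram", "-input", unlabelled_file, "-output", vectors_file,
                     "-dim", 50, "-minn", 2, "-maxn", 5))

# 2. start the supervised classifier from those vectors
execute(commands = c("supervised", "-input", train_file, "-output", model_file,
                     "-dim", 50, "-pretrainedVectors", paste0(vectors_file, ".vec")))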

alanault commented 6 years ago

The minn/maxn options might be worth a try. Interestingly, most of the issues are due to variable punctuation, e.g. whoa.. whoa... whoa....

Stripping out all the punctuation works, but you also lose a lot of information and the accuracy becomes quite poor.

I expected that if something was truly unknown, an NA would be returned; as you suggest, the biggest class could then be used. However, because nothing is returned, if you have 10 missing outputs you can't tell where they are, which makes every prediction unusable because they're out of step with the input texts.

I may investigate an unsupervised model and then see if we can add a supervised stage on top.

pommedeterresautee commented 6 years ago

For the NA thing, it is already possible: use unlock_empty_predictions together with simplify = FALSE to get a list. Is that not enough?

alanault commented 6 years ago

I thought it would be - however I couldn't see a way of linking the predictions back to the original input sentences.

e.g. if I input 1,000 sentences and get 999 predictions back, it's not clear which predictions relate to which input sentences. If the failed sentence was number 1, then potentially the rest are all misaligned?

pommedeterresautee commented 6 years ago

If you use simplify = FALSE, it should return a list of the same size as the input, but some slots will be empty.
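A sketch of how such a list could be mapped back onto the inputs and padded with a default label; majority_class stands in for whatever the most common class is, and each non-empty slot is assumed to be a named vector of probabilities:

preds <- predict(model, texts, simplify = FALSE, unlock_empty_predictions = TRUE)

# one label per input sentence; empty slots fall back to the majority class
labels <- vapply(preds, function(p) {
  if (length(p) == 0) majority_class else names(p)[1]
}, character(1))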

alanault commented 6 years ago

Ah yes - I was using simplify to get a vector back, which I was then inserting into a dataframe/tibble.

Should be relatively easy to identify these, add a default value, and then replace it with the most common class as necessary.

Thanks for your help!