loretoparisi / fasttext.js

FastText for Node.js
MIT License

Nearest Neighbor sometimes returns only 1 result #11

Closed Zorkling closed 6 years ago

Zorkling commented 6 years ago

When using nearest neighbor, the result set occasionally contains only a single result when run locally (and always does so when hosted on Beanstalk with Docker). Running the same request again may return the full result set.

I traced the issue to the line self.dataAppendCallback = onDataCallback; which should assign onDataAppendCallback instead.

I cannot do a pull request at the moment, but may be able to in the future.

loretoparisi commented 6 years ago

@Zorkling thanks a lot, let me have a look and fix it!

Zorkling commented 6 years ago

@loretoparisi Thanks!

Just realized I stupidly left out the file and line number: lib/index.js, line 534, in the FastText.prototype.nn function.

loretoparisi commented 6 years ago

Hello @Zorkling, I have tested it out. We first train the example Word2Vec model with the default skipgram settings on the example SMS spam/ham dataset:

$ cd fasttext.js/examples/
$ node word2vec
learn [ 'skipgram',
  '-input',
  '/var/folders/_b/szqwdfn979n4fdg7f2j875_r0000gn/T/491ed5af-d7fc-475c-817b-00466b432cf5.csv',
  '-output',
  '/Users/loretoparisi/Documents/Projects/AI/fasttext.js/examples/data/sms_model_w2v',
  '-wordNgrams',
  1,
  '-minCount',
  1,
  '-minCountLabel',
  1,
  '-minn',
  3,
  '-maxn',
  6,
  '-t',
  0.0001,
  '-bucket',
  2000000,
  '-dim',
  10,
  '-lr',
  0.01,
  '-ws',
  5,
  '-loss',
  'ns',
  '-lrUpdateRate',
  100,
  '-epoch',
  5,
  '-thread',
  8,
  '-verbose',
  4,
  '-neg',
  5 ]
{ W: 3706, L: 2 }
Read 0M words
Number of words:  3706
Number of labels: 2
Progress: 100.0% words/sec/thread:    7773 lr:  0.000000 loss:  4.133226 ETA:   0h 0m
exec:fasttext end.
exec:fasttext exit.
Train ended
labels [ 'spam', 'ham' ]
model unloaded.

At this point we run the nearest neighbor example. Note that I'm randomly picking the query words from the provided samples:

$ node nearest.js 
find Nearest Neighbor of "claim"
[
  {
    "word": "huai",
    "similarity": "0.869137"
  },
  {
    "word": "�..",
    "similarity": "0.859109"
  },
  {
    "word": "boy",
    "similarity": "0.858461"
  },
  {
    "word": "ave",
    "similarity": "0.84622"
  },
  {
    "word": "�2000",
    "similarity": "0.837389"
  },
  {
    "word": "culdnt",
    "similarity": "0.801518"
  },
  {
    "word": "you�re",
    "similarity": "0.800617"
  },
  {
    "word": "300",
    "similarity": "0.797127"
  },
  {
    "word": "e=",
    "similarity": "0.782993"
  },
  {
    "word": "�100",
    "similarity": "0.765529"
  }
]
find Nearest Neighbor of "representative"
[
  {
    "word": "jus",
    "similarity": "0.917662"
  },
  {
    "word": "far",
    "similarity": "0.847942"
  },
  {
    "word": "xclusive@clubsaisai",
    "similarity": "0.84633"
  },
  {
    "word": "wenwecan",
    "similarity": "0.819176"
  },
  {
    "word": "ab..",
    "similarity": "0.814577"
  },
  {
    "word": "activate",
    "similarity": "0.811364"
  },
  {
    "word": "italian",
    "similarity": "0.808164"
  },
  {
    "word": "free.",
    "similarity": "0.799071"
  },
  {
    "word": "invaders",
    "similarity": "0.775991"
  },
  {
    "word": "werethe",
    "similarity": "0.774594"
  }
]
model unloaded.

Here is what happens in the code of FastText.nn. I first define two callbacks, onDataAppendCallback and onErrorDataAppendCallback, while onDataCallback is merely a helper function here:

var onDataCallback = function (_data) {
    var data = _data.split(/\n/);
    data = data.slice(0, -1); // remove last
    var res = [];
    if (data && data.length) {
        data.forEach(nearest => {
            var el = nearest.split(/\s/);
            if (el && el.length > 1) { // two fields
                res.push({
                    word: el[0],
                    similarity: el[1]
                });
            }
        });
    }
    return resolve(res);
};
var onDataAppendCallback = function (_data) {
    res += _data;
    if (res.indexOf('Query word?') > -1) { // end query
        return onDataCallback(res);
    }
};
var onErrorDataAppendCallback = function (data) {
    if (self._options.debug) console.log(data);
};

As you can see, onDataAppendCallback calls onDataCallback once the output that the fasttext executable writes to stdout contains the text 'Query word?'. Until then I'm just appending the output to the variable res. Could you please check whether it happens exactly the same way in your case? If it breaks after just one nearest neighbor, that could be where the error is. Thank you for your help!
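To make the append-until-prompt idea concrete, here is a minimal standalone sketch. It is not the fasttext.js code: createNeighborCollector, onDone and the simulated chunks are hypothetical names and data chosen for illustration, and the only assumption taken from this thread is that the output ends with a 'Query word?' prompt.

```javascript
// Sketch only: createNeighborCollector and onDone are hypothetical names,
// not part of fasttext.js.
function createNeighborCollector(onDone, prompt = 'Query word?') {
  let buffer = '';
  let done = false;
  return function onChunk(chunk) {
    if (done) return;                  // ignore anything after the prompt
    buffer += chunk;                   // keep appending partial output
    const end = buffer.indexOf(prompt);
    if (end === -1) return;            // prompt not seen yet: keep buffering
    done = true;
    const neighbors = buffer
      .slice(0, end)                   // everything before the prompt
      .split('\n')
      .map(line => line.trim().split(/\s+/))
      .filter(fields => fields.length === 2) // keep "word similarity" lines
      .map(([word, similarity]) => ({ word, similarity }));
    onDone(neighbors);
  };
}

// Simulated stdout, deliberately split in the middle of a line.
const collected = [];
const onChunk = createNeighborCollector(result => collected.push(result));
['huai 0.869137\nbo', 'y 0.858461\nave 0.84622\n', 'Query word? ']
  .forEach(chunk => onChunk(chunk));
console.log(collected[0].length); // 3
```

The collector fires onDone exactly once, and only after the prompt has arrived, so a line split across two chunks is still parsed as one entry.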

Zorkling commented 6 years ago

Right, however there are no calls to onDataAppendCallback. The next lines after the definition of onErrorDataAppendCallback you referenced above are:

self.dataAppendCallback = onDataCallback;
self.onErrorDataAppendCallback = onErrorDataAppendCallback;

My understanding is that self.dataAppendCallback is the trigger that is intended to resolve the function. However, it is bound to onDataCallback instead of onDataAppendCallback.
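A small simulation of why that binding could produce a single result, assuming stdout is delivered in more than one chunk (an assumption, not something confirmed in this thread). parseNeighbors follows the parsing steps of the helper quoted above; simulate and its arguments are hypothetical names, and the settled variable only mimics a promise where the first resolve wins.

```javascript
// Same parsing steps as the helper quoted in this thread.
function parseNeighbors(raw) {
  return raw
    .split(/\n/)
    .slice(0, -1) // drop the last line
    .map(line => line.split(/\s/))
    .filter(el => el.length > 1)
    .map(el => ({ word: el[0], similarity: el[1] }));
}

// simulate() is a hypothetical harness, not fasttext.js code.
function simulate(chunks, bindParserDirectly) {
  let settled; // mimics a Promise: only the first resolve counts
  const resolve = value => { if (settled === undefined) settled = value; };
  let buffer = '';
  const parseAndResolve = data => resolve(parseNeighbors(data));
  const appendThenParse = data => {
    buffer += data;
    if (buffer.indexOf('Query word?') > -1) parseAndResolve(buffer);
  };
  const stdoutHandler = bindParserDirectly ? parseAndResolve : appendThenParse;
  chunks.forEach(chunk => stdoutHandler(chunk));
  return settled;
}

// Assumed chunking: the first neighbor arrives alone, the rest follow.
const chunks = ['huai 0.869137\n', 'boy 0.858461\nave 0.84622\nQuery word? '];
const direct = simulate(chunks, true);    // parser bound to every chunk
const buffered = simulate(chunks, false); // append first, parse at the prompt
console.log(direct.length, buffered.length); // 1 3
```

If the whole output happens to arrive in one chunk, both bindings return the same list, which would fit the behaviour being intermittent locally.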

loretoparisi commented 6 years ago

@Zorkling basically in this specific case I'm using the onDataAppendCallback as the stdout callback to append data:

var onDataAppendCallback = function (_data) {
    res += _data;
    if (res.indexOf('Query word?') > -1) { // end query
        return onDataCallback(res);
    }
};

while onDataCallback is the helper function that parses the accumulated string once the exit condition is met. The exit condition works because FastText writes the terms to stdout and, at the end, a prompt to the user. Sorry for the naming confusion, I think that was the error in this case, while in other methods the callbacks are used exactly as you said :)

loretoparisi commented 6 years ago

Closing this because it is too old. Please feel free to re-open if needed.