retextjs / retext

natural language processor powered by plugins; part of the @unifiedjs collective
https://unifiedjs.com
MIT License
2.36k stars, 93 forks

Cannot figure out how to modify an existing tree #33

Closed thom4parisot closed 8 years ago

thom4parisot commented 8 years ago

Hello,

Thanks for retext, it is very elegant to use! I have managed to create a plugin which filters a tree, but the next plugin in the chain does not seem to pick up the changes made by the previous plugin.

Here is my high level code:

    retext()
      .use(retextStopwords, { stopwords: stopwordsFr })
      .use(retextKeywords)
      .process(text, (err, file) => {
        if (err) {
          return reject(err);
        }

        resolve({
          keywords: file.namespace('retext').keywords,
          keyphrases: file.namespace('retext').keyphrases
        });
      });

retextKeywords is the retext-keywords plugin, but because it does not catch French stopwords, I need to modify the tree to remove them. Here is the code of retextStopwords:

'use strict';

const filter = require('unist-util-filter');

module.exports = (retext, options) => {
  const stopwords = options.stopwords;

  return (node, file, next) => {
    const tree = filter(node, node => {
      return !(node.type === 'TextNode' && stopwords.indexOf(node.value) > -1);
    });

    file.namespace('retext').tree = tree;

    next(null, tree, file);
  };
};

When I check tree, it indeed no longer contains the TextNodes I wanted to remove. But these words are still taken into account by retextKeywords, which is next in the use() chain.

Any tips or hints on how to achieve this?

Thanks a lot :-)

wooorm commented 8 years ago

Yes, the Root node cannot be changed: it must always be the same object. You cannot change it by overwriting file.namespace('retext').tree.

You can however change all children in the Root in any way you like!
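
A minimal sketch of that distinction, using a hand-built, simplified nlcst-like tree rather than real retext output. The names `removeInPlace` and `stripStopwords` are hypothetical helpers made up for illustration, not part of retext or any unist utility. The transformer keeps the root it was given and only changes `children` arrays beneath it, instead of building and returning a new tree:

```javascript
// Hypothetical helper: drops matching descendants by reassigning the
// `children` arrays, so the object passed in as `parent` is never replaced.
function removeInPlace(parent, shouldRemove) {
  if (!Array.isArray(parent.children)) return;
  parent.children = parent.children.filter(child => !shouldRemove(child));
  parent.children.forEach(child => removeInPlace(child, shouldRemove));
}

// Hypothetical transformer body: mutates the tree it receives and
// returns nothing, so later plugins see the same (modified) root.
function stripStopwords(tree, stopwords) {
  removeInPlace(tree, node =>
    node.type === 'WordNode' &&
    node.children.length === 1 &&
    stopwords.indexOf(node.children[0].value) > -1
  );
}

// Simplified stand-in for a parsed tree (no position info).
const root = {
  type: 'RootNode',
  children: [
    {type: 'WordNode', children: [{type: 'TextNode', value: 'le'}]},
    {type: 'WhiteSpaceNode', value: ' '},
    {type: 'WordNode', children: [{type: 'TextNode', value: 'chat'}]}
  ]
};

stripStopwords(root, ['le', 'la', 'les']);

console.log(root.children.map(node => node.type));
```

Note that this sketch removes the whole WordNode rather than only its TextNode, which avoids leaving empty WordNodes behind; whether that suits a given downstream plugin is an assumption to verify.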

thom4parisot commented 8 years ago

Understood!

The problem is that when I change the children, it seems to break the phrase detection, I guess because the positions have changed?

'use strict';

const visit = require('unist-util-visit');

module.exports = (retext, options) => {
  const stopwords = options.stopwords;

  return (node, file) => {
    visit(node, 'WordNode', node => {
      node.children = node.children.filter(d => {
        return stopwords.indexOf(d.value) === -1;
      });
    });
  };
};

Does it mean I have to also iterate over the next sibling position and update the WordNode position start/end as well?

wooorm commented 8 years ago

Pfew; quite the problem.

First off: no, start and end are not used in retext-keywords, changing those shouldn’t fix/break anything.

Now, retext-keywords depends on retext-pos. With your code, the transformers run in order as expected: retext-stopwords, retext-pos, and retext-keywords.

However, I just now noticed you were talking about French. I think therein lies the problem: retext-pos caters especially to English, and only words with certain part-of-speech classifications are eligible for inclusion in the results. retext-keywords does not use stopwords, just POS tags.

As a consequence, I cannot come up with a solution for this other than a) create a French JavaScript POS tagger (extremely hard), or b) fork retext-keywords to also support words without POS tags and not occurring in a configurable list of stop-words (and with forking I mean I’ll accept it back into upstream if you’d PR).
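
To make option b) more concrete, here is a rough sketch of keyword counting that relies only on a configurable stopword list and ignores POS tags entirely. This is not how retext-keywords works internally; `countKeywords` and the `word` helper are hypothetical names, and the tree is a hand-built, simplified nlcst-like structure:

```javascript
// Hypothetical: counts every WordNode whose text is not in `stopwords`.
// No POS tags are consulted, so this works for any language at the cost
// of precision.
function countKeywords(tree, stopwords) {
  const counts = new Map();

  (function walk(node) {
    if (node.type === 'WordNode') {
      // Join the text of the word's children and normalise case.
      const value = node.children
        .map(child => child.value || '')
        .join('')
        .toLowerCase();
      if (value && stopwords.indexOf(value) === -1) {
        counts.set(value, (counts.get(value) || 0) + 1);
      }
      return;
    }
    (node.children || []).forEach(walk);
  })(tree);

  // Most frequent words first.
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Small helper to build a WordNode for the example tree.
const word = value => ({
  type: 'WordNode',
  children: [{type: 'TextNode', value}]
});

const sentence = {
  type: 'RootNode',
  children: ['le', 'chat', 'mange', 'le', 'poisson', 'du', 'chat'].map(word)
};

const keywords = countKeywords(sentence, ['le', 'du']);
console.log(keywords);
```

A real change to retext-keywords would also need to handle keyphrases and scoring, which this sketch leaves out.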

I’m currently not in a position to dig in myself, but if you’re interested in working on this I can definitely advise and help out: it’s been a while since I touched the code though!

thom4parisot commented 8 years ago

So if I understand correctly, the best solution would be to implement the stopwords directly in retext-keywords, correct?

Or should I mark the stopword TextNodes as non-relevant for POS?

One thing I still do not understand, though, is why retext-keywords (retext-pos) breaks when I alter the tree.