Need a find and replace function for TermDocumentMatrix

Synaps3 commented 7 years ago

The function should take a TermDocumentMatrix and a list of lists (or similar objects), where each inner list gives the replacement word followed by all the terms it should replace.

For example, we might use [["Tuberculosis", "TB", "the disease"], ["BCG", "Bacillus Calmette–Guérin", "Guerin"]] to find every occurrence of "TB" or "the disease" and replace it with "Tuberculosis", and likewise replace mentions of "Bacillus Calmette–Guérin" or "Guerin" with the acronym "BCG".

The output is a new temporary TDM with the terms replaced as if all mentions of TB had been "Tuberculosis" originally.
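A minimal sketch of what such a function might look like, assuming the TDM has first been converted to a dense term-by-document count matrix (for example with `as.matrix()`). The name `merge_terms` and the toy matrix are hypothetical, not existing package code. Terms are written in lowercase here because `tm` lowercases terms by default when it builds a TDM.

```r
# Hypothetical helper: fold the rows of replaced terms into the row of
# their replacement word. `m` is a dense matrix with terms as row names
# and documents as column names. Each entry of `replacements` lists the
# replacement word first, followed by the terms to be replaced.
merge_terms <- function(m, replacements) {
  for (entry in replacements) {
    target <- entry[[1]]
    # Every spelling (replacement word included) that occurs as a row
    present <- intersect(unlist(entry), rownames(m))
    if (length(present) == 0) next
    # Add up the counts of all matching rows, document by document
    merged <- colSums(m[present, , drop = FALSE])
    # Drop the old rows and append one combined row under the target name
    m <- m[!(rownames(m) %in% present), , drop = FALSE]
    m <- rbind(m, matrix(merged, nrow = 1,
                         dimnames = list(target, names(merged))))
  }
  m
}

# Toy counts standing in for as.matrix(tdm)
counts <- matrix(
  c(2, 0,
    1, 3,
    0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tb", "tuberculosis", "vaccine"), c("doc1", "doc2"))
)

merged <- merge_terms(counts, list(c("tuberculosis", "tb")))
```

This only works for terms that survive as single rows in the matrix, so a multi-word entry such as "the disease" would never match.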

adamlhayes commented 7 years ago

Graham - are you already working on this function? It will probably be easier to use content_transformer() in conjunction with gsub() to modify the corpus prior to stemming rather than transform the tdm (by the time it's tdm'd, phrases like "Bacillus Calmette–Guérin" would already be lost).
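A rough sketch of the corpus-level approach, assuming the replacement list format proposed above. The helper name `replace_terms` and the sample documents are hypothetical. The `tm` calls are guarded so the base R part runs on its own.

```r
# Hypothetical helper: replace each listed term with its replacement word.
# The first element of each entry is the replacement, and the remaining
# elements are the terms to be replaced.
replace_terms <- function(text, replacements) {
  for (entry in replacements) {
    target <- entry[[1]]
    for (term in unlist(entry[-1])) {
      # fixed = TRUE matches the term literally rather than as a regex.
      # Note that it also matches inside longer words (e.g. "TB" in "TBD").
      text <- gsub(term, target, text, fixed = TRUE)
    }
  }
  text
}

replacements <- list(
  c("Tuberculosis", "TB", "the disease"),
  c("BCG", "Bacillus Calmette–Guérin", "Guerin")
)

docs <- c("TB is treatable.",
          "The Bacillus Calmette–Guérin vaccine targets the disease.")
cleaned <- replace_terms(docs, replacements)

# With tm installed, the same helper can be applied to a corpus before
# stemming, and the TDM rebuilt afterwards.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  corpus <- tm::tm_map(corpus, tm::content_transformer(replace_terms),
                       replacements = replacements)
  tdm <- tm::TermDocumentMatrix(corpus)
}
```

Because the substitution runs on the raw text, multi-word phrases are still intact when they are matched. The cost is that the TDM has to be regenerated from the modified corpus.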

Synaps3 commented 7 years ago

Hey Adam,

Yeah, I think I sent you the partial code, but even if I didn't earlier, I think changing it in the corpus may be a lot easier too. There are no straightforward functions for accessing the interior of a TDM.

I was hoping to avoid regenerating the TDM itself, but that may not be too bad of a price to pay. I'll look into doing it as part of the corpus if you haven't already.

Best, Graham

On Wed, Jun 28, 2017 at 12:00 PM, adamlhayes notifications@github.com wrote:

Graham - are you already working on this function? It will probably be easier to use content_transformer() in conjunction with gsub() to modify the corpus prior to stemming rather than the tdm (we would lose phrases with the tdm).


ryscott5 / eparTextTools

Need a find and replace function for TermDocumentMatrix #1