microsoft / NimbusML

Python machine learning package providing simple interoperability between ML.NET and scikit-learn components.

Improve documentation or API for WordEmbedding with NGramFeaturizer #30

Closed (montebhoover closed 5 years ago)

montebhoover commented 6 years ago

How would a user know to append the '_TransformedText' suffix to the column name when using NGramFeaturizer together with WordEmbedding?

See WordEmbedding(columns='features_TransformedText') in the example below:

# WordEmbedding: pre-trained transform to generate word embeddings

from microsoftml_scikit import FileDataStream, Pipeline
from microsoftml_scikit.datasets import get_dataset
from microsoftml_scikit.feature_extraction.text import NGramFeaturizer
from microsoftml_scikit.internal.entrypoints._ngramextractor_ngram import n_gram
from microsoftml_scikit.feature_extraction.text import WordEmbedding

# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()

# TODO: Replace with auto-inference
file_schema = ('sep=, col=id:TX:0 col=education:TX:1 col=age:R4:2 '
               'col=parity:R4:3 col=induced:R4:4 col=case:R4:5 '
               'col=spontaneous:R4:6 header=+')
data = FileDataStream(path, schema=file_schema)

# transform usage
# TODO: Bug 146763
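pipeline = Pipeline([
    # output_tokens=True also emits the tokenized text; in this example it
    # lands in a column named '<output column>_TransformedText'.
    NGramFeaturizer(word_feature_extractor=n_gram(), output_tokens=True,
                    columns={'features': ['id', 'education']}),

    # WordEmbedding has to consume the token column, not 'features' itself.
    WordEmbedding(columns='features_TransformedText')
])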

# fit and transform
features = pipeline.fit_transform(data)

# print features
print(features.head())

Originally noted by abgoswam here: https://msdata.visualstudio.com/AlgorithmsAndDataScience/_workitems/edit/149666
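
To make the naming convention from the example explicit, here is a minimal sketch of a helper that derives the token column name from the `columns` mapping passed to NGramFeaturizer. It assumes the suffix is always '_TransformedText', as in the example above. `TOKEN_SUFFIX` and `token_columns` are hypothetical names used for illustration; they are not part of the package.

```python
# Assumption: with output_tokens=True, the token column is named
# '<output column>' + '_TransformedText', as seen in the example above.
TOKEN_SUFFIX = '_TransformedText'


def token_columns(columns):
    """Return the assumed token column name for each featurizer output.

    `columns` is the same mapping passed to NGramFeaturizer:
    output column name -> list of input column names.
    """
    return [output_name + TOKEN_SUFFIX for output_name in columns]


# Hypothetical usage with the mapping from the example above.
print(token_columns({'features': ['id', 'education']}))
# prints ['features_TransformedText']
```

The returned name is what would be passed to WordEmbedding(columns=...) in the pipeline, so the suffix is defined in one place and not repeated as a string literal.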

ganik commented 5 years ago

This is no longer an issue: output_tokens was removed and output_token_column_name was introduced in ML.NET v1.1.