montebhoover closed this issue 5 years ago
How would a user know that the '_TransformedText' suffix has to be added to the column name when using NGramFeaturizer + WordEmbedding?
See WordEmbedding(columns='features_TransformedText') in the example below:
# WordEmbedding: pre-trained transform to generate word embeddings
from microsoftml_scikit import FileDataStream, Pipeline
from microsoftml_scikit.datasets import get_dataset
from microsoftml_scikit.feature_extraction.text import NGramFeaturizer
from microsoftml_scikit.internal.entrypoints._ngramextractor_ngram import n_gram
from microsoftml_scikit.feature_extraction.text import WordEmbedding

# data input (as a FileDataStream)
path = get_dataset('infert').as_filepath()

# TODO: Replace with auto-inference
file_schema = 'sep=, col=id:TX:0 col=education:TX:1 col=age:R4:2 col=parity:R4:3 col=induced:R4:4 col=case:R4:5 col=spontaneous:R4:6 header=+'
data = FileDataStream(path, schema=file_schema)

# transform usage
# TODO: Bug 146763
pipeline = Pipeline([
    NGramFeaturizer(word_feature_extractor=n_gram(),
                    output_tokens=True,
                    columns={'features': ['id', 'education']}),
    WordEmbedding(columns='features_TransformedText')
])

# fit and transform
features = pipeline.fit_transform(data)

# print features
print(features.head())
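Judging only from the example above, the tokens column seems to be named by appending '_TransformedText' to the NGramFeaturizer output column ('features'). The sketch below spells out that assumed convention. `token_column_name` and `TOKEN_SUFFIX` are hypothetical names made up for illustration; they are not part of microsoftml_scikit, and the library may derive the name differently.

```python
# Hypothetical helper, NOT part of microsoftml_scikit: it only mirrors the
# naming pattern visible in the example ('features' -> 'features_TransformedText').
TOKEN_SUFFIX = '_TransformedText'


def token_column_name(output_column):
    """Return the assumed tokens-column name for an NGramFeaturizer output column."""
    return output_column + TOKEN_SUFFIX


# The NGramFeaturizer in the example writes to 'features', so under this
# assumption WordEmbedding has to read from the suffixed column.
print(token_column_name('features'))
```

A helper like this would at least keep the suffix in one place, but it does not solve the discoverability problem raised in this issue: the user still has to know the suffix exists.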
Originally noted by abgoswam here: https://msdata.visualstudio.com/AlgorithmsAndDataScience/_workitems/edit/149666
This is no longer an issue: output_tokens was removed and output_token_column_name was introduced in ML.NET v1.1.