red-data-tools / red-datasets

A RubyGem that provides common datasets
MIT License

Add Table#dictionary_encode #23

Closed by kou 5 years ago

kou commented 5 years ago

GitHub: fix #22

@youchan How about this approach?

kou commented 5 years ago
  • all words as an ID list [id1, id2, ...]

penn_treebank.table.dictionary_encode(:word).ids

  • vocabulary (deduplicated word list; each word can be looked up by its ID)

penn_treebank.table.dictionary_encode(:word).values

penn_treebank.table.dictionary_encode(:word).value(id)

And we need the vocabulary size.

penn_treebank.table.dictionary_encode(:word).size
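
To make the four calls above concrete, here is a minimal plain-Ruby sketch of what a dictionary-encoded column could look like. The class name DictionarySketch and the sample words are hypothetical, and this is not the red-datasets implementation; it only assumes that IDs are assigned in first-seen order.

```ruby
# Hypothetical stand-in for the object returned by dictionary_encode(:word).
# Assumption: IDs are assigned in the order values are first seen.
class DictionarySketch
  attr_reader :ids, :values

  def initialize(column)
    index = {} # value => ID
    # Reuse the ID of a known value, otherwise assign the next free ID.
    @ids = column.map { |value| index[value] ||= index.size }
    # Hash keys keep insertion order, so values[id] is the value for that ID.
    @values = index.keys
  end

  # Convert an ID back to its value.
  def value(id)
    @values[id]
  end

  # Number of distinct values (the vocabulary size).
  def size
    @values.size
  end
end

words = %w[the cat saw the dog]
encoded = DictionarySketch.new(words)
p encoded.ids      # [0, 1, 2, 0, 3]
p encoded.values   # ["the", "cat", "saw", "dog"]
p encoded.value(2) # "saw"
p encoded.size     # 4
```

Keeping ids and values together is what makes the round trip from ID back to word possible.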

kou commented 5 years ago

I've added Table#label_encode (like scikit-learn) and Table#dictionary_encode (like Apache Arrow). #label_encode is built on #dictionary_encode. #label_encode is useful when we only need to convert values to IDs. If we also need to convert IDs back to values, #dictionary_encode is the one to use.
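
The relationship between the two methods can be sketched in plain Ruby as follows. The helper names dictionary_encode_sketch and label_encode_sketch and the sample tags are hypothetical; this illustrates the idea under the assumption of first-seen ID order and is not the code added in this pull request.

```ruby
# Hypothetical helpers, not the red-datasets API.

# Dictionary encoding: returns the IDs together with the distinct values.
def dictionary_encode_sketch(column)
  index = {} # value => ID, assigned in first-seen order
  ids = column.map { |value| index[value] ||= index.size }
  { ids: ids, values: index.keys }
end

# Label encoding: reuses dictionary encoding and keeps only the IDs.
def label_encode_sketch(column)
  dictionary_encode_sketch(column)[:ids]
end

tags = %w[NN VB NN DT]

labels = label_encode_sketch(tags)
p labels # [0, 1, 0, 2]

# With the dictionary kept, IDs can be converted back to values.
dictionary = dictionary_encode_sketch(tags)
decoded = dictionary[:ids].map { |id| dictionary[:values][id] }
p decoded # ["NN", "VB", "NN", "DT"]
```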