Closed youchan closed 6 years ago
@kou Thank you for your cooperation.
vocabulary is necessary for word2vec. On reflection, I think it is better to keep the vocabulary on the client side.
Do we need a {word1 => id1, word2 => id2, ...} data structure?
We should consider a generic API for this data structure that can be used across many datasets.
For example, we already have the to_table API. Users can get data as a {:word => [word1, word2, ...], :id => [id1, id2, ...]} (columnar) structure.
We can consider a {word1 => id1, word2 => id2, ...} data structure if we turn out to need it.
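As a rough illustration of why the columnar form may already be enough, a client could derive the {word => id} mapping from the two columns itself. This is only a sketch: the plain Hash below is a stand-in for whatever to_table returns, the sample words and IDs are made up, and word_to_id is a hypothetical helper name, not part of red-datasets.

```ruby
# Stand-in for the columnar structure described above.
# The words and IDs are invented for illustration.
table = {
  word: ["the", "cat", "sat", "the"],
  id:   [0, 1, 2, 0],
}

# Hypothetical helper: build {word => id} from two parallel columns.
# A word that appears more than once just overwrites itself with the same ID.
def word_to_id(table)
  table[:word].zip(table[:id]).to_h
end

vocabulary = word_to_id(table)
p vocabulary
```

Because the mapping is a one-line transformation of the columns, keeping it on the client side costs very little.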
@youchan Do you have any other concerns for my changes?
I have one concern: what markup should we use for dataset descriptions? It should be handled as a new issue.
We need 2 data structures.
[id1, id2, ...]
description from chainer/datasets
def get_ptb_words():
    """Gets the Penn Tree Bank dataset as long word sequences.

    `Penn Tree Bank <https://www.cis.upenn.edu/~treebank/>`_ is originally a
    corpus of English sentences with linguistic structure annotations. This
    function uses a variant distributed at
    `https://github.com/wojzaremba/lstm <https://github.com/wojzaremba/lstm>`_,
    which omits the annotation and splits the dataset into three parts:
    training, validation, and test.

    This function returns the training, validation, and test sets, each of
    which is represented as a long array of word IDs. All sentences in the
    dataset are concatenated by End-of-Sentence mark '<eos>', which is treated
    as one of the vocabulary.

    Returns:
        tuple of numpy.ndarray: Int32 vectors of word IDs.

    .. Seealso::
       Use :func:`get_ptb_words_vocabulary` to get the mapping between the
       words and word IDs.

    """


def get_ptb_words_vocabulary():
    """Gets the Penn Tree Bank word vocabulary.

    Returns:
        dict: Dictionary that maps words to corresponding word IDs. The IDs are
        used in the Penn Tree Bank long sequence datasets.

    .. seealso::
       See :func:`get_ptb_words` for the actual datasets.

    """
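The "long array of word IDs" described in the docstring can be pictured with a small Ruby sketch. It is not Chainer's or red-datasets' implementation: encode_sentences is a hypothetical helper, the sentences are made up, and assigning IDs in order of first appearance is an assumption made only for this example.

```ruby
EOS = "<eos>"

# Hypothetical helper: concatenate sentences into one long ID sequence,
# appending the end-of-sentence mark after each sentence and treating it
# as an ordinary vocabulary entry.
# Assumption: IDs are handed out in order of first appearance.
def encode_sentences(sentences)
  vocabulary = {}
  ids = []
  sentences.each do |sentence|
    (sentence.split + [EOS]).each do |word|
      vocabulary[word] = vocabulary.size unless vocabulary.key?(word)
      ids << vocabulary[word]
    end
  end
  [ids, vocabulary]
end

ids, vocabulary = encode_sentences(["the cat sat", "the dog sat"])
p ids
p vocabulary
```

The two return values correspond to the two things the docstrings describe: the ID sequence and the word-to-ID vocabulary.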
Thanks.
full words as id list
[id1, id2, ...]
We can get it by ptb.to_table[:id].
vocabulary (word list not duplicated and convertible from id)
We don't have an API for it. We should handle it as a new issue.
I'll merge this pull request for now. Thanks for your work!
word IDs will be treated as a matrix (Numo/NArray), so we need a conversion table from ID to word.
We also need the vocabulary size (the number of distinct words).
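Both needs can be met on the client side from the word and ID columns. The following is a minimal sketch, assuming IDs are contiguous non-negative integers starting at 0; build_id_lookup is a hypothetical name, the sample data is made up, and plain Ruby arrays are used in place of Numo/NArray to keep the example dependency-free.

```ruby
# Hypothetical helper: returns [id_to_word, vocabulary_size] built from
# parallel word/ID columns.
# Assumption: IDs are contiguous integers starting at 0, so an Array
# indexed by ID works as the conversion table.
def build_id_lookup(words, ids)
  id_to_word = []
  words.zip(ids) { |word, id| id_to_word[id] = word }
  [id_to_word, id_to_word.size]
end

words = ["the", "cat", "sat", "<eos>", "the"]
ids   = [0, 1, 2, 3, 0]

id_to_word, vocabulary_size = build_id_lookup(words, ids)

# Convert an ID sequence back to words with the lookup table.
decoded = ids.map { |id| id_to_word[id] }
p vocabulary_size
p decoded
```

If the IDs were not contiguous, a Hash keyed by ID would be the safer conversion table.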
Thank you! I think it is fine for the vocabulary to be handled by the client.
description from chainer/datasets
I see why you use reStructuredText. :-)
https://www.cis.upenn.edu/~treebank/ is a dead link. Can you report it to Chainer developers?
I've created #22. If we don't need to handle the use case in red-datasets, we can just close #22.
Penn Tree Bank <https://www.cis.upenn.edu/~treebank/> is originally a corpus of English sentences with linguistic structure annotations. This function uses a variant distributed at https://github.com/wojzaremba/lstm, which omits the annotation and splits the dataset into three parts: training, validation, and test.
c.f. https://github.com/chainer/chainer/blob/master/chainer/datasets/ptb.py