red-data-tools / red-datasets

A RubyGem that provides common datasets
MIT License

Add Penn Treebank #20

Closed youchan closed 6 years ago

youchan commented 6 years ago

Penn Tree Bank (https://www.cis.upenn.edu/~treebank/) is originally a corpus of English sentences with linguistic structure annotations. This function uses a variant distributed at https://github.com/wojzaremba/lstm, which omits the annotation and splits the dataset into three parts: training, validation, and test.

c.f. https://github.com/chainer/chainer/blob/master/chainer/datasets/ptb.py

youchan commented 6 years ago

@kou Thank you for your cooperation.

youchan commented 6 years ago

A vocabulary is necessary for word2vec.

youchan commented 6 years ago

On second thought, I think it is better to store the vocabulary on the client side.

kou commented 6 years ago

Do we need a {word1 => id1, word2 => id2, ...} data structure? We should consider a generic API for that data structure that can be used with many datasets.

kou commented 6 years ago

For example, we already have the to_table API. Users can get the data as a columnar {:word => [word1, word2, ...], :id => [id1, id2, ...]} structure. We can consider a {word1 => id1, word2 => id2, ...} structure if we find we need it.
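The columnar structure kou describes can be turned into the word-to-ID mapping on the client side with a couple of lines of plain Ruby. A minimal sketch, assuming the column keys :word and :id from his example (the sample data here is hypothetical):

```ruby
# Columnar structure in the shape returned by a to_table-style API
# (keys :word and :id follow kou's example; values are made up).
table = {
  word: ["aer", "banknote", "berlitz"],
  id:   [0, 1, 2],
}

# Zip the two columns together to get the {word1 => id1, ...} mapping.
word_to_id = table[:word].zip(table[:id]).to_h
# => {"aer" => 0, "banknote" => 1, "berlitz" => 2}
```

This keeps the dataset API generic (columnar) while letting each client derive the mapping it needs.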

kou commented 6 years ago

@youchan Do you have any other concerns for my changes?

kou commented 6 years ago

I have one concern: what markup should we use for dataset descriptions? It should be handled as a new issue.

youchan commented 6 years ago

We need 2 data structures.

  1. the full word sequence as an ID list [id1, id2, ...]
  2. the vocabulary (a deduplicated word list, convertible from IDs)

youchan commented 6 years ago

description from chainer/datasets

def get_ptb_words():
    """Gets the Penn Tree Bank dataset as long word sequences.

    `Penn Tree Bank <https://www.cis.upenn.edu/~treebank/>`_ is originally a
    corpus of English sentences with linguistic structure annotations. This
    function uses a variant distributed at
    `https://github.com/wojzaremba/lstm <https://github.com/wojzaremba/lstm>`_,
    which omits the annotation and splits the dataset into three parts:
    training, validation, and test.

    This function returns the training, validation, and test sets, each of
    which is represented as a long array of word IDs. All sentences in the
    dataset are concatenated by End-of-Sentence mark '<eos>', which is treated
    as one of the vocabulary.

    Returns:
        tuple of numpy.ndarray: Int32 vectors of word IDs.

    .. Seealso::
       Use :func:`get_ptb_words_vocabulary` to get the mapping between the
       words and word IDs.

    """
def get_ptb_words_vocabulary():
    """Gets the Penn Tree Bank word vocabulary.

    Returns:
        dict: Dictionary that maps words to corresponding word IDs. The IDs are
        used in the Penn Tree Bank long sequence datasets.

    .. seealso::
       See :func:`get_ptb_words` for the actual datasets.

    """
kou commented 6 years ago

Thanks.

full words as id list [id1, id2, ...]

We can get it by ptb.to_table[:id].

vocabulary (a deduplicated word list, convertible from IDs)

We don't have an API for it yet. We should handle it as a new issue.

I'll merge this pull request for now. Thanks for your work!

youchan commented 6 years ago

Word IDs will be treated as a matrix (Numo::NArray). So we need a conversion table from ID back to word, and we also need the vocabulary size.
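The ID-to-word conversion table and the vocabulary size can both be derived on the client from the word-to-ID hash. A minimal sketch with a hypothetical vocabulary:

```ruby
# Hypothetical word => ID vocabulary held by the client.
word_to_id = {"aer" => 0, "banknote" => 1, "<eos>" => 2}

# Conversion table from ID back to word, for decoding
# entries of an ID matrix (e.g. a Numo::NArray).
id_to_word = word_to_id.invert
id_to_word[2] # => "<eos>"

# The vocabulary size, e.g. for sizing an embedding matrix.
n_vocabulary = word_to_id.size
# => 3
```

Hash#invert is safe here because the word-to-ID mapping is one-to-one.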

youchan commented 6 years ago

Thank you! I think the vocabulary is better handled by the client.

kou commented 6 years ago

description from chainer/datasets

I see why you use reStructuredText. :-)

https://www.cis.upenn.edu/~treebank/ is a dead link. Can you report it to the Chainer developers?

kou commented 6 years ago

I've created #22. If we don't need to handle the use case in red-datasets, we can just close #22.