nelson-liu / paraphrase-id-tensorflow

Various models and code (Manhattan LSTM, Siamese LSTM + Matching Layer, BiMPM) for the paraphrase identification task, specifically with the Quora Question Pairs dataset.
MIT License

Several questions about the code #69

Open p-null opened 6 years ago

p-null commented 6 years ago

Hi, I appreciate your sharing this project! It is very thoughtful work and friendly to newcomers! I have some questions from reading the code.

1. In dataset.py, instead of writing:

new_instances = [i for i in self.instances]
return self.__class__(new_instances[:max_instances])

Why don't you write the following instead?

    return self.__class__(self.instances[:max_instances])

2. At dataset.py line 180:

        label_counts = [(label, len([x for x in group]))
                        for label, group
                        in itertools.groupby(labels, lambda x: x[0])]
        label_count_str = str(label_counts)
        if len(label_count_str) > 100:
            label_count_str = label_count_str[:100] + '...'
        logger.info("Finished reading dataset; label counts: %s",
                    label_count_str)

Why do you do the above processing? Correct me if I am wrong (fake example):

>>> label_counts
[(0, 300), (1, 300), (2, 300)]
>>> label_count_str
'[(0, 300), (1, 300), (2, 300)]'

Here 300 is the number of times a specific label appears. I don't understand what all these steps are for. How does this calculate label counts?

3. In sts_instance.py line 108:

Instead of writing: fields = list(csv.reader([line]))[0]

Why not write: fields = [x for x in csv.reader([line])]

I ran a trial:

>>> field = list(csv.reader(['asdf','fddf','ddd']))
>>> field
[['asdf'], ['fddf'], ['ddd']]
>>> field[0]
['asdf']

It seems list(csv.reader(['asdf','fddf','ddd']))[0] can't be used to get the number of fields? Correct me if I am wrong.

4. Why create character-level dictionaries/instances? When you tokenize the string 'Today has a good weather' at the character level, the result would be ['T','o','d','a','y','h','a', ...]. It seems character-level tokens can't encode contextual information? Could you tell me about some cases where character-level tokenization makes a difference?

Thank you!

nelson-liu commented 6 years ago
  1. return self.__class__(self.instances[:max_instances])

Good question, I think that would work as well! I'm not too sure why I wrote it the first way, sorry :)
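
For what it's worth, a quick check (an illustrative snippet, not code from the repo) confirms that slicing already returns a new list, so the extra copy is redundant:

    instances = ["a", "b", "c", "d"]

    # Slicing creates a new list object; mutating the slice
    # leaves the original list untouched.
    truncated = instances[:2]
    truncated.append("z")

    print(truncated)               # ['a', 'b', 'z']
    print(instances)               # ['a', 'b', 'c', 'd']
    print(truncated is instances)  # False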

  2. label counts

This code doesn't actually do anything functional with the label counts; it's just used to print the approximate count of each label to the console.
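
For illustration, here is a standalone sketch of what that snippet computes (toy data, not the repo's; it assumes labels is a sorted list of label tuples, which is what the x[0] key suggests):

    import itertools

    # Sorted labels, so groupby sees consecutive runs of each label.
    labels = [("0",), ("0",), ("0",), ("1",), ("1",)]

    label_counts = [(label, len([x for x in group]))
                    for label, group
                    in itertools.groupby(labels, lambda x: x[0])]

    print(label_counts)  # [('0', 3), ('1', 2)]

Note that itertools.groupby only groups adjacent elements, which is why sorting matters; collections.Counter would be a one-line alternative for the same tally.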

  3. I think you got the input format wrong. csv.reader takes an iterable of lines and parses each element as one CSV row, so it should be like this:
In [1]: import csv

In [2]: field = list(csv.reader(['asdf,fddf,ddd']))[0]

In [3]: field
Out[3]: ['asdf', 'fddf', 'ddd']

In [4]: fields = [x for x in csv.reader(['asdf,fddf,ddd'])]

In [5]: fields
Out[5]: [['asdf', 'fddf', 'ddd']]
  4. There's a pretty rich literature on using character-level models to get rid of out-of-vocabulary words / get meaningful representations for novel or rare words; see the small sketch after the paper list below. If you're interested in this, these papers might give a good introduction:

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

Mimicking Word Embeddings using Subword RNNs
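
To make the out-of-vocabulary point concrete, here's a toy sketch (the vocabularies and the @@UNKNOWN@@ token are hypothetical, not this repo's actual data structures). A word-level lookup collapses an unseen word to a single unknown id, while a character-level lookup still yields an informative index sequence:

    # Hypothetical word- and character-level vocabularies.
    word_index = {"@@UNKNOWN@@": 1, "today": 2, "has": 3, "good": 4, "weather": 5}
    char_index = {c: i + 2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

    def index_words(tokens):
        # Any unseen word maps to the single unknown id, losing all information.
        return [word_index.get(t, word_index["@@UNKNOWN@@"]) for t in tokens]

    def index_chars(token):
        # Every character of an unseen word is still representable.
        return [char_index[c] for c in token]

    print(index_words(["today", "has", "weathery"]))  # [2, 3, 1]  <- 'weathery' is OOV
    print(index_chars("weathery"))                    # [24, 6, 2, 21, 9, 6, 19, 26]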

hope that helps!

p-null commented 6 years ago

Thank you for replying! I still have some questions, and I will keep updating as I go through the code. :)

1. In instance.py line 80:

    def _index_text(self, text, data_indexer):
        """
        This function uses a Tokenizer and an input DataIndexer to convert a
        string into a list of integers representing the word indices according
        to the DataIndexer.

        Parameters
        ----------
        text: str
            The label encodes the ground truth label of the Instance.
            This encoding varies across tasks and instances.

        Returns
        -------
        index_list: List[int]
           A list of the words converted to indices, as tokenized by the
           TextInstance's tokenizer and indexed by the DataIndexer.

        """
        return self.tokenizer.index_text(text, data_indexer)

1). I thought the text here is the tokenized sentence, not the ground truth label written in the docstring?

2). Why not directly use self.tokenizer.index_text() instead of self._index_text()?

  2. Same question with _words_from_text():

    def _words_from_text(self, text):
        """
        This function uses a Tokenizer to output a
        list of the words in the input string.
    
        Parameters
        ----------
        text: str
            The label encodes the ground truth label of the Instance.
            This encoding varies across tasks and instances.
    
        Returns
        -------
        word_list: List[str]
           A list of the words, as tokenized by the
           TextInstance's tokenizer.
        """
        return self.tokenizer.get_words_for_indexer(text)

    Why not use self.tokenizer.get_words_for_indexer(text) directly? _words_from_text(self, text) is defined, but seems to never be used?

  3. In data_indexer.py, why does add_word_to_index() return the index of the word? The only use of this function is to add a word to the vocabulary (word_indices); it seems unnecessary to return the index?

  4. In instance.py, in the pad_sequence_to_length() comment:

        The reason we truncate from the right by default is for
        cases that are questions, with long set ups. We at least want to get
        the question encoded, which is always at the end

If you want to encode the question, which is at the end of a sequence, shouldn't you truncate from the left?

nelson-liu commented 6 years ago

1). I thought the text here is the tokenized sentence, not the ground truth label written in the docstring?

Whoops, you're right. Looks like the docstring is wrong, sorry about that.

2). Why not directly use self.tokenizer.index_text() instead of self._index_text()?

The base Instance class defines a self._index_text(), and I thought it would be better to keep a consistent API throughout. Some instances might have more elaborate definitions of this function.
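
To illustrate the pattern (hypothetical classes, not the repo's actual code), a subclass can override the wrapper while callers keep using the same method name:

    class TextInstance:
        def __init__(self, tokenizer):
            self.tokenizer = tokenizer

        def _index_text(self, text, data_indexer):
            # Default behavior: delegate straight to the tokenizer.
            return self.tokenizer.index_text(text, data_indexer)

    class PairInstance(TextInstance):
        def _index_text(self, text, data_indexer):
            # A subclass can layer extra logic on top (here, a made-up
            # length cap) without changing the calling convention.
            indices = super()._index_text(text, data_indexer)
            return indices[:100]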

3). _words_from_text(self, text) is defined, but seems to never be used?

I took the Instance API from https://github.com/allenai/deep_qa; I think that's part of the API there.

4) If you want to encode the question, which is at the end of a sequence, shouldn't you truncate from the left?

This was another copy-paste error from deep_qa, sorry. I think truncating from the left makes more sense for QA, but in this code I preferred to truncate from the right, since I suspect the subject / salient information of the sentence is closer to the front. This is just my intuition, though.
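
For reference, a small sketch of the two truncation directions (illustrative only, on a plain Python list of tokens):

    def truncate(sequence, max_length, truncate_from="right"):
        # Keep at most max_length items, dropping from the chosen side.
        if truncate_from == "right":
            return sequence[:max_length]   # keep the front of the sentence
        return sequence[-max_length:]      # keep the end (e.g. a trailing question)

    tokens = ["long", "setup", "...", "what", "is", "X", "?"]
    print(truncate(tokens, 4, "right"))  # ['long', 'setup', '...', 'what']
    print(truncate(tokens, 4, "left"))   # ['what', 'is', 'X', '?']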

p-null commented 6 years ago

Hi, Nelson!

1. There are always return values at two different levels, like:

    return (word_indexed_text, character_indexed_text)

    return {"words": words, "characters": characters}

When do you choose to return a dictionary versus a tuple?

2. In sts_instance.py:

            return ((first_sentence_word_array, first_sentence_char_matrix,
                     second_sentence_word_array, second_sentence_char_matrix),
                    (np.asarray(self.label),))

Why do you return (np.asarray(self.label),) instead of np.asarray(self.label)?