p-null opened this issue 6 years ago · Status: Open
p-null:

Hi, I appreciate you sharing this project! It is very thoughtful work and friendly to newcomers! I have some questions from reading the code.

1. In dataset.py, instead of the existing truncation code, why don't you write:

```python
return self.__class__(self.instances[:max_instances])
```

2. At dataset.py line 180: why do you do this whole process? Correct me if I am wrong (FAKE EXAMPLE): 300 is the number of times a specific type of label appears. I don't understand what all these steps are for. How can this calculate the label counts?

3. In sts_instance.py line 108, instead of writing:

```python
fields = list(csv.reader([line]))[0]
```

why not write:

```python
fields = [x for x in csv.reader([line])]
```

I ran a trial: it seems list(csv.reader(['asdf', 'fddf', 'ddd']))[0] can't be used to count the number of fields? Correct me if I am wrong.

4. Why create a character-level dictionary/instance? When you tokenize the string 'Today has a good weather' at the character level, the result would be ['T', 'o', 'd', 'a', 'y', 'h', 'a', ...]. The character level seems unable to encode contextual information? Could you tell me some cases where character-level tokenization makes a difference?

Thank you!

Nelson (maintainer):
> return self.__class__(self.instances[:max_instances])

Good question, I think that would work as well! I'm not too sure why I wrote the first version that way, sorry :)
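For reference, a minimal sketch of how the one-liner fits into a dataset class (class and method names assumed from the snippet above, not taken from the repository):

```python
class Dataset:
    def __init__(self, instances):
        self.instances = instances

    def truncate(self, max_instances):
        # Slicing never raises, even if max_instances exceeds len(self.instances),
        # and self.__class__ keeps subclass types intact.
        return self.__class__(self.instances[:max_instances])
```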
> How can this calculate the label counts?

This code doesn't actually do anything functional with the label counts; it's just used to print the approximate count of each label to the console.
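For context, a rough sketch of what such console-only logging might look like (Counter-based; the actual dataset.py code may differ):

```python
from collections import Counter

def log_label_counts(instances):
    # Purely informational: tally each label and print the distribution.
    # Nothing downstream consumes these counts.
    label_counts = Counter(instance.label for instance in instances)
    for label, count in label_counts.most_common():
        print("Label {}: {} instances".format(label, count))
```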
> why not write: fields = [x for x in csv.reader([line])]

```python
In [1]: import csv

In [2]: field = list(csv.reader(['asdf,fddf,ddd']))[0]

In [3]: field
Out[3]: ['asdf', 'fddf', 'ddd']

In [4]: fields = [x for x in csv.reader(['asdf,fddf,ddd'])]

In [5]: fields
Out[5]: [['asdf', 'fddf', 'ddd']]
```

csv.reader treats each element of the list you pass it as one line of a CSV file, so indexing with [0] pulls out the parsed fields of the single line, while the list comprehension leaves them wrapped in an outer list.
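The trial in the original post passed three separate strings, and csv.reader treats each one as its own line, so it parses to three one-field rows rather than one three-field row:

```python
In [6]: list(csv.reader(['asdf', 'fddf', 'ddd']))
Out[6]: [['asdf'], ['fddf'], ['ddd']]
```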
> Could you tell me some cases where character-level tokenization makes a difference?

These papers are good starting points on character-level models:

- End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
- Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
- Mimicking Word Embeddings using Subword RNNs
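For intuition, here is a minimal sketch (not the repository's code) of character-level tokenization that keeps word boundaries, which is how character-composed word models typically consume text:

```python
sentence = 'Today has a good weather'

# Characters are grouped per word rather than flattened across the sentence,
# so a character-level CNN/RNN can compose a vector for each word. This helps
# with out-of-vocabulary words, misspellings, and shared morphology
# ('weather' vs. 'weathered'), which word-level lookup tables cannot handle.
char_tokens = [list(word) for word in sentence.split()]
print(char_tokens[0])  # ['T', 'o', 'd', 'a', 'y']
```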
hope that helps!
p-null:

Thank you for replying! I still have some questions, and I will keep updating as I go through the code. :)

1. In instance.py line 80:
```python
def _index_text(self, text, data_indexer):
    """
    This function uses a Tokenizer and an input DataIndexer to convert a
    string into a list of integers representing the word indices according
    to the DataIndexer.

    Parameters
    ----------
    text: str
        The label encodes the ground truth label of the Instance.
        This encoding varies across tasks and instances.

    Returns
    -------
    index_list: List[int]
        A list of the words converted to indices, as tokenized by the
        TextInstance's tokenizer and indexed by the DataIndexer.
    """
    return self.tokenizer.index_text(text, data_indexer)
```
1). I thought the text here is the tokenized sentence, not the ground-truth label written in the comment?

2). Why not directly use self.tokenizer.index_text() instead of self._index_text()? The same question applies to _words_from_text():
```python
def _words_from_text(self, text):
    """
    This function uses a Tokenizer to output a
    list of the words in the input string.

    Parameters
    ----------
    text: str
        The label encodes the ground truth label of the Instance.
        This encoding varies across tasks and instances.

    Returns
    -------
    word_list: List[str]
        A list of the words, as tokenized by the
        TextInstance's tokenizer.
    """
    return self.tokenizer.get_words_for_indexer(text)
```
Why not use self.tokenizer.get_words_for_indexer(text) directly?

3). _words_from_text(self, text) is defined, but it seems it is never used?
2. In data_indexer.py, add_word_to_index(): why return the index of the word? The only use of this function is to add a word to the vocabulary (word_indices), so returning the index seems unnecessary?
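For what it's worth, a common reason vocabulary methods return the index is so callers can add and look up in one step. A rough sketch (not necessarily the repository's data_indexer.py):

```python
class DataIndexer:
    def __init__(self):
        self.word_indices = {}

    def add_word_to_index(self, word):
        # Return the index whether the word was just added or already present,
        # so a caller can write `idx = indexer.add_word_to_index(word)` directly.
        if word not in self.word_indices:
            self.word_indices[word] = len(self.word_indices)
        return self.word_indices[word]
```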
3. In instance.py, in the comment of pad_sequence_to_length():

> The reason we truncate from the right by default is for cases that are questions, with long set ups. We at least want to get the question encoded, which is always at the end.

If you want to encode the question, which is at the end of the sequence, shouldn't you truncate from the left?
Nelson (maintainer):

> 1). I thought the text here is the tokenized sentence, not the ground-truth label written in the comment?

Whoops, you're right. Looks like the docstring is wrong, sorry about that.
> 2). Why not directly use self.tokenizer.index_text() instead of self._index_text()?

The base Instance class defines a self._index_text(), and I thought it would be better to keep a consistent API throughout. Some instances might have more elaborate definitions of this function.
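As a sketch of that design (names assumed, not the repository's exact code), the thin base-class wrapper gives every subclass a single hook to override while callers keep one entry point:

```python
class TextInstance:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def _index_text(self, text, data_indexer):
        # Default behavior: delegate straight to the tokenizer.
        return self.tokenizer.index_text(text, data_indexer)

class TruncatingInstance(TextInstance):
    def _index_text(self, text, data_indexer):
        # A hypothetical subclass layering extra behavior; callers still use
        # the same self._index_text(...) entry point.
        return super()._index_text(text, data_indexer)[:500]
```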
> 3). _words_from_text(self, text) is defined, but it seems it is never used?

I took the instance API from https://github.com/allenai/deep_qa; I think that's part of the API there.
> 4). If you want to encode the question, which is at the end of the sequence, shouldn't you truncate from the left?

This was another copy-paste error from deep_qa, sorry. I think truncating from the left makes more sense for QA, but in this code I preferred to truncate from the right, since I suspect that the subject / salient information of a sentence tends to be closer to the front. This is just my intuition, though.
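To make the two conventions concrete, a small sketch (hypothetical helper, not the repository's pad_sequence_to_length):

```python
def truncate(tokens, max_length, from_right=True):
    # from_right=True drops tokens at the end, keeping the start of the
    # sequence; from_right=False drops tokens at the beginning, keeping the
    # end (e.g. a question that trails a long setup).
    if len(tokens) <= max_length:
        return tokens
    return tokens[:max_length] if from_right else tokens[-max_length:]

print(truncate(list('abcdef'), 3))                    # ['a', 'b', 'c']
print(truncate(list('abcdef'), 3, from_right=False))  # ['d', 'e', 'f']
```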
p-null:

Hi, Nelson!

1. There are two different styles of return value in the code, e.g.:

```python
return (word_indexed_text, character_indexed_text)
return {"words": words, "characters": characters}
```

When do you choose to return a dictionary versus a tuple?
2. In sts_instance.py:

```python
return ((first_sentence_word_array, first_sentence_char_matrix,
         second_sentence_word_array, second_sentence_char_matrix),
        (np.asarray(self.label),))
```

Why do you return (np.asarray(self.label),) instead of np.asarray(self.label)?
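One possible reading (a sketch, not confirmed by the repository): keeping the targets in a tuple even when there is only one element lets batching code treat inputs and targets uniformly, regardless of how many outputs a model has:

```python
import numpy as np

# Two toy instances, each shaped as (inputs_tuple, targets_tuple).
instances = [
    ((np.array([1, 2]), np.array([3, 4])), (np.asarray(0),)),
    ((np.array([5, 6]), np.array([7, 8])), (np.asarray(1),)),
]

inputs, targets = zip(*instances)
# zip(*...) regroups per component, so the same two lines batch any number of
# input or target arrays; a bare np.asarray(self.label) would break the
# symmetry between the two halves.
batched_inputs = [np.stack(parts) for parts in zip(*inputs)]
batched_targets = [np.stack(parts) for parts in zip(*targets)]
print([a.shape for a in batched_inputs])   # [(2, 2), (2, 2)]
print([a.shape for a in batched_targets])  # [(2,)]
```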