kaishxu closed this issue 3 years ago
Thanks for reaching out! First off, it looks like you're using v0.1 but you might find the v0.2 branch a lot richer in features and more systematic.
For Q1 and Q2, queries (the input to self.query(queries)) and docs (the input to self.doc(docs)) are not lists of words. They are batches (lists) of strings, which are then tokenized into word-pieces. Also notice that our v0.1 branch uses HuggingFace Transformers version 2; the more recent branch uses Transformers 3.
For Q3, this is an implementation trick for concise code; the statement in the paper is correct. The tokens appended to the document are masked twice: (a) their attention_mask is set to zero (this applies in general), and (b) in self.doc (lines 51--55) their output embeddings are also masked. Subsequently, during indexing, these embeddings are dropped entirely. This is applied to documents but not to queries. Thus, in fact, "Unlike queries, we do not append [mask] tokens to documents."
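The double masking described above can be sketched roughly as follows. This is an illustrative toy, not ColBERT's actual code: `encode_doc_sketch`, `embed`, and the token ids are made-up stand-ins.

```python
import numpy as np

def encode_doc_sketch(token_ids, mask_id, embed):
    """Toy sketch of masking appended [mask] tokens twice:
    (a) their attention_mask is set to zero, and
    (b) their output embeddings are zeroed, then dropped at indexing time."""
    # (a) zero attention for [mask] positions, just like padding
    attention_mask = np.array([0 if t == mask_id else 1 for t in token_ids])
    # `embed` stands in for the encoder mapping token ids to vectors
    embeddings = np.stack([embed(t) for t in token_ids])
    # (b) zero out the output embeddings at masked positions
    embeddings = embeddings * attention_mask[:, None]
    # at indexing time, masked embeddings are dropped entirely
    kept = embeddings[attention_mask.astype(bool)]
    return attention_mask, kept
```

The net effect is that the document encoder never contributes embeddings for the appended tokens, matching the paper's statement.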
Let me know if you have any further questions!
Thanks a lot for such a quick reply!!!!!!!
For Q1 and Q2, sorry, I did not notice the Transformers version. I will read the v0.2 branch right now. For Q3, I understand your meaning, but I still think the statement "Unlike queries, we do not append [mask] tokens to documents." is misleading. Actually, you do append [mask] tokens; they are just masked out, like punctuation in the skip list.
Not really, the query encoder is augmented with [MASK]s whereas the document encoder is not.
The document encoder never "sees" the [mask] tokens: they're masked in input, attention, and output, just like padding. This particular branch just saves a few "if" statements by clever use of attention and MaxSim masks, which could be confusing though.
Hello, I've read the v0.2 code and found you have fixed the issue.
Another question: why do you sort the samples by 'maxlen' within a batch?
Indeed, v0.2 uses a more straightforward implementation for this so it's clearer. However, the behavior is identical to v0.1 as there was no 'issue' to fix.
The sorting by maxlen is just for efficiency during training, in case you use --accum N where N > 1. It helps reduce the amount of padding used for document representations. (This padding is what is masked and dropped, in the responses above. It's needed to allow batch processing of variable-length documents.)
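As a rough illustration of why sorting helps (hypothetical code, not the repo's implementation): each sub-batch used with --accum is padded only to its own maximum length, so grouping similar-length documents together wastes less padding.

```python
def make_padded_subbatches(docs, accum_steps):
    """Toy sketch: sort token-id lists by length, then split into
    `accum_steps` sub-batches, each padded to its own max length."""
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    chunk = len(docs) // accum_steps
    subbatches = []
    for s in range(0, len(docs), chunk):
        group = [docs[i] for i in order[s:s + chunk]]
        maxlen = max(len(d) for d in group)
        # pad with zeros only up to this sub-batch's max length
        subbatches.append([d + [0] * (maxlen - len(d)) for d in group])
    return subbatches
```

Without the sort, a single long document forces every sub-batch it lands in to be padded to its length; with the sort, the long documents cluster together and the short sub-batches stay short.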
WOW, that is such precise pruning! Thank you for your reply; I learned a lot through reading!!!! I had just been roughly shuffling samples before feeding them into the trainer.
Hello, in "lazy_batcher.py", the function "_load_collection" is defined as follows.
```python
def _load_collection(self, path):
    print_message("#> Loading collection...")
    collection = []
    with open(path) as f:
        for line_idx, line in enumerate(f):
            pid, passage, title, *_ = line.strip().split('\t')
            assert pid == 'id' or int(pid) == line_idx
            passage = title + ' | ' + passage
            collection.append(passage)
```
However, it seems there is no 'title' column in the "collection.tsv" file.
check again? :) https://github.com/microsoft/MSMARCO-Passage-Ranking
That's right, MS MARCO doesn't have titles. But lazy_batcher isn't used with the official MS MARCO data. You can see that training.py automatically selects eager_batcher instead for this type of dataset.
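For reference, the official MS MARCO collection.tsv has just two tab-separated columns, pid and passage. A minimal loader for that two-column format (a sketch for illustration, not the repo's eager_batcher code) would look like:

```python
def load_two_column_collection(lines):
    """Toy loader for a `pid \t passage` collection with no title column."""
    collection = []
    for line_idx, line in enumerate(lines):
        pid, passage = line.strip().split('\t')
        # tolerate an optional header row whose pid field is 'id'
        assert pid == 'id' or int(pid) == line_idx
        collection.append(passage)
    return collection
```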
Hi @okhat, if my dataset has titles and I want to use EagerBatcher instead of LazyBatcher (LazyBatcher seems more complicated than EagerBatcher), is it okay if my collection is already in the (title + ' | ' + passage) format? Can you explain in more detail when we should use LazyBatcher?
Hello, I am reading your code to replicate the experiment. There are some questions about the model in "model.py".
The result of "self._tokenize()" is a word list, not a word-piece list, so it seems improper to truncate it by doc_maxlen, which limits the number of word-piece tokens.