stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

about ColBERT(BertPreTrainedModel) #19

Closed. kaishxu closed this issue 3 years ago.

kaishxu commented 3 years ago

Hello, I am reading your code to replicate the experiments, and I have some questions about the model in "model.py".

  1. In the query() function, "queries" appear to be lists of words, so they cannot be passed to self.tokenizer.encode(); the standard input for tokenizer.encode() should be text.
  2. In the doc() function,
    docs = [["[unused1]"] + self._tokenize(d)[:self.doc_maxlen-3] for d in docs]

    the result of self._tokenize() is a list of words, not word pieces, so truncating it with doc_maxlen, which limits the number of word-piece tokens, seems incorrect.

  3. Although the paper states "Unlike queries, we do not append [mask] tokens to documents.", the code uses the same _encode() function, with the same [mask] padding, for both queries and docs.
okhat commented 3 years ago

Thanks for reaching out! First off, it looks like you're using v0.1 but you might find the v0.2 branch a lot richer in features and more systematic.

For Q1 and Q2, queries (the input to self.query(queries)) and docs (the input to self.doc(docs)) are not lists of words. They are batches (lists) of strings, which are then tokenized into word pieces. Also note that our v0.1 branch uses HuggingFace Transformers version 2; the more recent branch uses Transformers 3.
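
As a quick illustration of the expected inputs (the query strings below are made up, not from the repo), a batch of query texts is passed as-is and each string is tokenized into word-piece ids by the HuggingFace tokenizer:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    queries = ["what is late interaction", "how does colbert score passages"]  # batch of strings, not word lists
    token_ids = [tokenizer.encode(q, add_special_tokens=True) for q in queries]
    # each entry is a list of word-piece ids, wrapped in [CLS] ... [SEP]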

For Q3, this is an implementation trick for concise code; the statement in the paper is correct. The reason for this is that the tokens appended to the document are masked twice: (a) their attention_mask is set to zero (this applies in general) and (b) in self.doc lines 51--55 their output embeddings are also masked. Subsequently, during indexing, these embeddings are dropped entirely. This is applied to the document but not to the query. Thus, in fact, "Unlike queries, we do not append [mask] tokens to documents."
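
For intuition, here is a minimal sketch (illustrative, not the repo's exact code) of that double masking: positions belonging to appended [mask]/padding tokens have attention_mask == 0, and the same mask is reused to zero out their output embeddings before they are dropped at indexing time.

    import torch

    def mask_doc_output(embeddings, attention_mask):
        # embeddings: (batch, doc_maxlen, dim) contextualized outputs
        # attention_mask: (batch, doc_maxlen), 0 at appended [mask]/padding positions
        mask = attention_mask.unsqueeze(-1).to(embeddings.dtype)
        return embeddings * mask  # masked positions become zero vectors and are later discarded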

Let me know if you have any further questions!

kaishxu commented 3 years ago

Thanks a lot for such a quick reply!!!!!!!

For Q1 and Q2, sorry, I did not notice the "Transformers" version; I will read the v0.2 branch right now. For Q3, I understand your point, but I still think the statement "Unlike queries, we do not append [mask] tokens to documents." is misleading. Actually, you do append [mask] tokens; they are just masked out, like the punctuation in the skip list.

okhat commented 3 years ago

Not really, the query encoder is augmented with [MASK]s whereas the document encoder is not.

The document encoder never "sees" the [mask] tokens: they're masked in the input, the attention, and the output, just like padding. This particular branch just saves a few "if" statements by clever use of attention and MaxSim masks, which can be confusing, though.
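
To make the MaxSim-mask part concrete, here is an illustrative (not verbatim) late-interaction score in which masked document positions are excluded from the max, so they can never contribute:

    import torch

    def maxsim(Q, D, D_mask):
        # Q: (q_len, dim) query token embeddings; D: (d_len, dim) doc token embeddings
        # D_mask: (d_len,) bool, False at [mask]/padding positions
        sims = Q @ D.T                                          # token-level similarity matrix
        sims = sims.masked_fill(~D_mask.unsqueeze(0), float("-inf"))
        return sims.max(dim=-1).values.sum()                    # best doc match per query token, summed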

kaishxu commented 3 years ago

Hello, I've read the v0.2 code and found that you have fixed the issue.

Another question: why do you sort the samples by 'maxlen' within a batch?

okhat commented 3 years ago

Indeed, v0.2 uses a more straightforward implementation for this, so it's clearer. However, the behavior is identical to v0.1, as there was no 'issue' to fix.

The sorting by maxlen is just for efficiency during training, in case you use --accum N where N > 1. It helps reduce the amount of padding used for document representations. (This padding is what is masked and dropped in the responses above; it's needed to allow batch processing of variable-length documents.)
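
As a rough illustration of why sorting helps (an assumed helper, not the repo's code): when a batch is split into accumulation sub-batches, grouping documents of similar length means each sub-batch only pads up to its own local maximum length rather than the global one.

    def split_into_subbatches(docs, num_subbatches):
        # Sort by length so similarly sized docs land in the same sub-batch,
        # minimizing the padding needed within each one.
        docs = sorted(docs, key=len)
        size = (len(docs) + num_subbatches - 1) // num_subbatches
        return [docs[i:i + size] for i in range(0, len(docs), size)]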

kaishxu commented 3 years ago

WOW, that is such precise pruning! Thank you for your reply. I learned a lot through reading!!!! I just roughly shuffle samples before feeding them into the trainer.

kaishxu commented 3 years ago

Hello, in "lazy_batcher.py", the function "_load_collection" is defined as follows.

def _load_collection(self, path):
    print_message("#> Loading collection...")

    collection = []

    with open(path) as f:
        for line_idx, line in enumerate(f):
            pid, passage, title, *_ = line.strip().split('\t')
            assert pid == 'id' or int(pid) == line_idx

            passage = title + ' | ' + passage
            collection.append(passage)

However, it seems there is no 'title' column in the "collection.tsv" file.

[Screenshot: sample lines from collection.tsv]

check again? :) https://github.com/microsoft/MSMARCO-Passage-Ranking

okhat commented 3 years ago

That's right, MS MARCO doesn't have titles. But lazy_batcher isn't used with the official MS MARCO data. You can see that training.py automatically selects eager_batcher instead for this type of dataset.
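
For reference, the official collection.tsv has just two tab-separated columns (pid and passage), so a loader for it would look more like this sketch (illustrative, not the repo's exact code):

    def load_msmarco_collection(path):
        # Each line is "<pid>\t<passage>", with pids numbered 0, 1, 2, ...
        collection = []
        with open(path) as f:
            for line_idx, line in enumerate(f):
                pid, passage = line.strip().split('\t')
                assert pid == 'id' or int(pid) == line_idx  # tolerate an optional header row
                collection.append(passage)
        return collection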

hieudx149 commented 2 years ago

Hi @okhat , if my dataset has titles and I want to use EagerBatcher instead of LazyBatcher (LazyBatcher seems more complicated than EagerBatcher), is it OK if my collection is already in (title + ' | ' + passage) format? Can you explain in more detail when we should use LazyBatcher?