stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Instructions on using ColBERT #73

Closed · puzzlecollector closed this 2 years ago

puzzlecollector commented 3 years ago

@okhat I am trying to use ColBERT for a document retrieval project I am working on and I'd like to ask if I have understood the procedure correctly. I am trying to perform a ranking task based on the similarity of passages. So when a query passage comes in, the system has to retrieve, from the document passages I have, the top-K documents most similar to the query.

colbert_e2e = pyterrier_colbert_factory.end_to_end()
(colbert_e2e % 10).search()



Does this step look about right? I will proceed as I have written above and if I encounter any problems I will ask again in this thread. Thank you :) 
puzzlecollector commented 3 years ago

@okhat

Another question - when I use the following command to train the ColBERT model on my custom .tsv dataset

python -m colbert.train --amp --doc_maxlen 512 --query_maxlen 512 --mask-punctuation --bsize 8 --accum 1 --triples train_dataset.tsv 

I get a log printed in the following format:

[screenshot of the training log output]

I wonder what each of these numbers represents. For example, if I inspect just this line from the log:

[screenshot of a single log line showing 1024, 0.8598..., 364.25, 300.33, and 63.92...]

I am guessing that 1024 is the current number of training steps and 0.8598... is the loss, but I am unsure what 364.25, 300.33, and 63.92... mean.

okhat commented 3 years ago

Hi @puzzlecollector ! Thanks for your questions.

You can use the pyterrier wrapper indeed, or you can check out our new_api branch, which I'd guess is more powerful but perhaps slightly less stable at this stage. It comes with a few perks, like easier indexing and much smaller index sizes due to internal compression of the embeddings. You can use it with the same checkpoint you're training now if you like, but it can't work with old indexes.

The steps look fine to me, though I've never used pyterrier before. However, your --query_maxlen 512 seems a bit excessive to me. Are you sure you need queries that long? I strongly advise 32 or 64 for that, and unless your passages really need to be that long, 256 is probably a better choice than 512 in practice for efficiency. I'd also increase the batch size to 32, using --accum 2 or --accum 4 if needed.
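
For instance, keeping the same flags as your command but with the adjusted lengths and batch size, that would look roughly like:

python -m colbert.train --amp --doc_maxlen 256 --query_maxlen 64 --mask-punctuation --bsize 32 --accum 2 --triples train_dataset.tsv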

The logs print a sample of the scores of positive and negative passages, so in this case 364 and 300 are those scores. Training essentially tries to maximize the gap between them, which is the 63.92.
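
For reference, those three numbers come from a small logging helper; a sketch of it (based on the print_progress helper in the v1 code, exact formatting may differ by version):

    def print_progress(scores):
        # scores: [bsize, 2]; column 0 holds positive-passage scores, column 1 negative ones
        positive_avg = round(scores[:, 0].mean().item(), 2)
        negative_avg = round(scores[:, 1].mean().item(), 2)
        print(positive_avg, negative_avg, '|', positive_avg - negative_avg)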

puzzlecollector commented 3 years ago

@okhat Thanks for the kind answer :) I am trying to use ColBERT for a document similarity task. So given a query, I want to rank the top-K documents in the corpus that are most similar to the query document. These documents are pretty long, so I need to chunk them into smaller pieces; maybe I will just chunk them into 256-token pieces and use a batch size of 32 (with accum 2 or 4) as you suggested.

Also I will try and use the ColBERT repo's indexing and retrieval functionalities. I will follow up with a few more questions if I run into any problems during this process so please don't close this thread yet! Thanks :)

okhat commented 3 years ago

Awesome! When dividing docs into passages, make sure you handle the distinction between words and BERT tokens. In other words, 256 words is often much larger than 256 BERT tokens. The docs2passages script has support for both ways of dividing docs.
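
A small illustration of that distinction (this is not the docs2passages script itself; it assumes the transformers library and a BERT tokenizer):

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

    doc = "ColBERT relies on late interaction over contextualized token embeddings."
    words = doc.split()
    tokens = tokenizer.tokenize(doc)
    print(len(words), "words ->", len(tokens), "BERT tokens")  # token count is usually larger

    # Word-based split: every 180 whitespace-separated words become one passage.
    word_passages = [" ".join(words[i:i + 180]) for i in range(0, len(words), 180)]

    # Token-based split: every 180 BERT tokens become one passage.
    ids = tokenizer(doc, add_special_tokens=False)["input_ids"]
    token_passages = [tokenizer.decode(ids[i:i + 180]) for i in range(0, len(ids), 180)]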

puzzlecollector commented 3 years ago

@okhat In README.md there are two methods of indexing:

CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--collection /path/to/MSMARCO/collection.tsv \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--root /root/to/experiments/ --experiment MSMARCO-psg

and a second method (using FAISS I guess?)

python -m colbert.index_faiss \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--partitions 32768 --sample 0.3 \
--root /root/to/experiments/ --experiment MSMARCO-psg

and it says that I should use the second method for end-to-end retrieval (which is what I actually want to do anyway). So I guess I would have to run the first method first (thus creating the indexes), and then run the second method to apply FAISS to the indexes created in the first step, right?

Afterwards, I guess I can create a queries.tsv file (with query IDs and query texts) for retrieval, using the following command:

python -m colbert.retrieve --amp --doc_maxlen 256 --query_maxlen 256 --mask_punctuation --bsize 256 --queries <path to queries> --nprobe 32 --partitions 327688 --faiss_depth 1024 --index_root <path to index> --index_name <index name> --checkpoint <path to checkpoint> 
okhat commented 3 years ago

This all looks good to me, but you might want to remove --partitions, and let that be decided automatically by the code.

If you want to configure it yourself, then you can use sqrt(doc_maxlen * # of documents) as a good choice. You can round that to the nearest power of two if you want.
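
As a hypothetical worked example of that heuristic (the collection size here is made up):

    import math

    doc_maxlen = 256             # embeddings per passage
    num_documents = 1_000_000    # passages in the collection (assumed)

    raw = math.sqrt(doc_maxlen * num_documents)   # = 16000.0
    partitions = 2 ** round(math.log2(raw))       # nearest power of two -> 16384
    print(partitions)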

puzzlecollector commented 3 years ago

@okhat
Thanks for the prompt reply! Yes I'll do that. I actually don't really know what some of the parameters mean, so is there some documentation that I could maybe refer to?

I guess arguments like --partitions, --nprobe, --faiss_depth, etc. are all related to how FAISS indexes and retrieves?

puzzlecollector commented 3 years ago

@okhat

I have a few more questions

  1. I am curious how the loss is actually computed. The paper says "ColBERT is used to produce a score for each document individually and is optimized via softmax cross-entropy loss over the computed scores of d+ and d-". Judging from this, I guess the score between q and d+ is computed via the MaxSim operations described in the paper (and so is the score between q and d-), and then we compute a softmax over the scores of (q, d+) and (q, d-) followed by a cross-entropy loss?

  2. I have a question about this ColBERT architecture diagram (I want to clarify if my understanding is correct): [ColBERT architecture diagram]

So here, let's say the query is "what passage talks about movies?" and upon tokenization we get ['what', 'passage', 'talks', 'about', 'movies', '?']. Then for each of these tokens the query encoder f_Q calculates an embedding, and each of these embeddings is "MaxSimmed" with all of the token embeddings of the document (extracted by the document encoder), right? Afterwards, the MaxSim values for each of the query tokens are summed to produce the final score, and typically cosine similarity or squared L2 distance is used for MaxSim. (See the sketch after this list.)

  3. As mentioned, I am planning to tackle a document similarity task using ColBERT, and my documents are pretty long. So far, I guess the only workaround is to chunk the query and documents into smaller pieces. So if I have a triplet (query, positive, negative), I will split each of them into n parts, such that we have (query_1, positive_1, negative_1), (query_2, positive_2, negative_2), ..., (query_n, positive_n, negative_n) where len(query_i) < 512 tokens (same for positive_i and negative_i). However, I am afraid that this chunking method may result in significant information loss and thus degrade performance. Another concern is that splitting a single datapoint into n parts will multiply the size of my dataset, and training time will increase dramatically. I wonder if you can suggest any other solution to this long-document problem? A few methods I am considering are:
    • Use BigBird instead of BERT for encoders (but this does not really solve my problem - there are texts that are longer than 4096 tokens)
    • Use some abstractive summarization method (for example Pegasus? I am not an expert in NLP/summarization in general so I'm not so sure...) to shorten the query and the document and then apply ColBERT for ranking. But summarizing can also bring about information loss.
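
Here is a minimal sketch of the scoring and loss described in items 1 and 2 (my own illustration, not the repository's code; the shapes, sizes, and normalization are assumptions):

    import torch
    import torch.nn.functional as F

    def maxsim_score(Q, D):
        # Q: [query_tokens, dim], D: [doc_tokens, dim]; both L2-normalized,
        # so the dot products below are cosine similarities.
        sim = Q @ D.T                        # similarity of every (query token, doc token) pair
        return sim.max(dim=1).values.sum()   # MaxSim per query token, then sum

    dim = 128
    Q     = F.normalize(torch.randn(32, dim), dim=-1)    # query token embeddings
    D_pos = F.normalize(torch.randn(180, dim), dim=-1)   # positive passage token embeddings
    D_neg = F.normalize(torch.randn(180, dim), dim=-1)   # negative passage token embeddings

    # Softmax cross-entropy over the (d+, d-) pair, with the positive at index 0.
    scores = torch.stack([maxsim_score(Q, D_pos), maxsim_score(Q, D_neg)]).unsqueeze(0)  # [1, 2]
    labels = torch.zeros(1, dtype=torch.long)
    loss = F.cross_entropy(scores, labels)
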
okhat commented 2 years ago

Your descriptions are correct on a quick skim. For Q3, I can assure you that dividing documents into passages works in every scenario I've seen. It should work in your use case too unless you have unusually important long-range dependencies. Long queries are a bit harder to reason about.

But my advice is: it's always easier to iterate than to put up a perfect design upfront. I'd start simple and see what failures appear. The simplest thing that would probably work very well already is to truncate the queries and the documents. Then you can divide the passages, and continue to truncate the queries. Then you can see how to divide the queries as well, to avoid truncation.

I'd stay away from changing the encoders or complex summarization pipelines. These bring their own challenges!

okhat commented 2 years ago

To add to this, I'm happy to help you iterate based on what you observe! I just encourage you to run a simple solution (not necessarily the one I described) and see how it works and where it fails.

puzzlecollector commented 2 years ago

@okhat Thanks I'll try and will let you know :)

puzzlecollector commented 2 years ago

@okhat Is there a way to print the validation loss during training as well?

puzzlecollector commented 2 years ago

@okhat I'm sorry for bombarding you with questions, but here's another one: When slicing the long input texts with sliding window, typically people seem to assume that the query is short and the document is long. So in most of the papers I recently read, it seems like the document is cut into smaller passages with a sliding window of m tokens (something like 225 tokens) and a stride of k (something like 200) for instance. So formally we can write the document D = {p_1, p_2, ..., p_n}. Then for each passage p_i, we calculate the similarity (or relevance) score with the query: s_1 = score(q, p_1), s_2 = score(q, p_2), ..., s_n = score(q, p_n). And finally some kind of score aggregation takes place such as the max or the mean score.

final_score = MAX(s_1, s_2, ..., s_n)

or

final_score = (s_1 + s_2 + ... + s_n) / n

However, what happens if the query is also long and it has to be sliced as well? This is exactly the problem I am facing since I am trying to calculate similarity scores between documents. So the query and the document are both long texts. In this case suppose the query is Q and the document is D. Then let's say I slice both the query and the document with sliding window of 225 tokens and stride of 200 tokens. Then suppose I managed to get Q = {q_1, ..., q_n} and also D = {p_1, ..., p_m}. So the query Q is split up into n passages, whereas the document D is split up into m passages. Then what is the logical thing to do afterwards to calculate the final score? The method I was thinking of is the following:

s_1 = MAX( score(q_1, p_1), score(q_1, p_2), ..., score(q_1, p_m) )
s_2 = MAX( score(q_2, p_1), score(q_2, p_2), ..., score(q_2, p_m) )
...
s_n = MAX( score(q_n, p_1), score(q_n, p_2), ..., score(q_n, p_m) )

and the final score is calculated as follows S = MAX(s_1, s_2, ..., s_n)

I'm sure I could also do a mean aggregation or use some other method (there is a paper called PARADE that I think takes a good approach to score aggregation, so I might refer to it). But anyway, that is the basic idea I am thinking of implementing. Could you let me know if this method seems valid, or is there any other approach you could suggest I try?
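
A tiny sketch of that aggregation (my own helper names; score(q, p) stands for whatever ColBERT-based scoring is used per slice pair):

    def aggregate_max_max(score, query_slices, doc_passages):
        # One score per query slice: its best-matching document passage.
        per_slice = [max(score(q, p) for p in doc_passages) for q in query_slices]
        # Final score: the best query slice (swap max for sum/len to get mean aggregation).
        return max(per_slice)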

Thanks :)

okhat commented 2 years ago

This is a very nice discussion of the problem. Let's start simple @puzzlecollector ! How does the first passage of the query do? Presumably, it gets you very good results, as documents are often well-summarized by their first few sentences.

If you want to build a reasonably advanced solution, I'd say you can encode the query with several invocations of the encoder, concatenate the output embeddings, then do a single "MaxSim" interaction.

Does that make sense? This basically says that the interaction happens for the full query at once, but the encoding can happen over the slices. The same logic can apply to the documents too. But starting simple will tell you whether you need this at all!

puzzlecollector commented 2 years ago

@okhat I am dealing with patent documents (specifically, using the claims part of the patent document) and I am unsure if the information is summarized well at the beginning of the document, so I might need to go the hard way. Your suggestion sounds interesting. This is how I understood it:

Suppose I use the same 225-token sliding window with a 200-token stride. Then the query is cut into m pieces and the document is cut into n pieces: Q = {q_1, ..., q_m} and D = {d_1, ..., d_n}. Now I pass each piece through ColBERT to obtain an embedding.

e_{q_1} = ColBERT(q_1), ..., e_{q_m} = ColBERT(q_m)

and

e_{d_1} = ColBERT(d_1), ..., e_{d_n} = ColBERT(d_n)

Now we obtain a concatenation of these embeddings for both query and document:

e_q = Concatenate(e_{q_1}, ..., e_{q_m})
e_d = Concatenate(e_{d_1}, ..., e_{d_n})

Then we calculate the score via a single MaxSim interaction (using cosine similarity) between e_q and e_d:

score = MaxSim(e_q, e_d)

Is this the process you described in your answer? Let me know if I understood correctly :)

okhat commented 2 years ago

This is exactly what I described indeed. (To keep things simple, focus on re-ranking and not end-to-end retrieval. Retrieval can be done based on the first passage of Q and D, or using a model that has no length limitation like BM25.)
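
A minimal sketch of that re-ranking scheme (assuming the per-slice token embeddings have already been produced by the query/document encoders and are L2-normalized):

    import torch

    def maxsim(Q, D):
        # Q: [query_tokens, dim], D: [doc_tokens, dim]
        return (Q @ D.T).max(dim=1).values.sum()

    def score_long(query_slice_embs, doc_slice_embs):
        # Each argument is a list of [tokens_i, dim] tensors, one per slice.
        Q = torch.cat(query_slice_embs, dim=0)   # all query token embeddings across slices
        D = torch.cat(doc_slice_embs, dim=0)     # all document token embeddings across slices
        return maxsim(Q, D)                      # a single MaxSim interaction over the full texts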

puzzlecollector commented 2 years ago

@okhat Thanks. Also, I was looking at the training code (training.py): what do the arguments args.rank and args.nranks mean?

Edit: I guess it is related to the rank of some process that is running (something to do with GPU usage).

Another question: It seems like the training is taking place for a single epoch only (with max steps specified) so I guess this is why there is no logic for printing out the validation loss after every training epoch?

puzzlecollector commented 2 years ago

@okhat I have a few more questions:

  1. How exactly does batching work? Specifically, looking at this training loop

    for batch_idx, BatchSteps in zip(range(start_batch_idx, args.maxsteps), reader):
        this_batch_loss = 0.0
    
        for queries, passages in BatchSteps:
            with amp.context():
                scores = colbert(queries, passages).view(2, -1).permute(1, 0)
                loss = criterion(scores, labels[:scores.size(0)])
                loss = loss / args.accumsteps
    
            if args.rank < 1:
                print_progress(scores)
    
            amp.backward(loss)
    
            train_loss += loss.item()
            this_batch_loss += loss.item()
    
        amp.step(colbert, optimizer)
    
        if args.rank < 1:
            avg_loss = train_loss / (batch_idx+1)
    
            num_examples_seen = (batch_idx - start_batch_idx) * args.bsize * args.nranks
            elapsed = float(time.time() - start_time)
    
            log_to_mlflow = (batch_idx % 20 == 0)
            Run.log_metric('train/avg_loss', avg_loss, step=batch_idx, log_to_mlflow=log_to_mlflow)
            Run.log_metric('train/batch_loss', this_batch_loss, step=batch_idx, log_to_mlflow=log_to_mlflow)
            Run.log_metric('train/examples', num_examples_seen, step=batch_idx, log_to_mlflow=log_to_mlflow)
            Run.log_metric('train/throughput', num_examples_seen / elapsed, step=batch_idx, log_to_mlflow=log_to_mlflow)
    
            print_message(batch_idx, avg_loss)
            manage_checkpoints(args, colbert, optimizer, batch_idx+1)

It seems that if I set the bsize argument to 32 and the doc and query max lengths to 256, then the queries and passages tensors each seem to have size [16, 256]. From this I guess we process bsize/2 queries and bsize/2 documents in each training step? (See the small illustration after this list.)

  2. Also, while training, the log prints out the loss and the scores of the positive and negative passages as well as their difference. You mentioned before that training tries to maximize this difference, so as training progresses I should expect to see the loss decreasing and, at the same time, the difference getting larger. Is this understanding correct?
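
Regarding question 1, a small illustration (with made-up numbers) of what the view/permute in the quoted loop does before the loss is computed, assuming the raw scores come back ordered as all positives followed by all negatives:

    import torch
    import torch.nn.functional as F

    B = 4                                       # per-step batch size (assumed)
    raw = torch.tensor([10., 11., 12., 13.,     # scores of the B positive passages
                         1.,  2.,  3.,  4.])    # scores of the B negative passages
    scores = raw.view(2, -1).permute(1, 0)      # -> [[10, 1], [11, 2], [12, 3], [13, 4]]
    labels = torch.zeros(B, dtype=torch.long)   # target index 0, i.e. the positive column
    loss = F.cross_entropy(scores, labels[:scores.size(0)])
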
puzzlecollector commented 2 years ago

@okhat

I'm planning to try using a sliding window to split up the long queries and documents, and then train ColBERT using my MaxP method.

As a recap, my MaxP method is: set the sliding window size to 180 words and the overlap to 20 words, so each passage has 180 words and adjacent slices overlap by 20 words. Using this method, I will preprocess my triplet training set. Suppose split(x) splits the text using this method, and |split(x)| denotes the number of parts text x was split into (so with a window of 180 words and an overlap of 20 words, if text x is split into 4 parts, then |split(x)| = 4). If I apply this to my triplet dataset, its size will become

\sum_{i=1}^{K} |split(query_i)| × |split(positive_i)| × |split(negative_i)|

where K is the total number of samples in the training set. Once I have this sliced dataset, I will train ColBERT on it. The part I have to change is the inference part. Given a query and a document,

{q_1, ..., q_n} = split(Q)
{d_1, ..., d_m} = split(D)

and I want to calculate the score via the following

S_q1 = Max(ColBERT(q_1, d_1), ..., ColBERT(q_1, d_m))
...
S_qn = Max(ColBERT(q_n, d_1), ..., ColBERT(q_n, d_m))

The final score is calculated by

S = Max(S_q1,...,S_qn)

If I want to implement this, which part of the code should I change? I suspect it is somewhere in the evaluation folder...
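
A hedged sketch of the dataset-slicing step described above (my own helper names; the window/stride handling is an assumption, not the repository's docs2passages logic):

    from itertools import product

    def split(text, window=180, overlap=20):
        # Word-based sliding window: 180-word slices, adjacent slices share 20 words.
        words = text.split()
        stride = window - overlap
        return [" ".join(words[i:i + window])
                for i in range(0, max(len(words) - overlap, 1), stride)]

    def slice_triples(triples):
        # Every combination of query/positive/negative slices becomes one training triple,
        # which is what the summation above counts.
        return [(q, p, n)
                for query, positive, negative in triples
                for q, p, n in product(split(query), split(positive), split(negative))]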

okhat commented 2 years ago

If you want to try this, my suggestion is to use the existing standard ColBERT as a retriever (for top-100 or top-1000 results) and then write your own re-ranking code on top of that. Modifying the ranking code is quite difficult, and I'm afraid I won't be able to support that in general. It's not in the evaluation directory, but rather in ranking (for the v1 code) and search iirc (in the v2 code).

kaushicaravind2000 commented 2 years ago

I have been trying to classify documents as suicidal or not. The dataset has 2 columns: one is the tweet and the other is the label indicating whether it is suicidal or not. I need help training the model and building a saved model that takes a document and classifies it. Currently I'm using the humour-detection saved model on the suicide-detection task and the accuracy is no more than 60%. Kindly help with this issue. Thank you in advance.

okhat commented 2 years ago

@kaushicaravind2000 I will re-open and respond in your other issue.