terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/
https://pyterrier.readthedocs.io/

pt.text.sliding() efficiency #217

Open cmacdonald opened 3 years ago

cmacdonald commented 3 years ago

pt.text.sliding() uses iterrows(); this could probably be made faster.
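
For instance, a rough sketch of the kind of change (itertuples avoids the per-row Series construction that makes iterrows slow):

import pandas as pd

df = pd.DataFrame({'docno': ['d1', 'd2'], 'text': ['a b c', 'd e f']})

# itertuples avoids constructing a pandas Series per row, which is the
# main overhead of iterrows
rows = [(row.docno, row.text) for row in df.itertuples(index=False)]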

seanmacavaney commented 3 years ago

Would be nice if there was an iterator version as well to avoid dataframes when indexing. Maybe add an optional transform_iter() function to the transformer spec?
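
Something like this, perhaps (a rough sketch; transform_iter is the proposed addition, not the current transformer API):

# hypothetical sketch of an iterator version; transform_iter is the
# proposed addition to the transformer spec, not the current API
def transform_iter(docs, length=512, stride=256):
    for doc in docs:  # each doc is a dict with at least 'docno' and 'text'
        toks = doc['text'].split()
        for i, start in enumerate(range(0, max(1, len(toks)), stride)):
            yield dict(doc,
                       docno='%s%%p%d' % (doc['docno'], i),
                       text=' '.join(toks[start:start + length]))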

mihirs16 commented 1 year ago

(adding to the overall functionality of pt.text.sliding())

the tokenisation for SlidingWindowPassager uses a simple regex split on spaces; would it be beneficial to allow passing a custom tokeniser for getting the tokens?

for example, the number of tokens generated by a simple regex split is less than the number of tokens generated by the default tok_model in the PyTerrier_t5 plugin; this may lead to each passage being silently truncated.
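
for illustration, a quick check of the mismatch (assuming the transformers package; t5-base is the tokeniser family that PyTerrier_t5 defaults to):

# quick check of the token-count mismatch (assumes the transformers
# package; t5-base is the tokeniser family PyTerrier_t5 defaults to)
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('t5-base')
text = "sliding windows over long documents for passage reranking"
print(len(text.split()))        # whitespace token count
print(len(tok.tokenize(text)))  # subword token count, typically larger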

cmacdonald commented 1 year ago

This is a great idea. I think the tokenisation and joining functions could be generic? The defaults would be p.split (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L421) and ' '.join() (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L424)
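
For instance, a minimal sketch (the function and parameter names here are illustrative, not the actual implementation):

# hypothetical sketch: pluggable tokenise/join functions, with the current
# behaviour (str.split and ' '.join) as the defaults
def passages(text, length=512, stride=256,
             tokenise=str.split, join=' '.join):
    toks = tokenise(text)
    for start in range(0, max(1, len(toks)), stride):
        yield join(toks[start:start + length])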

Happy to receive a PR in this direction.

It would be interesting to know the level of impact that the silent truncation has, as the following passage would contain that text, right?

seanmacavaney commented 1 year ago

I'd love if there was a way we could fit sentence segmentation into this as well. Splitting mid-sentence isn't ideal, since most models are pretty sensitive to surface-level features like that.

mihirs16 commented 1 year ago

> This is a great idea. I think the tokenisation and joining functions could be generic? The defaults would be p.split (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L421) and ' '.join() (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L424)

@cmacdonald Yupp, something like that could work great. But also, in addition to the tokeniser itself, maybe the sliding window size should be more dynamic? For example, if an input sequence must include the query, the window length should accommodate the length of the query to be added, right?
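
for illustration, a back-of-the-envelope budget (assuming the t5-base tokeniser; the template string follows the monoT5 prompt format):

# hypothetical budget: shrink the window so the monoT5 input
# "Query: q Document: d Relevant:" still fits within 512 tokens
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('t5-base')

def window_budget(query, max_input_len=512):
    template_len = len(tok.tokenize('Query:  Document:  Relevant:'))
    return max_input_len - template_len - len(tok.tokenize(query))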

> It would be interesting to know the level of impact that the silent truncation has, as the following passage would contain that text, right?

Consider the following experiment from CODEC

> T5 [29] is state-of-the-art LM re-ranker that casts text re-ranking into a sequence-to-sequence setting and has shown impressive results. We use Pygaggle’s [21] MonoT5 model, which is fine-tuned using MS Marco. The model is not fine-tuned specifically on CODEC and is used in a transfer-learning setup because of the size and scope of the current benchmark. For document and entity ranking, we employ a max-passage approach similar to Nogueira et al. [29] to re-rank initial retrieval runs (BM25, BM25+RM3, ANCE-FirstP, ANCE-MaxP). The document is sharded in 512 tokens shards with a 256 overlapping token window (maximum 12 shards per document), and the highest scored shard is taken to represent the document.

I tried replicating this with the code below.

Indexing

import json

import pyterrier as pt
if not pt.started():
    pt.init()
from pyterrier import IterDictIndexer

def iter_file(filename):
    """
    load jsonl as a generator
    """
    with open(filename, 'rt') as file:
        for each_row in file:
            each_row_data = json.loads(each_row)
            each_row_data['docno'] = each_row_data.pop('id')
            each_row_data['text'] = each_row_data.pop('contents')
            yield each_row_data

indexer = IterDictIndexer(
    index_path='./index',
    meta={'docno': 32, 'text': 6144},
    overwrite=True,
    verbose=True
)
indexref = indexer.index(iter_file('CODEC/corpus/codec_documents.jsonl'))

Retrieval

# imports assumed for this snippet (MonoT5ReRanker is from the PyTerrier_t5
# plugin; codec_se is the CODEC dataset object, loaded elsewhere)
from pyterrier import BatchRetrieve
from pyterrier.measures import MAP, R, nDCG
from pyterrier_t5 import MonoT5ReRanker

# load index
index = pt.IndexRef.of("./index/data.properties")

# pipeline utils
tokenise = pt.rewrite.tokenise()

# tuned bm25
bm25_tuned = BatchRetrieve(
    index_location  = index, 
    wmodel          = "BM25", 
    controls        = {
        "bm25.b" : 0.6, 
        "bm25.k_1": 2.5,
        "bm25.k_3": 4.9
    },
)

# bm25 + rm3
rm3_tuned = pt.rewrite.RM3(index, fb_terms=95, fb_docs=20)
bm25_rm3_tuned = bm25_tuned >> rm3_tuned >> bm25_tuned

# monoT5 re-ranker
monoT5 = MonoT5ReRanker(
    model='castorini/monot5-base-msmarco-10k', 
    verbose=True
)
t5_window = pt.text.sliding(
    text_attr    = 'text', 
    length       = 512, 
    stride       = 256, 
    prepend_attr = None,
)
t5_pipe = pt.text.get_text(index, 'text') \
    >> t5_window >> monoT5 >> pt.text.max_passage()

# run experiments and eval
experiment = pt.Experiment(
    retr_systems = [
        tokenise >> bm25_tuned, 
        tokenise >> bm25_rm3_tuned,
        # tokenise >> bm25_tuned >> t5_pipe,
        tokenise >> bm25_rm3_tuned >> t5_pipe,
    ],
    names        = [
        'bm25', 
        'bm25_rm3',
        # 'bm25_t5',
        'bm25_rm3_t5',
    ],
    topics       = codec_se.get_topics(),
    qrels        = codec_se.get_qrels(),
    eval_metrics = [MAP, R@1000, nDCG@1000,  nDCG@10],
    save_dir     = './experiments/results/',
    # save_mode    = 'overwrite',
    verbose      = True,
)
print(experiment.head())
print("---> EXPERIMENT DONE <---")

Results

pt.Experiment:  67%|██████████████████████████████████████                   | 2/3 [00:00<00:00,  8.98system/s]calling sliding on df of 42000 rows

Token indices sequence length is longer than the specified maximum sequence length for this model (1434 > 512). Running this sequence through the model will result in indexing errors
monoT5: 100%|█████████████████████████████████████████████████████| 17120/17120 [6:12:15<00:00,  1.30s/batches]
pt.Experiment: 100%|█████████████████████████████████████████████████████| 3/3 [6:13:19<00:00, 7466.44s/system]
          name        AP    R@1000  nDCG@1000   nDCG@10
0         bm25  0.205983  0.759257   0.491709  0.301454
1     bm25_rm3  0.233003  0.808846   0.525245  0.323132
2  bm25_rm3_t5  0.027656  0.808846   0.301217  0.008882
---> EXPERIMENT DONE <---

The results for bm25 and bm25_rm3 pretty much match the expected results when reproduced; however, the results for bm25_rm3_t5 are significantly worse than they should be. Furthermore, the pipeline produces a warning that the input token length is larger than the maximum limit for the T5 model.

I am using PyTerrier_t5. Please do let me know if it's an error on my part, but as far as I understand, there are two problems. First, the p.split tokeniser's token count mismatches the token count of MonoT5ReRanker's tokeniser, which is t5-base. Second, for a sequence-to-sequence LM like T5, the input sequence is not just the passage to score but the query as well; the exact input sequence has the format Query: q Document: d Relevant: (as per Document Ranking with a Pretrained Sequence-to-Sequence Model), which is a problem since the sliding window is fixed and can't accommodate a dynamic length depending on the query tokens.

mihirs16 commented 1 year ago

> Happy to receive a PR in this direction.

I'd be happy to work on this!

mihirs16 commented 1 year ago

> I'd love if there was a way we could fit sentence segmentation into this as well. Splitting mid-sentence isn't ideal, since most models are pretty sensitive to surface-level features like that.

@seanmacavaney how exactly would that work? First tokenising sentences using nltk.tokenize.sent_tokenize, and then extracting tokens from the list of sentences, disregarding the last sentence if it goes over the upper bound of the window?
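
something along these lines, perhaps (a rough sketch assuming nltk; whole sentences are kept, and a sentence that would overflow starts the next passage):

# rough sketch: sentence-aware windows with nltk; a sentence that would
# overflow the window starts the next passage instead of being split
from nltk.tokenize import sent_tokenize

def sentence_windows(text, length=512):
    passage, used = [], 0
    for sent in sent_tokenize(text):
        n = len(sent.split())
        if used + n > length and passage:
            yield ' '.join(passage)
            passage, used = [], 0
        passage.append(sent)
        used += n
    if passage:
        yield ' '.join(passage)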

cmacdonald commented 1 year ago

> however, the results for bm25_rm3_t5 are significantly worse than they should be.

You need a pt.rewrite.reset() in the RM3 pipeline @mihirs16. T5 won’t understand the reformulated query.
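
Concretely, something like this (reusing the names from the snippet above):

# add pt.rewrite.reset() after the second BM25 pass, so that monoT5 scores
# against the original query rather than the RM3-expanded one
bm25_rm3_tuned = bm25_tuned >> rm3_tuned >> bm25_tuned >> pt.rewrite.reset()
t5_system = tokenise >> bm25_rm3_tuned >> t5_pipe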

mihirs16 commented 1 year ago

> You need a pt.rewrite.reset() in the RM3 pipeline @mihirs16. T5 won’t understand the reformulated query.

that was the problem! thanks a ton!

(test over smaller sample data; results screenshot attached)