cmacdonald opened this issue 3 years ago (Open)
It would be nice if there were an iterator version as well, to avoid dataframes when indexing. Maybe add an optional `transform_iter()` function to the transformer spec? (Adding to the overall functionality of `pt.text.sliding()`.)
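For illustration, a rough sketch of what an iterator version might look like (purely hypothetical, not part of the current API; the docno suffix mimics what `sliding()` produces):

```python
# Hypothetical sketch: an iterator-based sliding window that consumes and
# yields dicts directly, avoiding the intermediate DataFrame.
def sliding_iter(docs, text_attr='text', length=150, stride=75):
    for doc in docs:
        tokens = doc[text_attr].split()  # same simple split as SlidingWindowPassager
        passage_id = 0
        for start in range(0, max(len(tokens), 1), stride):
            window = tokens[start:start + length]
            if not window:
                break
            yield {**doc,
                   'docno': f"{doc['docno']}%p{passage_id}",
                   text_attr: ' '.join(window)}
            passage_id += 1
            if start + length >= len(tokens):
                break
```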
The tokenisation for `SlidingWindowPassager` uses a simple regex split on spaces; would it be beneficial to allow passing a custom tokenizer for getting the tokens? For example, the number of tokens generated by the simple regex split is lower than the number of tokens generated by the default `tok_model` in the PyTerrier_t5 plugin, so each passage may end up being silently truncated.
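To illustrate the mismatch (a rough sketch; exact counts depend on the text, and `t5-base` is the default tokeniser used by PyTerrier_t5):

```python
# Rough illustration of the gap between a whitespace split and the
# T5 subword tokeniser.
from transformers import T5Tokenizer

text = "a long document body ..."  # placeholder text
t5_tok = T5Tokenizer.from_pretrained('t5-base')

n_whitespace = len(text.split())        # what SlidingWindowPassager counts
n_subword = len(t5_tok.tokenize(text))  # what the model actually sees

# n_subword is typically noticeably larger than n_whitespace, so a passage of
# `length` whitespace tokens can exceed the model's 512-token limit and be
# silently truncated by the reranker's tokeniser.
print(n_whitespace, n_subword)
```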
This is a great idea. I think the tokenisation and joining functions could be generic? The defaults would be `p.split` (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L421) and `' '.join()` (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L424).
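Roughly along these lines (the `tokeniser`/`joiner` keyword arguments here are hypothetical, not the current signature):

```python
# Hypothetical generalisation: tokenise and join become pluggable, with the
# current behaviour (str.split and ' '.join) as the defaults.
import pyterrier as pt
from transformers import T5Tokenizer

t5_tok = T5Tokenizer.from_pretrained('t5-base')

window = pt.text.sliding(
    text_attr='text',
    length=512,
    stride=256,
    tokeniser=t5_tok.tokenize,               # hypothetical argument
    joiner=t5_tok.convert_tokens_to_string,  # hypothetical argument
)
```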
Happy to receive a PR in this direction.
It would be interesting to know the level of impact that the silent truncation results in, as the following passage would contain that text, right?
I'd love it if there were a way we could fit sentence segmentation into this as well. Splitting mid-sentence isn't ideal, since most models are pretty sensitive to surface-level features like that.
> This is a great idea. I think the tokenisation and joining functions could be generic? The defaults would be `p.split` (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L421) and `' '.join()` (as per https://github.com/terrier-org/pyterrier/blob/master/pyterrier/text.py#L424).
@cmacdonald Yupp, something like that could work great. But also, in addition to the tokeniser itself, maybe the sliding window size should be more dynamic? For example, if an input sequence must include the query, the window length should accommodate the length of the query to be added, right?
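For example, something like this (just a sketch; the prompt template is the one from the monoT5 paper, and the budget arithmetic is my assumption of how the window length would be derived):

```python
# Sketch: compute a per-query window length so that the monoT5 input
# "Query: {q} Document: {d} Relevant:" stays within the 512-token limit.
from transformers import T5Tokenizer

t5_tok = T5Tokenizer.from_pretrained('t5-base')
MAX_LEN = 512

def window_length_for(query: str) -> int:
    # tokens taken up by the prompt template and by the query itself
    template_overhead = len(t5_tok.tokenize("Query:  Document:  Relevant:"))
    query_overhead = len(t5_tok.tokenize(query))
    return max(0, MAX_LEN - template_overhead - query_overhead)
```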
> It would be interesting to know the level of impact that the silent truncation results in, as the following passage would contain that text, right?
Consider the following experiment from CODEC:
> T5 [29] is state-of-the-art LM re-ranker that casts text re-ranking into a sequence-to-sequence setting and has shown impressive results. We use Pygaggle’s [21] MonoT5 model, which is fine-tuned using MS Marco. The model is not fine-tuned specifically on CODEC and is used in a transfer-learning setup because of the size and scope of the current benchmark. For document and entity ranking, we employ a max-passage approach similar to Nogueira et al. [29] to re-rank initial retrieval runs (BM25, BM25+RM3, ANCE-FirstP, ANCE-MaxP). The document is sharded in 512 tokens shards with a 256 overlapping token window (maximum 12 shards per document), and the highest scored shard is taken to represent the document.
I tried replicating this with the following code:
```python
import json
import pyterrier as pt
from pyterrier.measures import *
from pyterrier_t5 import MonoT5ReRanker

pt.init()

def iter_file(filename):
    """Load a JSONL corpus file as a generator of dicts."""
    with open(filename, 'rt') as file:
        for each_row in file:
            each_row_data = json.loads(each_row)
            each_row_data['docno'] = each_row_data.pop('id')
            each_row_data['text'] = each_row_data.pop('contents')
            yield each_row_data

# index the CODEC corpus
indexer = pt.IterDictIndexer(
    index_path='./index',
    meta={'docno': 32, 'text': 6144},
    overwrite=True,
    verbose=True
)
indexref = indexer.index(iter_file('CODEC/corpus/codec_documents.jsonl'))

# load index
index = pt.IndexRef.of("./index/data.properties")

# pipeline utils
tokenise = pt.rewrite.tokenise()

# tuned BM25
bm25_tuned = pt.BatchRetrieve(
    index_location=index,
    wmodel="BM25",
    controls={
        "bm25.b": 0.6,
        "bm25.k_1": 2.5,
        "bm25.k_3": 4.9
    },
)

# BM25 + RM3
rm3_tuned = pt.rewrite.RM3(index, fb_terms=95, fb_docs=20)
bm25_rm3_tuned = bm25_tuned >> rm3_tuned >> bm25_tuned

# monoT5 re-ranker over 512-token sliding windows, scored with max-passage
monoT5 = MonoT5ReRanker(
    model='castorini/monot5-base-msmarco-10k',
    verbose=True
)
t5_window = pt.text.sliding(
    text_attr='text',
    length=512,
    stride=256,
    prepend_attr=None,
)
t5_pipe = pt.text.get_text(index, 'text') \
    >> t5_window >> monoT5 >> pt.text.max_passage()

# run experiments and evaluate
# (codec_se: CODEC topics/qrels provider, defined elsewhere in my setup)
experiment = pt.Experiment(
    retr_systems=[
        tokenise >> bm25_tuned,
        tokenise >> bm25_rm3_tuned,
        # tokenise >> bm25_tuned >> t5_pipe,
        tokenise >> bm25_rm3_tuned >> t5_pipe,
    ],
    names=[
        'bm25',
        'bm25_rm3',
        # 'bm25_t5',
        'bm25_rm3_t5',
    ],
    topics=codec_se.get_topics(),
    qrels=codec_se.get_qrels(),
    eval_metrics=[MAP, R@1000, nDCG@1000, nDCG@10],
    save_dir='./experiments/results/',
    # save_mode='overwrite',
    verbose=True,
)
print(experiment.head())
print("---> EXPERIMENT DONE <---")
```
```
pt.Experiment:  67%|██████████████████████████████████████     | 2/3 [00:00<00:00, 8.98system/s]
calling sliding on df of 42000 rows
Token indices sequence length is longer than the specified maximum sequence length for this model (1434 > 512). Running this sequence through the model will result in indexing errors
monoT5: 100%|█████████████████████████████████████████████████████| 17120/17120 [6:12:15<00:00, 1.30s/batches]
pt.Experiment: 100%|█████████████████████████████████████████████████████| 3/3 [6:13:19<00:00, 7466.44s/system]

          name        AP    R@1000  nDCG@1000   nDCG@10
0         bm25  0.205983  0.759257   0.491709  0.301454
1     bm25_rm3  0.233003  0.808846   0.525245  0.323132
2  bm25_rm3_t5  0.027656  0.808846   0.301217  0.008882
---> EXPERIMENT DONE <---
```
The results for `bm25` and `bm25_rm3` pretty much match the expected results when reproduced; however, the results for `bm25_rm3_t5` are significantly worse than they should be. Furthermore, the pipeline produces a warning that the input token length is larger than the maximum limit for the T5 model.
I am using PyTerrier_t5. Please do let me know if it's an error on my part, but as far as I understand, there are two problems. First, the `p.split` tokeniser's token count does not match the token count of `MonoT5ReRanker`'s tokeniser, which is `t5-base`. Second, for a sequence-to-sequence LM like T5, the input sequence is not just the passage to score but the query as well: the exact input has the format `Query: q Document: d Relevant:` (as per *Document Ranking with a Pretrained Sequence-to-Sequence Model*), which is a problem since the sliding window length is fixed and can't accommodate a dynamic length depending on the query tokens.
> Happy to receive a PR in this direction.
I'd be happy to work on this!
> I'd love it if there were a way we could fit sentence segmentation into this as well. Splitting mid-sentence isn't ideal, since most models are pretty sensitive to surface-level features like that.
@seanmacavaney how exactly would that work? First tokenizing sentences using `nltk.tokenize.sent_tokenize`, and then extracting tokens from the list of sentences, disregarding the last sentence if it goes over the upper bound of the window?
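Something like this, perhaps (only a sketch, with no overlap/stride handling; sentences come from nltk and whole sentences are packed greedily into whitespace-token windows):

```python
# Sketch: sentence-aware windows. Sentences come from nltk (requires
# nltk.download('punkt')); whole sentences are packed until the
# whitespace-token budget is reached, so no window cuts mid-sentence.
from nltk.tokenize import sent_tokenize

def sentence_windows(text, length=150):
    window, window_len = [], 0
    for sent in sent_tokenize(text):
        n = len(sent.split())
        if window and window_len + n > length:
            yield ' '.join(window)
            window, window_len = [], 0
        window.append(sent)
        window_len += n
    if window:
        yield ' '.join(window)
```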
> however, the results for `bm25_rm3_t5` are significantly worse than they should be.
You need a `pt.rewrite.reset()` in the RM3 pipeline @mihirs16. T5 won't understand the reformulated query.
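For concreteness, something like this in the pipeline above (a sketch; `pt.rewrite.reset()` restores the `query` column from `query_0`, i.e. the original query, before the reranker sees it):

```python
# reset() puts back the original query (query_0) after the RM3 expansion,
# so monoT5 scores passages against the real query text rather than the
# weighted expanded form.
bm25_rm3_t5 = (
    tokenise
    >> bm25_tuned >> rm3_tuned >> bm25_tuned
    >> pt.rewrite.reset()
    >> t5_pipe
)
```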
> You need a `pt.rewrite.reset()` in the RM3 pipeline @mihirs16. T5 won't understand the reformulated query.
that was the problem! thanks a ton!
(tested over a smaller sample of the data)
`pt.text.sliding()` uses `iterrows()`; this could probably be faster.
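For example (only a sketch, not benchmarked): building all passages with `itertuples()` and constructing the output frame once should already be noticeably quicker than row-by-row `iterrows()`:

```python
# Sketch: collect all passage rows first, then create the output DataFrame
# once, instead of iterating with iterrows() and building rows one at a time.
import pandas as pd

def sliding_fast(df, text_attr='text', length=150, stride=75):
    rows = []
    for row in df.itertuples(index=False):
        tokens = getattr(row, text_attr).split()
        for p, start in enumerate(range(0, max(len(tokens), 1), stride)):
            window = tokens[start:start + length]
            if not window:
                break
            d = row._asdict()
            d['docno'] = f"{d['docno']}%p{p}"
            d[text_attr] = ' '.join(window)
            rows.append(d)
            if start + length >= len(tokens):
                break
    return pd.DataFrame(rows)
```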