pisa-engine / pisa

PISA: Performant Indexes and Search for Academia
https://pisa-engine.github.io/pisa/book
Apache License 2.0

BERT tokens #494

Closed: JMMackenzie closed this issue 1 year ago

JMMackenzie commented 2 years ago

**Describe the solution you'd like**

Currently, PISA does not readily support BERT wordpiece tokens such as `exam ##pl`, because the `##` is eaten by the tokenizer.

We should have support for a command-line flag like `--pretokenized` (similar to Anserini) that tells the tokenizer to simply split on whitespace and do nothing more.
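As a rough illustration, here is a minimal sketch of that whitespace-only mode, assuming a free-standing helper function; it is not based on PISA's actual tokenizer interface:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch of the proposed --pretokenized behavior: split the
// input on whitespace only, so wordpiece markers such as "##" pass through
// untouched. Illustrative only; not PISA's actual tokenizer API.
std::vector<std::string> whitespace_tokenize(std::string const& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {
        tokens.push_back(token);  // e.g. "##ing" is kept as-is
    }
    return tokens;
}
```

With that, `whitespace_tokenize("fish ##ing locations")` would yield `fish`, `##ing`, `locations`.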

elshize commented 2 years ago

@JMMackenzie Do you by any chance have some Anserini docs on how this is implemented? I'm not that familiar with BERT, and I'd love to understand it a bit more.

JMMackenzie commented 2 years ago

If you check this commit, you will see that they basically just instantiate a "whitespace analyzer", which does what it says on the tin: https://github.com/castorini/anserini/commit/14b315d23e461734b6e36409bfa1745be5ba4de2

This boils down to something like this: https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

> A tokenizer that divides text at whitespace characters as defined by [Character.isWhitespace(int)](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int-). Note: That definition explicitly excludes the non-breaking space. Adjacent sequences of non-Whitespace characters form tokens.

I think for our intents and purposes, we can just tokenize directly on spaces. The only problem I see is whether special characters will be stored and handled correctly by the lexicon tooling, but I don't see why they wouldn't be. Any thoughts?

JMMackenzie commented 2 years ago

Basically, this enhancement is for cases where we are ingesting a learned sparse index, either from JSONL or from another IR toolkit like Anserini/Terrier (perhaps via CIFF), whose vocabulary looks something like:

```
##ing
...
fish
...
```

And then at query time we might see `101: fish ##ing locations` or something like that. This example is just made up, but it should explain what we need.

I think PISA would currently turn that query into `fish ing locations` and then either match `ing` against the wrong token or simply not find it.
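To make that failure mode concrete, here is a stand-in "default" tokenizer that keeps only alphanumeric runs; this is an assumption for illustration, not PISA's actual parsing code:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Stand-in "default" tokenizer (an assumption, not PISA's real parser):
// keeping only alphanumeric runs reduces "fish ##ing locations" to
// {"fish", "ing", "locations"}, so "##ing" can never match the
// wordpiece vocabulary above.
std::vector<std::string> alnum_tokenize(std::string const& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c) != 0) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```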

elshize commented 2 years ago

Ah, OK, so this would be an alternative parsing mode, correct? When `--pretokenized` is passed, we break on spaces; otherwise, business as usual?

As for the lexicon, I don't see why it wouldn't work either. There's really nothing special about "special" characters like `#`; it's all just bytes.
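As a minimal sketch of why that holds, assuming the lexicon behaves like a sorted list of byte strings (illustrative only, not PISA's actual lexicon code):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sketch only, not PISA's lexicon implementation: terms are plain byte
// strings, so "##"-prefixed entries sort and look up like any other term
// ('#' is byte 0x23, so they simply sort before alphabetic terms).
int main() {
    std::vector<std::string> lexicon{"fish", "##ing", "locations"};
    std::sort(lexicon.begin(), lexicon.end());  // byte-wise order: ##ing, fish, locations
    assert(std::binary_search(lexicon.begin(), lexicon.end(), std::string("##ing")));
    return 0;
}
```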

If you have access to, or can get your hands on, a CIFF file built this way (preferably not too large), it would be good to have for sanity checks beyond any unit/integration tests we may write.

JMMackenzie commented 1 year ago

Sure, I can generate a CIFF file if that would help!