jina-ai / examples

Jina examples and demos to help you get started
https://docs.jina.ai
Apache License 2.0

wikipedia-sentence-search: results should be better #432

Closed · alexcg1 closed this issue 3 years ago

alexcg1 commented 3 years ago

Based on feedback from team members (@BastinJafari), search results from Wikipedia sentence search should be improved.

@BastinJafari could you elaborate on that and upload the screenshots you showed me?

@shivaylamba could you take a quick look and make a PR or comment here with improvements? (I've looked myself and I think it's likely just changing a few YAML parameters)

@zhenwang23 tagging you here to keep you in the loop

BastinJafari commented 3 years ago

You get very similar results for very different queries.

[screenshots: near-identical result lists returned for two different queries]

shivaylamba commented 3 years ago

Since we are using a Transformer model on the Kaggle dataset, I feel the Transformer is not fine-tuned for the dataset provided. So before actually querying, we should make changes to our indexing config (index.yml) to fine-tune our model.

shivaylamba commented 3 years ago

```yaml
!TransformerTorchEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: distilbert-base-cased
  max_length: 192  # default is 96; 192 gave increased accuracy
```

Changing the value of max_length and using improved pre-trained models for the dataset should help.
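
To make the `max_length` trade-off concrete, here's a minimal sketch using the Hugging Face `transformers` library (which `TransformerTorchEncoder` wraps); the repeated sentence is just a stand-in for a long input:

```python
# Minimal sketch: tokens beyond max_length are truncated and never
# reach the model, so a larger cap lets more of each sentence count.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
sentence = "The forests are inhabited by wild Water Buffalo. " * 20  # artificially long

ids = tokenizer(sentence, truncation=True, max_length=96)["input_ids"]
print(len(ids))  # capped at 96, no matter how long the input is
```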

alexcg1 commented 3 years ago

Pretty much what I had in mind when I assigned you this issue @shivaylamba ;) Any thoughts on a better model? I tested sentence_bert and got slightly better results.
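
For quick A/B checks between candidate models outside of Jina, a throwaway script along these lines can help (the mean pooling and the model/sentence choices here are illustrative, not the example's actual code):

```python
# Hedged sketch: embed a query and candidate sentences, rank by cosine
# similarity, and eyeball whether the ordering makes sense.
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model_name="distilbert-base-cased"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(texts, padding=True, truncation=True, max_length=192,
              return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean over real tokens

query = embed(["forest"])
docs = embed(["The forests are inhabited by wild Water Buffalo.",
              "Kiger is a surname."])
print(torch.nn.functional.cosine_similarity(query, docs))  # first should score higher
```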

As for fine-tuning, that's something we haven't yet implemented in Jina.

Yongxuanzhang commented 3 years ago

How about adding a segmenter and ranker like https://github.com/jina-ai/examples/tree/master/multires-lyrics-search/pods? Would that help? I thought the Wikipedia example is just a simplified version of multires-lyrics-search.
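
For reference, a very rough sketch of what that could look like, assuming a Jina version where `from jina import Flow` and `index_lines` are available; the pod YAML filenames are hypothetical placeholders modeled on multires-lyrics-search's `pods/` directory:

```python
# Rough sketch only: the YAML paths below are hypothetical.
from jina import Flow

f = (Flow()
     .add(name='segmenter', uses='pods/segment.yml')       # split docs into sentence chunks
     .add(name='encoder', uses='pods/encode.yml')          # embed each chunk
     .add(name='indexer', uses='pods/chunk_indexer.yml'))  # store chunk embeddings

with f:
    f.index_lines(lines=['A first sentence to index.',
                         'A second sentence to index.'])
```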

alexcg1 commented 3 years ago

The Wikipedia example is intended to be (and to remain) very simple. That means there's only so far we can push it while keeping that simplicity.

My thoughts:

The "Not OK to adjust" stuff can be done in later tutorials though. Just not this super-simple one

shivaylamba commented 3 years ago

@alexcg1 Can we also have a look at GPT-2 as a model of choice? I also looked into GloVe vectors and they perform worse than BERT-based models. Since you've looked at the sentence_bert model, a step up from that would be GPT-2/GPT-3.

Also, do you think scraping more example data from Wikipedia, rather than just the 50 sentences in the current example, could be helpful?

alexcg1 commented 3 years ago

I've seen folks trying to use GPT-2 with Jina before and it doesn't really work; Jina likes Fill-Mask models in my experience. But you're welcome to give it a shot with GPT-2. GPT-3 would need API access, which may be an obstacle.

Re the 50 sentences in the example, those are in `toy-input.txt`, right? We have a script (`get_data.sh`) to download thousands of sentences, and you can then specify how many to index with `export JINA_MAX_DOCS=30000` or whatever.
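
For the record, the cap is typically applied along these lines (this snippet is illustrative, not the exact app.py code; the input filename is an assumption):

```python
# Illustrative sketch: cap how many sentences get indexed via JINA_MAX_DOCS.
import os
from itertools import islice

max_docs = int(os.environ.get('JINA_MAX_DOCS', 50))
with open('input.txt') as f:  # hypothetical data file
    sentences = [line.strip() for line in islice(f, max_docs)]
print(f'indexing {len(sentences)} sentences')
```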

I'm testing now with a dataset of 30,000 sentences and an embedding size of 384.

alexcg1 commented 3 years ago

So, an embedding size of 384 doesn't really make much difference with a dataset of 30k sentences. Compared to the Dockerized example (also 30k sentences), it produces okay results for `forest` or `computer` but it's pretty bad on `Josef Stalin`.

For reference:

| Model | Time to index (min:sec) | Dataset size | max_length |
| --- | --- | --- | --- |
| deepset/sentence_bert | 5:44.81 | 3k | 192 |
| bert-base-cased | 3:04.33 | 3k | 192 |
| bert-base-cased | 41:09.68 | 30k | 384 |

alexcg1 commented 3 years ago

The Dockerized example is giving much worse results than running something similar on bare metal. I think something might've gone wrong last time I built and pushed the Docker image. Maybe it only indexed a smaller set that time?

The Dockerized version seemingly loves to bring up a result list that includes Alice from Lourdes becoming a saint, or something like that.

alexcg1 commented 3 years ago

FWIW, distilbert-base-uncased returns MUCH worse results than distilbert-base-cased.

Searching `forest` in a dataset of 30k sentences with `max_length` of 192:

distilbert-base-uncased

0:"Not to be confused with Epperson. "
1:"It would be easy to catch these people. "
2:"It's nature up close, but not too personal. "
3:"There is nothing unnatural about it. "
4:"Kiger is a surname. "
5:"Cork is a surname. "
6:"Rather Than the usual dubbed song . "
7:"Munks is a surname. "
8:"They also have their own island. "
9:"Dalto is a surname. "

distilbert-base-cased

0:"The forests are inhabited by wild Water Buffalo. "
1:"It contains about 380 native plant species, of which 45 are classified as endangered. "
2:"The flora and fauna include nationally scarce plants and insects including a species of fly unrecorded elsewhere in the United Kingdom. "
3:"It is also a natural habitat for wild animals such as the Angulate Tortoise, the Small Gray Mongoose and the endangered Cape Rain Frog. "
4:"Often this produce marshes, but in some cases wet meadows may be produced. "
5:"Species occur in the mountains up to 4200 meters in elevation. "
6:"It is near Mount Stanford on the Sierra Crest, in the Inyo National Forest. "
7:"The soils which range from acid to alkaline and front wet to dry gives rise to a diverse woodland structure. "
8:"Through harvesting less, there is enough biomass left in the forest, so that the forest may stay healthy and still stay maintained. "
9:"It is generally located in mountains, below the upper montane vegetation type. "
alexcg1 commented 3 years ago

One thing I just noticed:

- Dockerized workspace size: 128 KB
- Bare metal workspace size: 102.5 MB

So the last version I Dockerized must've indexed much less data than I thought. I'll fix that now.

@BastinJafari this explains the really bad results before (though to be honest, searching a bigger dataset for Josef Stalin still returns few good results, simply due to the number of Docs we're indexing. If we indexed the entire dataset of nearly 8 million sentences, there'd be 750 that include the text Stalin).
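
A quick sanity check along these lines would have caught this earlier; the helper below is just a sketch, not part of the example:

```python
# Sketch: a near-empty workspace means almost nothing was indexed,
# whatever the indexing logs claimed.
import os

def dir_size_mb(path):
    total = sum(os.path.getsize(os.path.join(dirpath, name))
                for dirpath, _, names in os.walk(path) for name in names)
    return total / 1e6

print(f"workspace size: {dir_size_mb('workspace'):.1f} MB")
```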

BastinJafari commented 3 years ago

Nice to hear you found the culprit :)

alexcg1 commented 3 years ago

The culprit was my own fool self who messed up the indexing before! Not even on par with a Scooby-Doo villain reveal :laughing:

FionnD commented 3 years ago

Hey @alexcg1 @BastinJafari should we keep this open or close it?

alexcg1 commented 3 years ago

Closed, @FionnD!