You get very similar results for very different queries
Since we are running a generic pretrained Transformer over the Kaggle dataset, the model is not fine-tuned for the data provided. So before actually querying, we should make changes to our indexing config (index.yml) to better tune the model for the dataset.
```yaml
!TransformerTorchEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: distilbert-base-cased
  max_length: 192
```
Changing the value of `max_length` and using better pre-trained models for the dataset should help.
Pretty much what I had in mind when I assigned you this issue @shivaylamba ;) Any thoughts on a better model? I tested sentence_bert and got slightly better results.
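For anyone reproducing that test, the change should just be the model name in `index.yml`; a sketch based on the encoder block above (the exact surrounding config may differ):

```yaml
!TransformerTorchEncoder
with:
  pooling_strategy: auto
  pretrained_model_name_or_path: deepset/sentence_bert
  max_length: 192
```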
For fine-tuning, that's something we haven't yet implemented in Jina
How about adding a segmenter and ranker like https://github.com/jina-ai/examples/tree/master/multires-lyrics-search/pods? Would that help? I thought the Wikipedia example was just a simplified version of multires-lyrics-search.
Wikipedia example is intended to be (and to remain) very simple. That means it's true there's only so far we can push it while keeping that simplicity.
My thoughts:

- OK to adjust: the model, `max_length`, and other simple YAML parameters.
- The "Not OK to adjust" stuff (like segmenters and rankers) can be done in later tutorials though. Just not this super-simple one.
@alexcg1 Can we also have a look at GPT-2 as a model of choice? I also looked into GloVe vectors and they perform worse than BERT-based models. Since you looked at the sentence_bert model, a step up from that would be GPT-2/GPT-3.
Also, do you think scraping more example data from Wikipedia, rather than just the 50 sentences in the example, could be helpful?
I've seen folks trying to use GPT-2 with Jina before and it doesn't really work. Jina likes Fill-Mask models in my experience. But you're welcome to give it a shot with GPT-2. GPT-3 would need API access, which may be an obstacle.
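If anyone does try GPT-2, one caveat: GPT-2 has no [CLS] token, so `pooling_strategy: auto` may not do anything sensible. Assuming the encoder supports mean pooling, a starting point might look like this (untested sketch):

```yaml
!TransformerTorchEncoder
with:
  # GPT-2 has no [CLS] token, so mean-pool the token embeddings instead of relying on auto
  pooling_strategy: mean
  pretrained_model_name_or_path: gpt2
  max_length: 192
```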
Re the 50 sentences in the example, that's in `toy-input.txt`, right? We have a script (`get_data.sh`) to download thousands of sentences, and you can then specify how many you want to index with `export JINA_MAX_DOCS=30000` or whatever.
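Roughly, the full-dataset flow looks like this (the indexing entrypoint below is a guess; check the example's README for the exact command):

```sh
# Fetch the full sentence dataset instead of toy-input.txt
sh get_data.sh

# Cap how many sentences get indexed
export JINA_MAX_DOCS=30000

# Run the example's indexing step (hypothetical entrypoint)
python app.py index
```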
I'm testing now with a dataset of 30,000 sentences and an embedding size of 384.
So, an embedding size of 384 doesn't really make much difference with a dataset of 30k sentences. Compared to the Dockerized example (also 30k sentences) it produces okay results for `forest` or `computer`, but it's pretty bad on `Josef Stalin`.
For reference:

| Model | Time to index | Dataset size | max_length |
|---|---|---|---|
| deepset/sentence_bert | 5:44.81 | 3k | 192 |
| bert-base-cased | 3:04.33 | 3k | 192 |
| bert-base-cased | 41:09.68 | 30k | 384 |
The Dockerized example is giving much worse results than running something similar on bare metal. I think something might've gone wrong last time I built and pushed the Docker image. Maybe it only indexed a smaller set that time?
The Dockerized version seemingly loves to bring up a result list that includes Alice from Lourdes becoming a saint, or something like that.
FWIW, `distilbert-base-uncased` returns MUCH worse results than `distilbert-base-cased`.
Searching `forest` in a dataset of 30k sentences with `max_length` of 192:
0:"Not to be confused with Epperson. "
1:"It would be easy to catch these people. "
2:"It's nature up close, but not too personal. "
3:"There is nothing unnatural about it. "
4:"Kiger is a surname. "
5:"Cork is a surname. "
6:"Rather Than the usual dubbed song . "
7:"Munks is a surname. "
8:"They also have their own island. "
9:"Dalto is a surname. "
0:"The forests are inhabited by wild Water Buffalo. "
1:"It contains about 380 native plant species, of which 45 are classified as endangered. "
2:"The flora and fauna include nationally scarce plants and insects including a species of fly unrecorded elsewhere in the United Kingdom. "
3:"It is also a natural habitat for wild animals such as the Angulate Tortoise, the Small Gray Mongoose and the endangered Cape Rain Frog. "
4:"Often this produce marshes, but in some cases wet meadows may be produced. "
5:"Species occur in the mountains up to 4200 meters in elevation. "
6:"It is near Mount Stanford on the Sierra Crest, in the Inyo National Forest. "
7:"The soils which range from acid to alkaline and front wet to dry gives rise to a diverse woodland structure. "
8:"Through harvesting less, there is enough biomass left in the forest, so that the forest may stay healthy and still stay maintained. "
9:"It is generally located in mountains, below the upper montane vegetation type. "
One thing I just noticed:

- Dockerized workspace size: 128 KB
- Bare-metal workspace size: 102.5 MB
So the last version I Dockerized must've indexed much less data than I thought. I'll fix that now.
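A quick sanity check for next time, assuming the index is written to a local `workspace` directory: if the workspace is only a few hundred KB after an indexing run, almost nothing actually got indexed.

```sh
# Compare on-disk index sizes after an indexing run
du -sh workspace
```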
@BastinJafari this explains the really bad results before (tho tbh searching a bigger dataset for `Josef Stalin` still returns few good results, just due to the number of Docs we're indexing. If we indexed the entire dataset of nearly 8m sentences, there'd be 750 that include the text `Stalin`).
Nice to hear you found the culprit :)
The culprit was my own fool self who messed up the indexing before! Not even on par with a Scooby-Doo villain reveal :laughing:
Hey @alexcg1 @BastinJafari should we keep this open or close it?
Closed @FionnD !
Based on feedback from team members (@BastinJafari), the search results from the Wikipedia sentence search example should be improved.
@BastinJafari could you elaborate on that and upload the screenshots you showed me?
@shivaylamba could you take a quick look and make a PR or comment here with improvements? (I've looked myself and I think it's likely just changing a few YAML parameters)
@zhenwang23 tagging you here to keep you in the loop