tensorflow / ranking

Learning to Rank in TensorFlow

ANTIQUE Dataset Tokenisation #206

Closed divyakyatam closed 4 years ago

divyakyatam commented 4 years ago

Hi TFR Team,

I tried creating TFRecords from the raw ANTIQUE dataset, but I couldn't reproduce similar results; the accuracies are quite low.

Could you please share what kind of tokenisation you used to create the document and query tokens?

Thanks.

ramakumar1729 commented 4 years ago

We used BERT's wordpiece tokenizer.
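For illustration, here is a minimal sketch of applying a BERT wordpiece tokeniser to a query/document pair. The bert-for-tf2 import and the checkpoint path are assumptions, chosen to match the snippet shared later in this thread.

```python
# Minimal sketch: wordpiece-tokenise a query/document pair with BERT's
# FullTokenizer. The vocab.txt path is a placeholder for the file shipped
# with a BERT checkpoint.
from bert import bert_tokenization  # from the bert-for-tf2 package

tokenizer = bert_tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # assumed checkpoint path
    do_lower_case=True)

query = "why do people become vegetarian"
document = "Many people choose a vegetarian diet for health or ethical reasons."

query_tokens = tokenizer.tokenize(query)        # list of wordpiece strings
document_tokens = tokenizer.tokenize(document)  # same, for the document text
print(query_tokens, document_tokens)
```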

divyakyatam commented 4 years ago

Thank You @ramakumar1729.

divyakyatam commented 4 years ago

@ramakumar1729 May I also ask how you created vocab.txt: before or after tokenising? As I see, a vocab file is passed when defining the full tokeniser: tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

Could you also please help me with this.

Thanks in advance

ramakumar1729 commented 4 years ago

We used the bert-base checkpoint and the vocab.txt provided in the BERT GitHub repo.

divyakyatam commented 4 years ago

@ramakumar1729 Thanks again

divyakyatam commented 4 years ago

Hey @ramakumar1729, I have tried both of the tokenisers below and still couldn't reproduce the results.

from bert import bert_tokenization

tokenizer = bert_tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)

and

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Do you have any suggestions for why the accuracy is lower, or how to improve it?

Thanks

ramakumar1729 commented 4 years ago

@divyakyatam : Can you compare the data you generated with the data we provide at http://ciir.cs.umass.edu/downloads/Antique/tf-ranking/ ?
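One quick way to do that comparison is to dump the feature names stored in a provided ELWC record. A rough sketch; the file name below is a placeholder for whichever file you downloaded:

```python
# Rough sketch: print the feature names stored in one ELWC record.
import tensorflow as tf
from tensorflow_serving.apis import input_pb2

dataset = tf.data.TFRecordDataset("antique_train.tfrecords")  # placeholder path
for raw_record in dataset.take(1):
    elwc = input_pb2.ExampleListWithContext()
    elwc.ParseFromString(raw_record.numpy())
    print("context features:", sorted(elwc.context.features.feature.keys()))
    print("per-example features:",
          sorted(elwc.examples[0].features.feature.keys()))
```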

divyakyatam commented 4 years ago

Hi @ramakumar1729,

I did compare with the files provided here

I see the ELWC TFRecords have many features like document_mask_id, Bert_embedding_tokens, etc.

The TFRecords that I generated only have query_tokens, document_tokens and relevance. As far as I understand, only these three features are used to train tf_ranking, right?

Please do correct me if I misunderstood.
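For reference, a hedged sketch of writing one ELWC record containing just those three features; the feature names follow the discussion above, while the token values, the label and the output path are made up:

```python
# Sketch: assemble one ExampleListWithContext with query_tokens as a
# context feature and document_tokens / relevance as per-example features.
import tensorflow as tf
from tensorflow_serving.apis import input_pb2

def _tokens_feature(tokens):
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[t.encode("utf-8") for t in tokens]))

def _label_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

elwc = input_pb2.ExampleListWithContext()
elwc.context.features.feature["query_tokens"].CopyFrom(
    _tokens_feature(["why", "do", "people", "become", "vegetarian"]))

doc = elwc.examples.add()
doc.features.feature["document_tokens"].CopyFrom(
    _tokens_feature(["health", "and", "ethical", "reasons"]))
doc.features.feature["relevance"].CopyFrom(_label_feature(3))

with tf.io.TFRecordWriter("my_train.tfrecords") as writer:  # placeholder path
    writer.write(elwc.SerializeToString())
```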

Observations

Here are the results using the data provided there:

Saving dict for global step 15000: global_step = 15000, labels_mean = 1.8611765, logits_mean = 2.5329251, loss = -0.8419584, metric/arp = 8.8508215, metric/ndcg@1 = 0.5982143, metric/ndcg@10 = 0.7938485, metric/ndcg@3 = 0.6889765, metric/ndcg@5 = 0.7227536, metric/ordered_pair_accuracy = 0.67115384, metric/weighted_ndcg@1 = 0.5982143, metric/weighted_ndcg@10 = 0.7938485, metric/weighted_ndcg@3 = 0.6889765, metric/weighted_ndcg@5 = 0.7227536

whereas with the data I curated using the following tokenizer, there is still a large gap to close:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Saving dict for global step 15000 (BERT tokeniser): global_step = 15000, labels_mean = 1.397762, logits_mean = 2.4103777, loss = -0.6839987, metric/arp = 15.286026, metric/ndcg@1 = 0.51339287, metric/ndcg@10 = 0.5709844, metric/ndcg@3 = 0.54329383, metric/ndcg@5 = 0.5581025, metric/ordered_pair_accuracy = 0.58394027, metric/weighted_ndcg@1 = 0.51339287, metric/weighted_ndcg@10 = 0.5709844, metric/weighted_ndcg@3 = 0.54329383, metric/weighted_ndcg@5 = 0.5581025

Thanks in advance

divyakyatam commented 4 years ago

Hi @ramakumar1729 ,

In regard to the above discussion, I took a look at this issue and understand that we should generate vocab.txt after applying tokenisation. But above you mentioned using the vocab.txt from the BERT GitHub repo.

Thanks

divyakyatam commented 4 years ago

Hi,

I also found a difference in results: the numbers from the notebook version of TF-Ranking are higher than those from the script version.

handling_sparse_features.ipynb (notebook version)

({'global_step': 15000, 'labels_mean': 1.9630322, 'logits_mean': 1.8256326, 'loss': -0.8472472, 'metric/ndcg@1': 0.705, 'metric/ndcg@10': 0.8444919, 'metric/ndcg@3': 0.7445483, 'metric/ndcg@5': 0.78524727}, [])

tf_ranking_tfrecord.py (script version)

Saving dict for global step 15000: global_step = 15000, labels_mean = 1.8611765, logits_mean = 1.9152381, loss = -0.79547465, metric/arp = 8.963338, metric/ndcg@1 = 0.68303573, metric/ndcg@10 = 0.80916977, metric/ndcg@3 = 0.69290155, metric/ndcg@5 = 0.7276103, metric/ordered_pair_accuracy = 0.6492308, metric/weighted_ndcg@1 = 0.68303573, metric/weighted_ndcg@10 = 0.80916977, metric/weighted_ndcg@3 = 0.69290155, metric/weighted_ndcg@5 = 0.7276103

Could you please explain why there is that difference?

Thanks

divyakyatam commented 4 years ago

Hi @ramakumar1729,

While comparing the TFRecords, what I couldn't understand was how you decided on the number of documents for each query.

If you could answer this, it would be very helpful.

Thanks

ramakumar1729 commented 4 years ago

@divyakyatam : For a custom dataset, it is better to create your own vocabulary. Note that the purpose of this demo is to show reasonable training curves on the Antique dataset, not the best possible performance of a neural network model.
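As a hedged illustration (not the authors' exact recipe), one way to build a vocab.txt from an already-tokenised corpus:

```python
# Sketch: write a vocab.txt from tokenised queries/documents. The special
# tokens and the frequency cutoff are assumptions, not the TFR team's recipe.
import collections

def build_vocab(tokenised_texts, min_count=5, out_path="vocab.txt"):
    counts = collections.Counter(
        token for tokens in tokenised_texts for token in tokens)
    with open(out_path, "w", encoding="utf-8") as f:
        # BERT-style tokenisers expect these entries to exist in the vocab.
        for special in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
            f.write(special + "\n")
        for token, count in counts.most_common():
            if count >= min_count:
                f.write(token + "\n")

# e.g. build_vocab(all_query_token_lists + all_document_token_lists)
```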

One difference I see between the script and demo versions is that list_size is 50 in the demo and set to None in the script. None indicates that a variable number of documents can be present per query and all of them are processed, whereas a fixed list size truncates or pads the list. This could be causing the difference. Note that for a fair evaluation, the eval_input_fn should use the same list_size.
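A sketch of where list_size enters the input_fn, loosely following the TF-Ranking ELWC examples; the feature specs and batch size below are placeholders:

```python
# Sketch: list_size controls truncation/padding of each query's document
# list when building the dataset. Feature specs here are placeholders.
import tensorflow as tf
import tensorflow_ranking as tfr

_LIST_SIZE = 50  # 50 in the demo notebook; None in the script (variable length)

context_feature_spec = {"query_tokens": tf.io.VarLenFeature(tf.string)}
example_feature_spec = {
    "document_tokens": tf.io.VarLenFeature(tf.string),
    "relevance": tf.io.FixedLenFeature([1], tf.int64, default_value=[-1])}

def input_fn(path, num_epochs=None):
    dataset = tfr.data.build_ranking_dataset(
        file_pattern=path,
        data_format=tfr.data.ELWC,
        batch_size=32,
        list_size=_LIST_SIZE,  # None keeps all documents per query
        context_feature_spec=context_feature_spec,
        example_feature_spec=example_feature_spec,
        reader=tf.data.TFRecordDataset,
        num_epochs=num_epochs)
    features = tf.compat.v1.data.make_one_shot_iterator(dataset).get_next()
    label = tf.cast(tf.squeeze(features.pop("relevance"), axis=2), tf.float32)
    return features, label
```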

Afaik, the Antique dataset has a variable number of documents per query. We do not do any sampling or truncation as such. Let me know if you have any other questions.

divyakyatam commented 4 years ago

Thank you @ramakumar1729 for the support.