divyakyatam closed this issue 4 years ago
We used BERT's wordpiece tokenizer.
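For illustration, here is a minimal sketch of that wordpiece tokenization, assuming the bert-tensorflow package and a local copy of the uncased vocab.txt (the path is a placeholder):
# Minimal sketch: wordpiece-tokenize a query and a document with BERT's FullTokenizer.
from bert import tokenization

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # placeholder path
    do_lower_case=True)

query_tokens = tokenizer.tokenize("how do you power off an iphone")
document_tokens = tokenizer.tokenize("Hold the power button until the slider appears.")
print(query_tokens)
print(document_tokens)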
Thank you, @ramakumar1729.
@ramakumar1729 May I also ask how you created vocab.txt: before or after tokenizing? As I see it, the vocab file is passed when the full tokenizer is defined:
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)
Could you please help me with this as well?
Thanks in advance
We used the bert-base checkpoint and the vocab.txt provided in the BERT GitHub repo.
@ramakumar1729 Thanks again
Hey @ramakumar1729, I have tried both of the tokenizers shown below and still couldn't reproduce the results.
from bert import bert_tokenization
tokenizer = bert_tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)
and
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Do you have any suggestions for improving the accuracy, or ideas about why it is lower?
Thanks
@divyakyatam: Can you compare the data you generated with the data we provide at http://ciir.cs.umass.edu/downloads/Antique/tf-ranking/?
Hi @ramakumar1729,
I did compare with the files provided there.
I see the ELWC tfrecords have many features such as document_mask_id, Bert_embedding_tokens, etc., whereas the tfrecords I generated only have query_tokens, document_tokens and relevance. As far as I understand, only these three features are used to train tf_ranking, right?
Please correct me if I misunderstood.
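To make that comparison concrete, here is a minimal sketch of dumping the feature names in each ELWC record, assuming TensorFlow 2.x and the tensorflow-serving-api package (the file path is a placeholder):
import tensorflow as tf
from tensorflow_serving.apis import input_pb2

# Print the context and per-document feature names of the first ELWC record.
path = "antique_train.tfrecords"  # placeholder path
for raw_record in tf.data.TFRecordDataset(path).take(1):
    elwc = input_pb2.ExampleListWithContext()
    elwc.ParseFromString(raw_record.numpy())
    print("context features:", sorted(elwc.context.features.feature.keys()))
    print("example features:", sorted(elwc.examples[0].features.feature.keys()))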
######################## OBSERVATIONS ########################
Here are the results using the data provided at the link above:
Saving dict for global step 15000: global_step = 15000, labels_mean = 1.8611765, logits_mean = 2.5329251, loss = -0.8419584, metric/arp = 8.8508215, metric/ndcg@1 = 0.5982143, metric/ndcg@10 = 0.7938485, metric/ndcg@3 = 0.6889765, metric/ndcg@5 = 0.7227536, metric/ordered_pair_accuracy = 0.67115384, metric/weighted_ndcg@1 = 0.5982143, metric/weighted_ndcg@10 = 0.7938485, metric/weighted_ndcg@3 = 0.6889765, metric/weighted_ndcg@5 = 0.7227536
Whereas with the data I curated using the following tokenizer, there is still a large gap in the results:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Saving dict for global step 15000 (BERT tokenizer): global_step = 15000, labels_mean = 1.397762, logits_mean = 2.4103777, loss = -0.6839987, metric/arp = 15.286026, metric/ndcg@1 = 0.51339287, metric/ndcg@10 = 0.5709844, metric/ndcg@3 = 0.54329383, metric/ndcg@5 = 0.5581025, metric/ordered_pair_accuracy = 0.58394027, metric/weighted_ndcg@1 = 0.51339287, metric/weighted_ndcg@10 = 0.5709844, metric/weighted_ndcg@3 = 0.54329383, metric/weighted_ndcg@5 = 0.5581025
Thanks in advance
Hi @ramakumar1729 ,
Regarding the above discussion, I have taken a look at this issue and understand that we should generate vocab.txt after applying tokenization. But above you mentioned using the vocab.txt from the BERT GitHub repo.
Thanks
Hi,
I also found a difference in results: the numbers from the notebook version of TF-Ranking are higher than those from the script version.
handling_sparse_features.ipynb (notebook version):
({'global_step': 15000, 'labels_mean': 1.9630322, 'logits_mean': 1.8256326, 'loss': -0.8472472, 'metric/ndcg@1': 0.705, 'metric/ndcg@10': 0.8444919, 'metric/ndcg@3': 0.7445483, 'metric/ndcg@5': 0.78524727}, [])
tf_ranking_tfrecord.py (script version):
Saving dict for global step 15000: global_step = 15000, labels_mean = 1.8611765, logits_mean = 1.9152381, loss = -0.79547465, metric/arp = 8.963338, metric/ndcg@1 = 0.68303573, metric/ndcg@10 = 0.80916977, metric/ndcg@3 = 0.69290155, metric/ndcg@5 = 0.7276103, metric/ordered_pair_accuracy = 0.6492308, metric/weighted_ndcg@1 = 0.68303573, metric/weighted_ndcg@10 = 0.80916977, metric/weighted_ndcg@3 = 0.69290155, metric/weighted_ndcg@5 = 0.7276103
Could you please explain why there is this difference?
Thanks
Hi @ramakumar1729,
While comparing the tfrecords, one thing I couldn't understand was how you decided on the number of documents for each query.
If you could please answer this, it would be very helpful.
Thanks
@divyakyatam: For a custom dataset, it is better to create your own vocabulary. Note that the purpose of this demo is to show reasonable training curves on the Antique dataset, not the best possible performance of a neural network model.
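For what it's worth, here is a minimal sketch of writing a custom vocab.txt from already-tokenized data; it simply collects unique tokens plus BERT's special tokens and is not a full wordpiece learner (tokenized_docs and the output path are placeholders):
# Collect unique tokens from the tokenized corpus and write them to vocab.txt.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

vocab = set()
for tokens in tokenized_docs:  # tokenized_docs: iterable of token lists (placeholder)
    vocab.update(tokens)

with open("vocab.txt", "w") as f:
    for token in special_tokens + sorted(vocab):
        f.write(token + "\n")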
One difference I see between the script and the demo version is that list_size is 50 in the demo and set to None in the script. None indicates that a variable number of documents can be present per query and all of them are processed, whereas a fixed list size truncates or pads the list. This could be causing the difference. Note that for a fair evaluation, the eval_input_fn should use the same list_size.
Afaik, the Antique dataset has a variable number of documents per query. We do not do any sampling or truncation as such. Let me know if you have any other questions.
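To illustrate where list_size enters the pipeline, here is a minimal sketch assuming the tfr.data.build_ranking_dataset helper and placeholder feature specs; list_size=50 pads or truncates each query's document list, while list_size=None keeps the variable-length lists:
import tensorflow as tf
import tensorflow_ranking as tfr

# Placeholder feature specs matching the three ELWC features discussed above.
context_feature_spec = {"query_tokens": tf.io.VarLenFeature(tf.string)}
example_feature_spec = {
    "document_tokens": tf.io.VarLenFeature(tf.string),
    "relevance": tf.io.FixedLenFeature([1], tf.int64, default_value=[0])}

dataset = tfr.data.build_ranking_dataset(
    file_pattern="antique_train.tfrecords",  # placeholder path
    data_format=tfr.data.ELWC,
    batch_size=32,
    context_feature_spec=context_feature_spec,
    example_feature_spec=example_feature_spec,
    list_size=50,  # None keeps the variable number of documents per query
    reader=tf.data.TFRecordDataset)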
Thank you @ramakumar1729 for the support.
Hi TFR Team,
I tried creating tfrecords from the raw ANTIQUE dataset, but I couldn't reproduce similar results; the accuracies are quite low.
Could you please share what kind of tokenization you used to create the document and query tokens?
Thanks.
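For reference, here is a minimal sketch of packing one query and its documents into an ELWC record with the three features discussed above (query_tokens, document_tokens, relevance); the tokens, relevance labels and output path are placeholders:
import tensorflow as tf
from tensorflow_serving.apis import input_pb2

def _bytes_feature(tokens):
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[t.encode("utf-8") for t in tokens]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Placeholder data: one query with two judged documents.
query_tokens = ["how", "do", "you", "power", "off", "an", "iphone"]
documents = [(["hold", "the", "power", "button"], 3), (["unrelated", "answer"], 1)]

elwc = input_pb2.ExampleListWithContext()
elwc.context.CopyFrom(tf.train.Example(features=tf.train.Features(
    feature={"query_tokens": _bytes_feature(query_tokens)})))
for doc_tokens, relevance in documents:
    elwc.examples.add().CopyFrom(tf.train.Example(features=tf.train.Features(feature={
        "document_tokens": _bytes_feature(doc_tokens),
        "relevance": _int64_feature(relevance)})))

with tf.io.TFRecordWriter("antique_train.tfrecords") as writer:  # placeholder path
    writer.write(elwc.SerializeToString())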