google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

TensorFlow Hub Module? #8

Closed HanGuo97 closed 5 years ago

HanGuo97 commented 6 years ago

Thanks for releasing BERT!

I'm just wondering if BERT will be available on TensorFlow Hub like ELMo (for either fine-tuning or extracting features)?

jacobdevlin-google commented 6 years ago

A TF-hub wrapper around BertModel will be available soon, but this module will assume the input has been tokenized in a compatible way (i.e., with tokenization.py). The tricky part is the tokenization.

loretoparisi commented 6 years ago

@jacobdevlin-google my opinion is that you should go for a C++ implementation: given its architecture and general-purpose applicability, a standalone BERT would definitely be the best option (see FastText, StarSpace, etc.). Also, as you remarked, tokenization is a separate process; once you consider BPE, double-byte characters, languages without word boundaries, diacritics, etc., it is clearly a separate stage. Of course a TF Hub module is welcome; we are using the ELMo one right now, as suggested by AllenNLP as well.

mvss80 commented 6 years ago

Thank you, that would be very helpful! It would be great if the TF-hub wrapper can generate sentence embeddings like the Universal Sentence Encoder.

For now, what is the best way to obtain a sentence embedding? If I extract ELMo-like feature vectors, what is the "right" way to combine the vectors that are generated for each token in the sentence?

jacobdevlin-google commented 6 years ago

We don't have any particular recipe for generating sentence embeddings out-of-the-box. Using the output from the "classifier" token is definitely not the right way to go, but how to combine the per-word feature vectors into a single feature vector depends on what you're trying to do. If you're trying to create a feature vector where you don't need to learn any parameters, just averaging all of the word vectors might be the only sensible way to go.

If you can learn new parameters, my go-to recipe would be "attention pooling", where the key tensor is just a learned weight vector and you attend over all of the BERT feature vectors, then feed that into one or more fully connected layers.
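
A rough TF 1.x sketch of that recipe (the function name, shapes, and output size below are illustrative, not part of this repo):

import tensorflow as tf

def attention_pool(sequence_output, input_mask):
  """Attention pooling: a learned query vector attends over the BERT token vectors."""
  # sequence_output: [batch, seq_len, hidden], e.g. BertModel.get_sequence_output()
  # input_mask:      [batch, seq_len], 1 for real tokens, 0 for padding
  scores = tf.layers.dense(sequence_output, 1)               # learned weight vector -> [batch, seq_len, 1]
  mask = tf.cast(tf.expand_dims(input_mask, -1), tf.float32)
  scores += (1.0 - mask) * -10000.0                          # mask out padding positions
  weights = tf.nn.softmax(scores, axis=1)                    # attention weights over tokens
  pooled = tf.reduce_sum(weights * sequence_output, axis=1)  # [batch, hidden]
  return tf.layers.dense(pooled, 768, activation=tf.tanh)    # one (or more) fully connected layers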

zhaopku commented 6 years ago

@jacobdevlin-google Hi, do you have an estimate of when the TF hub module will be released?

jacobdevlin-google commented 6 years ago

We're running into a bug that we're having trouble figuring out the cause of, so that's delaying things. Once that's resolved we'll get the initial version out.

jageshmaharjan commented 6 years ago

So the idea is to tokenize the sentence using tokenization.py and later run it through the TF Hub module (i.e., soon, when available)? Using a raw sentence as input wouldn't work with the TF Hub module, right? And if we input the tokens as a list, they should follow the standard format produced by tokenization.py; any other tokenization won't work, right? E.g.: { [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP] }

loretoparisi commented 5 years ago

@jacobdevlin-google any update on the TF Hub module? I'm currently using the Universal Sentence Encoder hub module, but I assume BERT should give better results on the STS tasks, etc. Thank you.

miweru commented 5 years ago

I tested BERT with SentEval and a few averaging techniques, such as tf-idf weighting, weighted averaging, and max pooling. The out-of-the-box results were not better than an averaged fastText bag-of-words on the STS tasks.

(The results on the MR, CR, SUBJ, and SST-2 tasks were better than the best results reported in the SentEval paper, so averaging and max pooling work fine on some tasks.)

On the other hand, changing the word order has a strong effect on the resulting averaged vector (which produces very exciting results, since this simple BoW representation is quite sensitive to changes in word order).

taylorchu commented 5 years ago

@miweru Could you share your numbers and experiment setup in more detail?

Thanks!

hsm207 commented 5 years ago

We don't have any particular recipe for generating sentence embeddings out-of-the-box. Using the output from the "classifier" token is definitely not the right way to go, but how to combine the per-word feature vectors into a single feature vector depends on what you're trying to do.

@jacobdevlin-google why is using the output from the "classifier" token to create a sentence embedding wrong?

miweru commented 5 years ago

Could you share your numbers and experiment setup in more detail?

https://github.com/facebookresearch/SentEval I extracted the features from BERT's second-to-last layer, averaged them, and plugged them into SentEval's automatic evaluation. (The rest of the results table is taken from the original paper, but note it is STS12-16, not SST.)

These are the results of the probing tasks (I haven't had much time and the evaluation takes a while, so they are not complete yet): https://arxiv.org/abs/1805.01070. I find this very interesting, as it shows that word order has a large effect on this BoW representation.

(Idf weighting is actually not very well defined for something like BERT, because it gives rarer tokens more weight rather than weighting one particular meaning (~vector) of a word.)
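
For reference, a minimal sketch of the second-to-last-layer averaging described above, assuming the per-token features were dumped with extract_features.py (e.g. --layers=-2) and using that script's JSON-lines output fields:

import json
import numpy as np

def sentence_vectors(jsonl_path, layer_index=-2):
  """Average per-token vectors from an extract_features.py JSON-lines dump."""
  vecs = []
  with open(jsonl_path) as f:
    for line in f:
      example = json.loads(line)
      token_vecs = []
      for feat in example["features"]:
        if feat["token"] in ("[CLS]", "[SEP]"):
          continue  # leave the special tokens out of the average
        layer = next(l for l in feat["layers"] if l["index"] == layer_index)
        token_vecs.append(layer["values"])
      vecs.append(np.mean(token_vecs, axis=0))
  return np.stack(vecs)  # [num_sentences, hidden_size]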

miweru commented 5 years ago

We don't have any particular recipe for generating sentence embeddings out-of-the-box. Using the output from the "classifier" token is definitely not the right way to go, but how to combine the per-word feature vectors into a single feature vector depends on what you're trying to do.

@jacobdevlin-google why is using the output from the "classifier" token to create a sentence embedding wrong?

I think it is not trained to be a real representation of the whole sentence; it becomes useful once we fine-tune it on our specific sentence-level task. (Would it be possible to use a typical unsupervised sentence-embedding task to train this token to be a good sentence embedding?)

hoangcuong2011 commented 5 years ago

@miweru : "I think it is not trained to be a real representation of the whole sentence." -> I also think so. I think using the embedding for sentence (via embedding for CLS token) is not enough to learn sentence similarity. Specifically, I think something simple like computing cosine similarity for two sentences based on their two CLS tokens is not good enough.

To get the best out of it, we need to either (1) fine-tune it on our specific sentence-level task, or (2) keep the embedding fixed but put an MLP on top, so that the MLP is trained for the specific task; a small sketch of option (2) follows below.
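
For option (2), a minimal sketch of what I mean, assuming the fixed sentence embeddings (e.g. BERT pooled_output vectors) are already precomputed; the layer sizes are arbitrary:

import tensorflow as tf

def mlp_on_frozen_embeddings(embeddings, num_classes):
  """Small MLP trained on top of fixed (non-trainable) sentence embeddings."""
  # embeddings: [batch, hidden] precomputed BERT vectors, fed in as placeholders/constants.
  hidden = tf.layers.dense(embeddings, 256, activation=tf.nn.relu)
  logits = tf.layers.dense(hidden, num_classes)
  return logits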

But I was wondering: Is there anyone who has a different opinion on this?

Many thanks!

HanGuo97 commented 5 years ago

Just realized BERT is now available on TensorFlow Hub, thanks to the team for the effort!

loretoparisi commented 5 years ago

Adding that the example https://github.com/google-research/bert/blob/master/run_classifier_with_tfhub.py is not very clear.

Why does the hub module signature take only encoded input (pre-processed tokens)? I think that's a severe limitation. I would rather have had the same signature as the Universal Sentence Encoder, hiding the tokenization inside the module.

bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)

jageshmaharjan commented 5 years ago

Yeah, right. I was trying to figure out how to use it, but couldn't; maybe some extra work is needed. +1

Armour commented 5 years ago

@AgoloCuongHoang I think the TF Hub docs have pretty much everything you want: https://tfhub.dev/s?q=bert

Remember that you need to do the pre-processing before feeding in the data: for example, use WordpieceTokenizer to split the input string into a list of wordpiece tokens and map those tokens to input_ids using the convert_tokens_to_ids function. input_mask and segment_ids can usually be obtained easily from the pre-processing pipeline. After that, just follow the docs like below:

# Feature based (used as embedding)
bert_module = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1", trainable=False)
# Fine-tuning BERT model
# bert_module = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1", trainable=True)

bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)
bert_outputs = bert_module(bert_inputs, signature="tokens", as_dict=True)
pooled_output = bert_outputs["pooled_output"]
sequence_output = bert_outputs["sequence_output"]
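
For example, a minimal sketch of that pre-processing with this repo's tokenization.py (the vocab path and max_seq_length are placeholders; single-segment input only, no truncation handling):

import tokenization  # tokenization.py from this repository

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

max_seq_length = 128
tokens = ["[CLS]"] + tokenizer.tokenize("BERT is on TF Hub now.") + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)   # 1 for real tokens, 0 for padding
segment_ids = [0] * len(input_ids)  # single segment, so all zeros

# Zero-pad up to max_seq_length.
padding = max_seq_length - len(input_ids)
input_ids += [0] * padding
input_mask += [0] * padding
segment_ids += [0] * padding
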
loretoparisi commented 5 years ago

@Armour could you please give an example of the pre-processing? Second question: SentencePiece is a better tool than WordPiece; would it be possible to use SentencePiece tokens / ids as input rather than WordPiece? My two cents: having the module on the hub without pre-processing gives no advantage over the way it is already integrated in other packages like FLAIR or Bert-As-Service, even if those are not official, of course.

HanGuo97 commented 5 years ago

I believe SentencePiece is not compatible with WordPiece.

hoangcuong2011 commented 5 years ago

Hi. If anyone can give us a full example including preprocessing as well, that would be extremely useful! Many thanks!

Armour commented 5 years ago

@loretoparisi Hi, I can create a repo with a simple example of using the BERT module when I have time, probably this weekend.

Yes, you can use SentencePiece if you pre-train your own BERT model with it. Keep in mind that the BERT model from TF Hub was pre-trained with WordpieceTokenizer, so you'd better keep it consistent unless you pre-train your own BERT model from scratch using SentencePiece; the expected result should be similar to the WordpieceTokenizer one. Let me know if I'm wrong :)

For your last question, I think Jacob is a better person to ask and I believe Google will also take that into consideration.

jageshmaharjan commented 5 years ago

@hoangcuong2011, there is an example at https://colab.sandbox.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb

akshaykgupta commented 5 years ago

Hi, is there a way to access intermediate layer outputs in the BERT Hub module? I would like to use a weighted sum of the embeddings from each layer (in the style of ELMo).
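
As far as I can tell, the tokens signature only exposes pooled_output and sequence_output, so per-layer outputs would have to come from running BertModel directly (modeling.BertModel.get_all_encoder_layers()). Given such a list of per-layer tensors, an ELMo-style learned mix might look roughly like this; the names below are illustrative:

import tensorflow as tf

def elmo_style_mix(layer_outputs):
  """Learned softmax-weighted sum over per-layer token embeddings (ELMo-style)."""
  # layer_outputs: list of [batch, seq_len, hidden] tensors, one per encoder layer.
  num_layers = len(layer_outputs)
  layer_logits = tf.get_variable(
      "layer_logits", shape=[num_layers], initializer=tf.zeros_initializer())
  gamma = tf.get_variable("gamma", shape=[], initializer=tf.ones_initializer())
  weights = tf.unstack(tf.nn.softmax(layer_logits))  # one scalar weight per layer
  mixed = tf.add_n([w * layer for w, layer in zip(weights, layer_outputs)])
  return gamma * mixed  # [batch, seq_len, hidden]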

robmsylvester commented 5 years ago

Following the request from @akshaykgupta, are there any plans to expose these other layers, as is done with the Inception networks available on TF Hub? I've been trying to hack it together but haven't quite stitched it all together. I have asked a question about this on Stack Overflow for reference.

dipanjan commented 5 years ago

TensorFlow Hub's bert_uncased_L-12_H-768_A-12 documentation says,

We currently only support the tokens signature,...

However, Predicting Movie Reviews with BERT on TF Hub uses the tokenization_info signature.

tokenization_info = bert_module(signature="tokenization_info", as_dict=True)

Isn't that an inconsistency? How do I find out which signatures the TF Hub BERT module supports?
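
As far as I know, the TF1 hub.Module API lets you query this programmatically; something like the following should list them (the signature names in the comment are what I would expect, not verified against the module):

import tensorflow_hub as hub

bert_module = hub.Module(
    "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1")
print(bert_module.get_signature_names())
# Expected to include at least 'tokens' and 'tokenization_info'.
print(bert_module.get_input_info_dict(signature="tokens"))
print(bert_module.get_output_info_dict(signature="tokens"))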

lytum commented 5 years ago

A TF-hub wrapper around BertModel will be available soon, but this module will assume the input has been tokenized in a compatible way (i.e., with tokenization.py). The tricky part is the tokenization.

* The tokenization cannot be implemented with current TF ops

* We can wrap it in a custom op (we also have a C++ implementation available internally), but TF hub has limited support for custom ops.

* It's not trivial to figure out the API of a tokenization op that will work for everyone. (E.g., how to deal with truncation vs. sliding window for long sequences, how to deal with projection from raw -> tokenized for the labels.)

Has the C++ implementation for tokenization been published, and if so, where can we get it?
