jesuiskelly opened this issue 3 years ago (status: Open)
Hi @jesuiskelly. For the same dataset, try setting random labels from 1 to 3 for the output column. Then change this part:
```r
input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs  = list(input_1, input_2)

dense   = get_layer(model, name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units = 3L, activation = 'softmax',  # 3 labels, so 3 units and softmax activation
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
```
And change the loss to categorical cross-entropy:
```r
model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps,
                    warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)
```
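One caveat: `categorical_crossentropy` expects one-hot encoded targets, while integer class ids pair with `sparse_categorical_crossentropy`. A minimal sketch, assuming 0-based integer class ids stored in a hypothetical vector `class_ids`:

```r
library(keras)

# categorical_crossentropy expects one-hot targets; to_categorical() assumes
# the class ids are 0-based (0, 1, 2), not 1-based
targets_onehot = to_categorical(class_ids, num_classes = 3)

# alternatively, keep the integer ids and use loss = 'sparse_categorical_crossentropy'
```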
Thanks so much for coming back to me. This is very helpful! I tried it on my code and made a couple of changes.
The model was fitted; however, I'm concerned that I might have made some mistakes, as the accuracy is very low: 0.2042.
The labels I used are not randomly generated; rather, they are taken from here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/, which is a virtually identical dataset, except that each of the label columns contains 0 or 1. I created the target column using the following code:
```r
train_data_raw <- read_csv(paste0(path_proj, path_data, "/train.csv"))  # downloaded from the link above
tmp_data = data.table(train_data_raw[1:42000, ])
tmp_col  = names(train_data_raw[3:length(names(tmp_data))])
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols = tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])

tmp_data[target == "obscene", target := 0L] %>% .[target == "toxic", target := 1L] %>%
  .[target == "insult", target := 2L] %>% .[target == "severe_toxic", target := 3L] %>%
  .[target == "identity_hate", target := 4L] %>% .[target == "threat", target := 5L]
unique(tmp_data[, target])

train_data = tmp_data[1:40000, ]
test_data  = tmp_data[40001:42000, ]
```
In other words, given that my labels are not random, I'd expect BERT to deliver much higher accuracy. Do you think I have done something wrong?
Separately, it would be good to know if there is a way to get a probability of prediction for each of the labels. I copy my code below in case it is of help.
Thank you so much for your help and advice again!
```r
library(tidyverse)
library(keras)
library(reticulate)
library(data.table)

Sys.timezone()
Sys.setenv(TZ = "UTC")
options(scipen = 999)
Sys.setenv(TF_KERAS = 1)

reticulate::py_config()                        # 3.6
reticulate::py_module_available('keras_bert')  # TRUE
tensorflow::tf_version()
k_bert = import('keras_bert')

path_b_pret = paste0(path_proj, path_data, "/uncased_L-12_H-768_A-12")
path_b_conf = file.path(path_b_pret, "bert_config.json")
path_b_chkp = file.path(path_b_pret, "bert_model.ckpt")
path_b_vcab = paste0(path_b_pret, "/vocab.txt")

token_dict = k_bert$load_vocabulary(path_b_vcab)
tokenizer  = k_bert$Tokenizer(token_dict)

seq_length    = 50L
bch_size      = 70
epochs        = 1
learning_rate = 1e-4
DATA_COLUMN   = 'comment_text'
LABEL_COLUMN  = 'target'

model = k_bert$load_trained_model_from_checkpoint(
  path_b_conf, path_b_chkp, training = T, trainable = T, seq_len = seq_length)
summary(model)

tokenize_fun = function(dataset) {
  c(indices, target, segments) %<-% list(list(), list(), list())
  for (i in 1:nrow(dataset)) {
    c(indices_tok, segments_tok) %<-%
      tokenizer$encode(dataset[[DATA_COLUMN]][i], max_len = seq_length)  # encode with padding
    indices  = indices  %>% append(list(as.matrix(indices_tok)))
    target   = target   %>% append(dataset[[LABEL_COLUMN]][i])
    segments = segments %>% append(list(as.matrix(segments_tok)))
  }
  return(list(indices, segments, target))
}

tmp_data = data.table(train_data_raw[1:42000, ])
tmp_col  = names(train_data_raw[3:length(names(tmp_data))])
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols = tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])

tmp_data[target == "obscene", target := 0L] %>% .[target == "toxic", target := 1L] %>%
  .[target == "insult", target := 2L] %>% .[target == "severe_toxic", target := 3L] %>%
  .[target == "identity_hate", target := 4L] %>% .[target == "threat", target := 5L]
unique(tmp_data[, target])  # check

train_data = tmp_data[1:40000, ]
test_data  = tmp_data[40001:42000, ]

c(x_train, x_segment, y_train) %<-% tokenize_fun(train_data)
train    = do.call(cbind, x_train)   %>% t()
segments = do.call(cbind, x_segment) %>% t()
targets  = do.call(cbind, y_train)   %>% t()
concat   = c(list(train), list(segments))

c(decay_steps, warmup_steps) %<-% k_bert$calc_train_steps(
  targets %>% length(), batch_size = bch_size, epochs = epochs)

input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs  = list(input_1, input_2)
dense   = get_layer(model, name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units = 6L, activation = 'softmax',
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
freeze_weights(model, from = "NSP-Dense")  # not sure if this should be kept
summary(model)

model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps, warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'sparse_categorical_crossentropy',  # for multi-label
  metrics = 'accuracy'
)

history = model %>% fit(concat, targets,
                        epochs = epochs, batch_size = bch_size, validation_split = 0.2)

c(x_test, x_t_segment, y_test) %<-% tokenize_fun(test_data)
x_test      = do.call(cbind, x_test)      %>% t()
x_t_segment = do.call(cbind, x_t_segment) %>% t()
concat2 = c(list(x_test), list(x_t_segment))

res = model %>% predict(concat2)
```
Please take a look at this kernel: https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing- From that kernel:
Important Note: In general,

* For binary classification, we can have 1 output unit, use sigmoid activation in the output layer, and use binary cross-entropy loss
* For multi-class classification, we can have N output units, use softmax activation in the output layer, and use categorical cross-entropy loss
* For multi-label classification, we can have N output units, use sigmoid activation in the output layer, and use binary cross-entropy loss
Maybe this could help you get a better score. Also, please use AUC as the metric: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC (a sketch of this setup in R follows below).
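To make the multi-label option concrete in R, here is a minimal sketch of the output head and compile step. It is a sketch only, not code from this thread: it reuses the `model`, `k_bert`, `decay_steps`, `warmup_steps`, and `learning_rate` objects defined earlier, and assumes `library(tensorflow)` is loaded so `tf` is available for the AUC metric.

```r
# multi-label head: one sigmoid unit per label, binary cross-entropy loss
input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs  = list(input_1, input_2)

dense   = get_layer(model, name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units = 6L, activation = 'sigmoid',
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)

model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps, warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'binary_crossentropy',
  metrics = list(tf$keras$metrics$AUC())
)
```

With this head, the targets passed to `fit()` should be the raw 0/1 label matrix (one column per label) rather than a single integer class id per row.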
Hi there, thanks for this; it does help. I've now made the changes you suggested.
Now I get an AUC of ~0.6 to ~0.7 (depending on the learning rate, etc.), which is much better.
The only place where I'm stuck now is getting meaningful prediction results. The code `model %>% predict(concat2)` produces output that makes little sense, given how little variation there is in the numbers between rows. I tried predict_proba(), but it is not a method available for this model. Any ideas on how I can get the probability of each label out?
Thank you so much for your help again!
I think you are doing it right. This is how it is done on the Python side: https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-

```python
y_preds = rnn_model.predict(test_data)
# Assign the predictions made by the model to the final test dataset
df_test[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_preds
```
Just take more data. At least 200k rows and see if it helps.
Thanks so much. Good to know that at least it's not my mistake :)
Great, I will experiment some more. Thanks again for your help and for making the tutorial available.
Much appreciated! Kelly
Hi there, thanks for the tutorial here: https://blogs.rstudio.com/ai/posts/2019-09-30-bert-r/ It's very useful! I wonder if you have one on multi-label classification (for essentially the same dataset), or some code that would help me do that? Thank you very much in advance!