jesuiskelly opened this issue 3 years ago (status: Open)
Hi @jesuiskelly. For the same dataset, try setting random labels from 1 to 3 for the output column. Then change this part:
```r
input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs  = list(input_1, input_2)

dense   = get_layer(model, name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units = 3L, activation = 'softmax',  # 3 labels, so 3 units and softmax activation
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
```
And change the loss to categorical cross-entropy:
```r
model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps,
                    warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)
```
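One caveat: `categorical_crossentropy` expects one-hot encoded targets, while integer class ids pair with `sparse_categorical_crossentropy`. A minimal sketch, assuming 0-based integer class ids stored in a hypothetical vector `class_ids`:

```r
library(keras)

# categorical_crossentropy expects one-hot targets; to_categorical() assumes
# the class ids are 0-based (0, 1, 2), not 1-based
targets_onehot = to_categorical(class_ids, num_classes = 3)

# alternatively, keep the integer ids and use loss = 'sparse_categorical_crossentropy'
```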
Thanks so much for coming back to me. This is very helpful! I tried it on my code and made a couple of changes.
The model was fitted; however, I'm concerned that I might have made some mistakes, as the accuracy is very low: 0.2042.
The labels I used are not randomly generated; rather, they are taken from here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/, which is a virtually identical dataset, except that each of the label columns contains 0 or 1. I created the target column using the following code:
```r
train_data_raw <- read_csv(paste0(path_proj, path_data, "/train.csv"))  # downloaded from the link above
tmp_data = data.table(train_data_raw[1:42000, ])
tmp_col  = names(train_data_raw[3:length(names(tmp_data))])
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols = tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])

tmp_data[target == "obscene", target := 0L] %>% .[target == "toxic", target := 1L] %>%
  .[target == "insult", target := 2L] %>% .[target == "severe_toxic", target := 3L] %>%
  .[target == "identity_hate", target := 4L] %>% .[target == "threat", target := 5L]
unique(tmp_data[, target])

train_data = tmp_data[1:40000, ]
test_data  = tmp_data[40001:42000, ]
```
In other words, given that my labels are not random, I'd expect BERT to deliver much higher accuracy. Do you think I have done something wrong?
Separately, it would be good to know if there is a way to get a probability of prediction for each of the labels. I copy my code below in case it is of help.
Thank you so much for your help and advice again!
```r
library(tidyverse)
library(keras)
library(reticulate)
library(data.table)

Sys.timezone()
Sys.setenv(TZ = "UTC")
options(scipen = 999)
Sys.setenv(TF_KERAS = 1)

reticulate::py_config()                        # 3.6
reticulate::py_module_available('keras_bert')  # TRUE
tensorflow::tf_version()
k_bert = import('keras_bert')

path_b_pret = paste0(path_proj, path_data, "/uncased_L-12_H-768_A-12")
path_b_conf = file.path(path_b_pret, "bert_config.json")
path_b_chkp = file.path(path_b_pret, "bert_model.ckpt")
path_b_vcab = paste0(path_b_pret, "/vocab.txt")

token_dict = k_bert$load_vocabulary(path_b_vcab)
tokenizer  = k_bert$Tokenizer(token_dict)

seq_length    = 50L
bch_size      = 70
epochs        = 1
learning_rate = 1e-4
DATA_COLUMN   = 'comment_text'
LABEL_COLUMN  = 'target'

model = k_bert$load_trained_model_from_checkpoint(
  path_b_conf, path_b_chkp, training = T, trainable = T, seq_len = seq_length)
summary(model)

tokenize_fun = function(dataset) {
  c(indices, target, segments) %<-% list(list(), list(), list())
  for (i in 1:nrow(dataset)) {
    c(indices_tok, segments_tok) %<-%
      tokenizer$encode(dataset[[DATA_COLUMN]][i], max_len = seq_length)  # encode with padding
    indices  = indices  %>% append(list(as.matrix(indices_tok)))
    target   = target   %>% append(dataset[[LABEL_COLUMN]][i])
    segments = segments %>% append(list(as.matrix(segments_tok)))
  }
  return(list(indices, segments, target))
}

tmp_data = data.table(train_data_raw[1:42000, ])
tmp_col  = names(train_data_raw[3:length(names(tmp_data))])
tmp_data[, target := names(.SD)[max.col(.SD)], .SDcols = tmp_col]
tmp_data[is.na(target)]
unique(tmp_data[, target])

tmp_data[target == "obscene", target := 0L] %>% .[target == "toxic", target := 1L] %>%
  .[target == "insult", target := 2L] %>% .[target == "severe_toxic", target := 3L] %>%
  .[target == "identity_hate", target := 4L] %>% .[target == "threat", target := 5L]
unique(tmp_data[, target])  # check

train_data = tmp_data[1:40000, ]
test_data  = tmp_data[40001:42000, ]

c(x_train, x_segment, y_train) %<-% tokenize_fun(train_data)
train    = do.call(cbind, x_train)   %>% t()
segments = do.call(cbind, x_segment) %>% t()
targets  = do.call(cbind, y_train)   %>% t()
concat   = c(list(train), list(segments))

c(decay_steps, warmup_steps) %<-% k_bert$calc_train_steps(
  targets %>% length(), batch_size = bch_size, epochs = epochs)

input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs  = list(input_1, input_2)
dense   = get_layer(model, name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units = 6L, activation = 'softmax',
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)
freeze_weights(model, from = "NSP-Dense")  # not sure if this should be kept
summary(model)

model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps, warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'sparse_categorical_crossentropy',  # for multi-label
  metrics = 'accuracy'
)

history = model %>% fit(concat, targets,
                        epochs = epochs, batch_size = bch_size, validation_split = 0.2)

c(x_test, x_t_segment, y_test) %<-% tokenize_fun(test_data)
x_test      = do.call(cbind, x_test)      %>% t()
x_t_segment = do.call(cbind, x_t_segment) %>% t()
concat2 = c(list(x_test), list(x_t_segment))

res = model %>% predict(concat2)
```
Please take a look at this kernel: https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing- From that kernel:
Important Note: In general,

* For binary classification, we can have 1 output unit, use sigmoid activation in the output layer, and use binary cross-entropy loss
* For multi-class classification, we can have N output units, use softmax activation in the output layer, and use categorical cross-entropy loss
* For multi-label classification, we can have N output units, use sigmoid activation in the output layer, and use binary cross-entropy loss
Maybe this could help you get a better score. Also, please use AUC as the metric: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/AUC (a sketch of this setup in R follows below).
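To make the multi-label option concrete in R, here is a minimal sketch of the output head and compile step. It is a sketch only, not code from this thread: it reuses the `model`, `k_bert`, `decay_steps`, `warmup_steps`, and `learning_rate` objects defined earlier, and assumes `library(tensorflow)` is loaded so `tf` is available for the AUC metric.

```r
# multi-label head: one sigmoid unit per label, binary cross-entropy loss
input_1 = get_layer(model, name = 'Input-Token')$input
input_2 = get_layer(model, name = 'Input-Segment')$input
inputs  = list(input_1, input_2)

dense   = get_layer(model, name = 'NSP-Dense')$output
outputs = dense %>% layer_dense(units = 6L, activation = 'sigmoid',
                                kernel_initializer = initializer_truncated_normal(stddev = 0.02),
                                name = 'output')

model = keras_model(inputs = inputs, outputs = outputs)

model %>% compile(
  k_bert$AdamWarmup(decay_steps = decay_steps, warmup_steps = warmup_steps, lr = learning_rate),
  loss = 'binary_crossentropy',
  metrics = list(tf$keras$metrics$AUC())
)
```

With this head, the targets passed to `fit()` should be the raw 0/1 label matrix (one column per label) rather than a single integer class id per row.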
Hi there, thanks for this; it does help. I've now made the changes you suggested.
Now I get an AUC of ~0.6 to ~0.7 (depending on the learning rate, etc.), which is much better.
The only place where I'm stuck now is getting meaningful prediction results. The code `model %>% predict(concat2)` produces output that makes little sense, given how little variation there is in the numbers between rows. I tried predict_proba(), but it is not a method available for this model. Any ideas on how I can get the probability of each label out?
Thank you so much for your help again!
I think you are doing it right. This is how it is done on the Python side: https://www.kaggle.com/anirbansen3027/jtcc-multilabel-lstm-keras#3.-Text-Preprocessing-

```python
y_preds = rnn_model.predict(test_data)
# Assign the predictions made by the model to the final test dataset
df_test[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_preds
```
Just take more data. At least 200k rows and see if it helps.
Thanks so much. Good to know that at least it's not my mistake :)
Great, I will experiment some more. Thanks again for your help and for making the tutorial available.
Much appreciated! Kelly
Hi there, thanks for the tutorial here: https://blogs.rstudio.com/ai/posts/2019-09-30-bert-r/ It's very useful! I wonder if you have one on multi-label classification (for essentially the same dataset), or some code that would help me do that? Thank you very much in advance!