rstudio / keras3

R Interface to Keras
https://keras3.posit.co/

Text generator for Embedding Layer getting out of index error near end of training dataset #377

Open · atroiano opened this issue 6 years ago

atroiano commented 6 years ago

I am creating a custom embedding layer for a corpus of high-dimensional codes. I am using the code at this link as an example:

https://tensorflow.rstudio.com/blog/word-embeddings-with-keras.html

My only change is that window_size is 3. Everything appears to work correctly if I stop the model before getting within 3 text sequences of the end of my dataset. When it gets near the end, it fails with an out-of-index error.

IndexError: list index out of range

13. stop(structure(list(message = "IndexError: list index out of range", 
        call = py_call_impl(callable, dots$args, dots$keywords), 
        cppstack = structure(list(file = "", line = -1L, stack = "C++ stack not available on this system"), .Names = c("file", 
        "line", "stack"), class = "Rcpp_stack_trace")), .Names = c("message", ...
12. (structure(function (...) 
    {
        dots <- py_resolve_dots(list(...))
        result <- py_call_impl(callable, dots$args, dots$keywords) ...
11. do.call(func, args)
10. call_generator_function(object$fit_generator, list(generator = generator, 
        steps_per_epoch = as.integer(steps_per_epoch), epochs = as.integer(epochs), 
        verbose = as.integer(verbose), callbacks = normalize_callbacks(view_metrics, 
            callbacks), validation_data = validation_data, validation_steps = as_nullable_integer(validation_steps), ...
 9. fit_generator(., skipgrams_generator(embed_layer_char, tokenizer, 
        skip_window, negative_samples = 1), steps_per_epoch = 10000, 
        epochs = 5)
 8. function_list[[k]](value)
 7. withVisible(function_list[[k]](value))
 6. freduce(value, `_function_list`)
 5. `_fseq`(`_lhs`)
 4. eval(quote(`_fseq`(`_lhs`)), env, env)
 3. eval(quote(`_fseq`(`_lhs`)), env, env)
 2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 1. model %>% fit_generator(skipgrams_generator(embed_layer_char, 
        tokenizer, skip_window, negative_samples = 1), steps_per_epoch = 10000, 
        epochs = 5)

Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: list index out of range.

For example: my total dataset is 46193 observations and, with 10000 steps per epoch, the code errors on the 5th epoch at step 6190, i.e., on record 46190. I have also seen it error on the very last sample, i.e., when it hits record 46193.
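A quick sanity check of that arithmetic (one record is consumed per training step):

steps_per_epoch  <- 10000
completed_epochs <- 4     # epochs 1 through 4 finish cleanly
failing_step     <- 6190  # step within epoch 5 at failure
completed_epochs * steps_per_epoch + failing_step
# [1] 46190  (3 records short of the 46193 total)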

Another factor: some of my words contain punctuation. For example, a sentence could be: 88152 C59.20 C23.20 None None None None
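A side note on those dotted codes: text_tokenizer() strips punctuation by default (the period is in its default filter string), so C59.20 would be split into C59 and 20. A minimal sketch of keeping the period intact, assuming the stock Keras default filters with "." removed:

library(keras)

# default Keras filter string minus ".", so codes like "C59.20" stay one token
filters_keep_dot <- "!\"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n"

tokenizer <- text_tokenizer(num_words = 20000, filters = filters_keep_dot) %>%
  fit_text_tokenizer(embed_layer_char)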



embed_layer_char <- stringi::stri_enc_toutf8(embed_layer_char)
embed_layer_char <- stringi::stri_enc_toascii(embed_layer_char)

# if fewer than 20k unique words, set num_words accordingly
guess <- 20000 # initial guess at the vocabulary cap
tokenizer <- text_tokenizer(num_words = guess) %>%
  fit_text_tokenizer(embed_layer_char)
num_words <- length(tokenizer$word_index) %>%
  min(20000)
tokenizer <- text_tokenizer(num_words = num_words) %>%
  fit_text_tokenizer(embed_layer_char)

print(num_words)
# current number of words = 2033

gen <- texts_to_sequences_generator(tokenizer, sample(embed_layer_char)) # (unused; the generator function below builds its own)

skipgrams_generator <- function(text, tokenizer, window_size, negative_samples=1) {
  sample_text <<- sample(text)
  gen <- texts_to_sequences_generator(tokenizer, sample_text)
  function() {
    skip <- generator_next(gen) %>%
      skipgrams(
        vocabulary_size = tokenizer$num_words, 
        window_size = window_size, 
        negative_samples = negative_samples
      )
    x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
    y <- skip$labels %>% as.matrix(ncol = 1)
    list(x, y)
  }
}

embedding_size <- 64  # Dimension of the embedding vector.
skip_window <- 5       # How many words to consider left and right.
num_sampled <- 1       # Number of negative examples to sample for each word.

input_target <- layer_input(shape = 1)
input_context <- layer_input(shape = 1)

embedding <- layer_embedding(
  input_dim = tokenizer$num_words+1, 
  output_dim = embedding_size, 
  input_length = 1, 
  name = "embedding"
)

target_vector <- input_target %>% 
  embedding() %>% 
  layer_flatten()

context_vector <- input_context %>%
  embedding() %>%
  layer_flatten()

dot_product <- layer_dot(list(target_vector, context_vector), axes = 1)
output <- layer_dense(dot_product, units = 1, activation = "sigmoid")

model <- keras_model(list(input_target, input_context), output)
model %>% compile(loss = "binary_crossentropy", optimizer = "adam")

summary(model)

model %>%
  fit_generator(
    skipgrams_generator(embed_layer_char, tokenizer, skip_window, negative_samples=1), 
    steps_per_epoch = 10000, epochs = 5
    )
atroiano commented 6 years ago

So it looks like the generator does not restart when it runs out of data to generate, so once your epochs times your steps per epoch exceed the size of your dataset, this error occurs. I will try to implement some code to handle this and share it if that resolves the issue.
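Roughly, a minimal sketch of that idea (untested, reusing the names from the code above): walk the shuffled corpus by index and wrap around at the end so the generator never runs dry, with texts_to_sequences() standing in for the exhaustible Python generator:

library(keras)
library(purrr)    # transpose(), map()

skipgrams_generator <- function(text, tokenizer, window_size, negative_samples = 1) {
  text <- sample(text)  # shuffle sentence order once up front
  i <- 0
  function() {
    # advance to the next usable sentence, wrapping back to 1 when exhausted
    repeat {
      i <<- i %% length(text) + 1
      seq <- texts_to_sequences(tokenizer, text[[i]])[[1]]
      if (length(seq) > 1) break  # skip empty / single-token sequences
    }
    skip <- skipgrams(
      seq,
      vocabulary_size = tokenizer$num_words,
      window_size = window_size,
      negative_samples = negative_samples
    )
    x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
    y <- skip$labels %>% as.matrix(ncol = 1)
    list(x, y)
  }
}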

DavidArenburg commented 5 years ago

@atroiano Did you solve this? This looks like a bug to me, so I'm not sure why you closed it.

atroiano commented 5 years ago

It's probably a bug. I added a line to the generator code that makes the last batch equal to the size of the remaining data. It's been about a year since I ran this code, so I don't know if it's still an issue in a newer version of Keras.

DavidArenburg commented 5 years ago

It is. Can you share the line you added, please?

atroiano commented 5 years ago

I've been looking for the code and can't seem to locate it; I'll try to reproduce it this week.


atroiano commented 5 years ago

Tried to reproduce it, and it looks like there is an issue with fit_generator:

ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays:

The output from the generator appears to be correct: it's a list containing a list of 2 arrays for the input and 1 array for the target.

If I save the first generated inputs to an object in the global workspace and train the model on them with fit(), it works fine.

dfalbel commented 5 years ago

I'll take a look at this later, but I think I had already proposed a fix for the generator here:

https://github.com/rstudio/keras/issues/740#issuecomment-495768749

atroiano commented 5 years ago

That fixed the generator. I'll try to reproduce the error I was having last year.


atroiano commented 5 years ago

@dfalbel I was able to get my old code running and it doesn't error anymore; it just hangs when training reaches the point where the generator has no new data.
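One workaround consistent with that (an assumption on my part: the generator yields one sentence per step and is never restarted) is to size the steps so training can never outrun the corpus:

n_sentences <- length(embed_layer_char)     # 46193 in the example above
epochs      <- 5
steps       <- floor(n_sentences / epochs)  # total steps across all epochs <= n_sentences

model %>% fit_generator(
  skipgrams_generator(embed_layer_char, tokenizer, skip_window, negative_samples = 1),
  steps_per_epoch = steps,
  epochs = epochs
)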

DavidArenburg commented 5 years ago

@atroiano What Keras/TF versions are you using?

msmith01 commented 5 years ago

I was getting this same error but managed to get it working again. My model looks like:

Model: "model"
___________________________________________________________________________________________________________________
Layer (type)                         Output Shape              Param #       Connected to                          
===================================================================================================================
input_1 (InputLayer)                 [(None, 1)]               0                                                   
___________________________________________________________________________________________________________________
input_2 (InputLayer)                 [(None, 1)]               0                                                   
___________________________________________________________________________________________________________________
embedding (Embedding)                (None, 1, 128)            2560128       input_1[0][0]                         
                                                                             input_2[0][0]                         
___________________________________________________________________________________________________________________
flatten (Flatten)                    (None, 128)               0             embedding[0][0]                       
___________________________________________________________________________________________________________________
flatten_1 (Flatten)                  (None, 128)               0             embedding[1][0]                       
___________________________________________________________________________________________________________________
dot (Dot)                            (None, 1)                 0             flatten[0][0]                         
                                                                             flatten_1[0][0]                       
___________________________________________________________________________________________________________________
dense (Dense)                        (None, 1)                 2             dot[0][0]                             
===================================================================================================================
Total params: 2,560,130
Trainable params: 2,560,130
Non-trainable params: 0
___________________________________________________________________________________________________________________

And my sessionInfo() is the following:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
 [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tensorflow_1.13.1.9000     tidytext_0.2.1             tm_0.7-6                   NLP_0.2-0                 
 [5] edgarWebR_1.0.0            xlsx_0.6.1                 tidyquant_0.5.6            forcats_0.4.0             
 [9] stringr_1.4.0              purrr_0.3.2                readr_1.3.1                tibble_2.1.3              
[13] tidyverse_1.2.1            quantmod_0.4-15            TTR_0.23-4                 PerformanceAnalytics_1.5.3
[17] xts_0.11-2                 zoo_1.8-6                  lubridate_1.7.4            ggrepel_0.8.1             
[21] ggforce_0.2.2              ggplot2_3.2.0              reticulate_1.12.0-9007     tidyr_0.8.3               
[25] dplyr_0.8.3                keras_2.2.4.1.9001        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        lattice_0.20-38   xlsxjars_0.6.1    assertthat_0.2.1  zeallot_0.1.0     slam_0.1-45      
 [7] R6_2.4.0          cellranger_1.1.0  backports_1.1.4   httr_1.4.0        pillar_1.4.2      tfruns_1.4       
[13] rlang_0.4.0       lazyeval_0.2.2    curl_3.3          readxl_1.3.1      rstudioapi_0.10   whisker_0.3-2    
[19] Matrix_1.2-15     polyclip_1.10-0   munsell_0.5.0     broom_0.5.2       janeaustenr_0.1.5 compiler_3.5.2   
[25] modelr_0.1.4      pkgconfig_2.0.2   base64enc_0.1-3   tidyselect_0.2.5  quadprog_1.5-7    crayon_1.3.4     
[31] withr_2.1.2       SnowballC_0.6.0   MASS_7.3-51.1     grid_3.5.2        Quandl_2.10.0     nlme_3.1-137     
[37] jsonlite_1.6      gtable_0.3.0      magrittr_1.5      tokenizers_0.2.1  scales_1.0.0      cli_1.1.0        
[43] stringi_1.4.3     farver_1.1.0      xml2_1.2.0        generics_0.0.2    tools_3.5.2       glue_1.3.1       
[49] tweenr_1.0.1      hms_0.4.2         parallel_3.5.2    colorspace_1.4-1  rvest_0.3.4       rJava_0.9-11     
[55] haven_2.1.1 

EDIT: It errored out at 499.

W0723 22:48:34.745950 139648657897856 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

2019-07-23 22:48:34.978361: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Epoch 1/5
   499/100000 [..............................] - ETA: 89:39:19 - loss: 0.6743

Full error:

https://imgur.com/HCK6FWM

Back to gensim in Python :(

BioinfoMonzino commented 5 years ago

I ran into the same issue and fixed it by changing the generator function a little:

skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) {
  gen <- texts_to_sequences_generator(tokenizer, sample(text))
  function() {
    skip <- 0
    # keep pulling sequences until one is long enough to produce skip-grams
    while (length(skip) < window_size) {
      skip <- generator_next(gen)
    }
    skip <- skipgrams(
      skip,
      vocabulary_size = tokenizer$num_words,
      window_size = window_size,
      negative_samples = negative_samples  # was hardcoded to 1; use the argument
    )
    x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
    y <- skip$labels %>% as.matrix(ncol = 1)
    list(x, y)
  }
}

The error occurs when generator_next(gen) generates empty arrays.

Best, Mattia