atroiano opened this issue 6 years ago
So it looks like the generator does not restart when it runs out of data, so once epochs × batches grows larger than your dataset, this error occurs. I will try to implement some code to handle this and will share it if that resolves the issue.
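For reference, a minimal sketch of what "restarting" would look like in a hand-written R generator (all names here are hypothetical, not from the issue):

make_looping_generator <- function(x, y, batch_size) {
  i <- 1
  function() {
    if (i > nrow(x)) i <<- 1                   # restart at the beginning
    idx <- i:min(i + batch_size - 1, nrow(x))  # never index past the end
    i <<- i + batch_size
    list(x[idx, , drop = FALSE], y[idx])
  }
}

Because the index wraps around, fit_generator() can keep calling this function indefinitely, no matter how large steps_per_epoch × epochs gets.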
@atroiano Did you solve this? This looks like a bug to me, so I'm not sure why you closed it.
It's probably a bug. I added a line to the generator code that makes the last batch equal to the size of the remaining data. It's been about a year since I ran this code, so I don't know if it's still an issue in a newer version of Keras.
It is. Can you share the line you've added, please?
I've been looking for the code and I can't seem to locate it. I'll try to reproduce it this week.
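The exact line never resurfaced in the thread, but the idea described above amounts to a clamp like this (a reconstruction; i, n, and batch_size are placeholders, not the original code):

# Last batch shrinks to whatever data remains instead of indexing past the end:
idx <- i:min(i + batch_size - 1, n)   # n = total number of observations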
I tried to reproduce it, and it looks like there is an issue with fit_generator:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays:
The output from the generator appears to be correct: it's a list containing a list of 2 arrays for the input and 1 array for the target.
I can save the first generated inputs to an object in the global workspace and train the model using fit(), and it works fine.
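For concreteness, one generated batch for this two-input model has the following shape (a sketch with made-up values; only the structure matters):

batch <- list(
  list(
    matrix(c(12L, 7L, 99L), ncol = 1),  # input 1: target word indices
    matrix(c(34L, 2L, 15L), ncol = 1)   # input 2: context word indices
  ),
  matrix(c(1, 0, 1), ncol = 1)          # target: 1 = real pair, 0 = negative
)
# Feeding one such batch to fit() directly trains without complaint:
# model %>% fit(batch[[1]], batch[[2]], epochs = 1, batch_size = 3)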
I'll take a look at this later, but I think I had already proposed a fix for the generator here:
https://github.com/rstudio/keras/issues/740#issuecomment-495768749
That fixed the generator. I'll try to reproduce the error I was having last year.
@dfalbel I was able to get my old code running and it doesn't error anymore; it just hangs when the training step gets to the point where there is no new data.
@atroiano What Keras/TF versions are you using?
I was getting this same error but managed to get it working again. My model looks like:
Model: "model"
________________________________________________________________________________
Layer (type)            Output Shape       Param #     Connected to
================================================================================
input_1 (InputLayer)    [(None, 1)]        0
________________________________________________________________________________
input_2 (InputLayer)    [(None, 1)]        0
________________________________________________________________________________
embedding (Embedding)   (None, 1, 128)     2560128     input_1[0][0]
                                                       input_2[0][0]
________________________________________________________________________________
flatten (Flatten)       (None, 128)        0           embedding[0][0]
________________________________________________________________________________
flatten_1 (Flatten)     (None, 128)        0           embedding[1][0]
________________________________________________________________________________
dot (Dot)               (None, 1)          0           flatten[0][0]
                                                       flatten_1[0][0]
________________________________________________________________________________
dense (Dense)           (None, 1)          2           dot[0][0]
================================================================================
Total params: 2,560,130
Trainable params: 2,560,130
Non-trainable params: 0
________________________________________________________________________________
And my sessionInfo() is the following:
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 LC_ADDRESS=en_US.UTF-8
[10] LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tensorflow_1.13.1.9000 tidytext_0.2.1 tm_0.7-6 NLP_0.2-0
[5] edgarWebR_1.0.0 xlsx_0.6.1 tidyquant_0.5.6 forcats_0.4.0
[9] stringr_1.4.0 purrr_0.3.2 readr_1.3.1 tibble_2.1.3
[13] tidyverse_1.2.1 quantmod_0.4-15 TTR_0.23-4 PerformanceAnalytics_1.5.3
[17] xts_0.11-2 zoo_1.8-6 lubridate_1.7.4 ggrepel_0.8.1
[21] ggforce_0.2.2 ggplot2_3.2.0 reticulate_1.12.0-9007 tidyr_0.8.3
[25] dplyr_0.8.3 keras_2.2.4.1.9001
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 lattice_0.20-38 xlsxjars_0.6.1 assertthat_0.2.1 zeallot_0.1.0 slam_0.1-45
[7] R6_2.4.0 cellranger_1.1.0 backports_1.1.4 httr_1.4.0 pillar_1.4.2 tfruns_1.4
[13] rlang_0.4.0 lazyeval_0.2.2 curl_3.3 readxl_1.3.1 rstudioapi_0.10 whisker_0.3-2
[19] Matrix_1.2-15 polyclip_1.10-0 munsell_0.5.0 broom_0.5.2 janeaustenr_0.1.5 compiler_3.5.2
[25] modelr_0.1.4 pkgconfig_2.0.2 base64enc_0.1-3 tidyselect_0.2.5 quadprog_1.5-7 crayon_1.3.4
[31] withr_2.1.2 SnowballC_0.6.0 MASS_7.3-51.1 grid_3.5.2 Quandl_2.10.0 nlme_3.1-137
[37] jsonlite_1.6 gtable_0.3.0 magrittr_1.5 tokenizers_0.2.1 scales_1.0.0 cli_1.1.0
[43] stringi_1.4.3 farver_1.1.0 xml2_1.2.0 generics_0.0.2 tools_3.5.2 glue_1.3.1
[49] tweenr_1.0.1 hms_0.4.2 parallel_3.5.2 colorspace_1.4-1 rvest_0.3.4 rJava_0.9-11
[55] haven_2.1.1
EDIT: It errored out at step 499.
W0723 22:48:34.745950 139648657897856 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
2019-07-23 22:48:34.978361: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Epoch 1/5
499/100000 [..............................] - ETA: 89:39:19 - loss: 0.6743
Full error:
Back to gensim in Python :(
I ran into the same issue and fixed it by changing the generator function a little:
library(keras)
library(purrr)   # for transpose() and map()

skipgrams_generator <- function(text, tokenizer, window_size, negative_samples) {
  gen <- texts_to_sequences_generator(tokenizer, sample(text))
  function() {
    # The fix: keep pulling sequences until we get one at least as long as
    # the window, so empty (or too-short) arrays are never passed downstream.
    skip <- integer(0)
    while (length(skip) < window_size) {
      skip <- generator_next(gen)
    }
    skip <- skipgrams(
      skip,
      vocabulary_size = tokenizer$num_words,
      window_size = window_size,
      negative_samples = negative_samples  # was hard-coded to 1
    )
    x <- transpose(skip$couples) %>% map(. %>% unlist %>% as.matrix(ncol = 1))
    y <- skip$labels %>% as.matrix(ncol = 1)
    list(x, y)
  }
}
The error occurs when generator_next(gen) returns empty arrays.
Best, Mattia
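For anyone landing here: the patched generator is consumed the same way as in the word-embeddings blog post linked below. A sketch of the call, using the hyperparameters that appear earlier in this thread (model, text, and tokenizer are assumed to exist):

model %>% fit_generator(
  skipgrams_generator(text, tokenizer, window_size = 3, negative_samples = 1),
  steps_per_epoch = 10000,
  epochs = 5
)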
I am creating a custom embedding layer for a corpus of high-dimensional codes. I am using the code at this link as an example:
https://tensorflow.rstudio.com/blog/word-embeddings-with-keras.html
My only change is that window_size is 3. Everything appears to work correctly if I stop the model before getting within 3 text sequences of the end of my dataset. When it gets near the end, it errors and gives me an index out of range:
Error in py_call_impl(callable, dots$args, dots$keywords) : IndexError: list index out of range.
For example: my total dataset is 46193 observations and, with 10000 samples per epoch, the code errors on the 5th epoch at step 6190, i.e., on record 46190. I have also seen it error on the last sample, i.e., when it hits record 46193.
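That arithmetic is consistent with the non-restarting generator described at the top of the thread: 4 full epochs consume 40000 sequences, and the 5th fails 6190 steps in, right at record 46190. A sketch of the bound (numbers taken from this comment):

n_obs           <- 46193
steps_per_epoch <- 10000
epochs          <- 5
steps_per_epoch * epochs             # 50000 calls, but only 46193 sequences exist
# Until the generator loops, keep total calls within the data:
safe_steps <- floor(n_obs / epochs)  # 9238 steps per epoch stays in bounds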
Another factor: some of my words have punctuation in them. For example, a sentence could be: 88152 C59.20 C23.20 None None None None