mgrankin / ru_transformers


num_samples should be a positive integer value, but got num_samples=0 #27

Closed. graynk closed this issue 4 years ago

graynk commented 4 years ago

Hi. I'm extremely new to neural networks, but I've managed to set up a TPU (my RTX 2060 was not enough, it seems, since I was getting OOMs) and to get the python /pytorch/xla/test/test_train_mp_mnist.py test running. I've downloaded a medium untuned Russian model and put it into $OUTPUT. Now I want to finetune it on a log of chat messages (about 1 million messages, a ~50MB text file). I have a few questions that the README didn't make clear to me:

  1. Should I just use the original text file, or the one I can get from the corpus notebook?
  2. The validation file is just plain text that looks like what I'd like to get at the end when generating text, is that correct?
  3. No matter what train file I choose I get this error:
    06/12/2020 15:29:23 - INFO - __mp_main__ -   Loading features from ./data/full/chat.txt  
    Exception in device=TPU:1: num_samples should be a positive integer value, but got num_samples=0

    The cached folder does get created with some file in it, but the script fails to extract the features from my dataset, as I understand it. What could be the reason? Do I need to prepare the text in some other way? I also had to comment out these two lines in yt_encoder.py, since I was getting the "Setting 'max_len_single_sentence' is now deprecated. This value is automatically set up" error:

    self.max_len_single_sentence = 1024
    self.max_len_sentences_pair = 1024

    Could that be the reason, and if it is, how do I get rid of the initial error without breaking things? (My rough guess at what might be happening is sketched below.) Thanks
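    For what it's worth, here is my rough guess at where num_samples=0 could come from (just a sketch, not the repo's actual code): run_lm_finetuning.py-style scripts cut the tokenized corpus into blocks of block_size tokens, and block_size is usually derived from the tokenizer's max_len. If max_len is left at its huge default sentinel (which is what happens when the two lines above are commented out), not a single full block fits into the file, the dataset comes out empty, and the sampler refuses to start:

        from torch.utils.data import RandomSampler

        def build_examples(token_ids, block_size):
            # keep only complete blocks; an oversized block_size yields nothing
            return [token_ids[i:i + block_size]
                    for i in range(0, len(token_ids) - block_size + 1, block_size)]

        examples = build_examples(list(range(50_000)), block_size=int(1e12))
        print(len(examples))      # 0
        RandomSampler(examples)   # ValueError: num_samples should be a positive
                                  # integer value, but got num_samples=0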

mgrankin commented 4 years ago

Hello, @graynk

  1. There is some cleaning in the corpus notebook. It's useful, but not necessary.
  2. Yes. It shouldn't be a part of the training set, so you'll see how your model performs on unseen data. When it stops improving or starts getting worse, stop the training.
  3. I remember that bug. It can't build the cached dataset in multi-process mode. Try running the code in one thread first; it will generate the cache for all files. Then you can run it multithreaded. (A rough illustration of why the single-threaded pass helps is sketched below.)
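Roughly what goes wrong, as a simplified illustration (not the actual code in run_lm_finetuning.py): every TPU process runs the same check-then-build logic on the cache file, so one process can try to read a cache that another process is still writing and end up with an empty or partial dataset. A single-threaded first pass writes the complete cache, and the later multi-process run only ever hits the load branch:

    import os
    import pickle

    def load_or_build_examples(cache_path, build_fn):
        # every process runs this; in a multi-process run a worker can reach the
        # "load" branch while another worker is still writing the file
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as f:
                return pickle.load(f)
        examples = build_fn()
        with open(cache_path, "wb") as f:
            pickle.dump(examples, f)
        return examples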

I didn't get any deprecation warnings, so I can't help you with that.

graynk commented 4 years ago

It was indeed caused by my commenting out the deprecated code. I was not able to run it in single-threaded mode (it ignored CUDA_VISIBLE_DEVICES and --no_cuda for some reason, even though that worked on my own machine), so I built the cache on my own machine, uploaded it to Google Cloud and even hardcoded the path to the cache. That gave a different error, but then I realized that if I had to hardcode the path to the cache, the script must be generating a wrong cache name from those same values (instead of 1024 it was some huge number).
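To make the cache-name mismatch concrete, here is a rough paraphrase of how these scripts name the cache file (the exact format in the repo may differ): the block size is baked into the filename, so when block_size silently falls back to the tokenizer's huge sentinel value, the script looks for a cache file that was never written under that name:

    import os

    def cached_features_path(file_path, block_size):
        # the block size is part of the cache filename
        directory, filename = os.path.split(file_path)
        return os.path.join(directory, f"cached_lm_{block_size}_{filename}")

    print(cached_features_path("./data/full/chat.txt", 1024))         # expected name
    print(cached_features_path("./data/full/chat.txt", int(1e12)))    # the "huge number" name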

So, long story short: you need transformers==2.2.0, tensorboard==1.15.0 and fastai==1.0.59. With those versions it works with no changes to the code; maybe they should be added to tpu_requirements.txt.
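For reference, the corresponding pins in tpu_requirements.txt would look like this (versions taken from above):

    transformers==2.2.0
    tensorboard==1.15.0
    fastai==1.0.59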

mgrankin commented 4 years ago

Thank you, I'll update the requirements file.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.