gunthercox / ChatterBot

ChatterBot is a machine learning, conversational dialog engine for creating chat bots
https://chatterbot.readthedocs.io
BSD 3-Clause "New" or "Revised" License
13.96k stars 4.42k forks

Training with UbuntuCorpusTrainer fails #2341

Open AlkisPis opened 7 months ago

AlkisPis commented 7 months ago

It seems I can't train the chatbot with UbuntuCorpusTrainer. I tried several times. On the first run the trainer downloaded the TGZ file and extracted the TSV files; on every run after that I only received the following output:
INFO:chatterbot.chatterbot:File is already downloaded
INFO:chatterbot.chatterbot:File is already extracted
Then the script stopped responding.

Questions:
1) What could the problem be?
2) The file 'Ubuntu_dialogs.tgz', which I managed to download myself, contains thousands of TSV files. Where have they been extracted to, or converted to and stored as YML? They can't be found under the 'chatterbot_corpus' folder or anywhere else.

AlkisPis commented 7 months ago

I have debugged the training process of UbuntuCorpusTrainer and found that it had extracted the TSV files into the 'C:\Users\user\ubuntu_data\ubuntu_dialogs\dialogs' folder (on Windows). This is totally unacceptable: it uses a folder on the user's main disk instead of the folder in which ChatterBot has been created, as is done with the other corpus data! It is even more unacceptable if one has installed ChatterBot on a removable disk, like a flash drive, because there is an obvious reason for choosing that kind of installation: to be able to use it stand-alone!
Then I realized that the trainer tries to create a DB from exactly 23251 files stored in that folder. Of course the process gets "stuck" and it looks as if the script has crashed. No one can know why without debugging the training process! Totally UNACCEPTABLE! Both the way the files are extracted and the never-ending training of the chatbot with this kind of trainer, especially considering that it does not even accept a pathspec, unlike e.g. ChatterBotCorpusTrainer, with which you can use even a single YAML file for training.
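
For anyone who wants to check how much data the trainer is facing before starting a run, here is a minimal sketch using only the Python standard library. It counts the TSV files under a directory and copies a small sample of them to another folder, for example one on the flash drive. The function names `count_tsv_files` and `copy_sample` are hypothetical helpers written for this illustration; they are not part of ChatterBot.

```python
import shutil
from pathlib import Path


def count_tsv_files(root):
    """Return the number of .tsv files anywhere below `root`."""
    return sum(1 for _ in Path(root).rglob("*.tsv"))


def copy_sample(src_root, dst_root, limit=100):
    """Copy at most `limit` .tsv files from `src_root` into `dst_root`.

    Relative sub-folders are preserved. Returns the number of files copied.
    Hypothetical helper for illustration, not a ChatterBot API.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = 0
    # sorted() gives a stable, repeatable selection of files
    for src in sorted(src_root.rglob("*.tsv")):
        if copied >= limit:
            break
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied += 1
    return copied
```

Calling `count_tsv_files` on the extraction folder named above shows how many dialogs a full run would have to process, and `copy_sample` produces a smaller set to experiment with. Whether UbuntuCorpusTrainer can be pointed at such a folder is not confirmed in this thread. Its source appears to read an `ubuntu_corpus_data_directory` keyword argument, but treat that as an assumption and check `trainers.py` in your installed version.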

Any comments are welcome. I will keep this issue open just for a few days.