google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

spm.SentencePieceTrainer.train stuck in Jupyter Notebook #824

Closed shivanraptor closed 1 year ago

shivanraptor commented 1 year ago

I'm trying to train SentencePiece with the following code:

import sentencepiece as spm
import re
import os
import pandas as pd

# for reading the ZIP file without extracting it
import io
import zipfile

# prepare dataset
filenames = []
text = []
with zipfile.ZipFile('blogs.zip') as zf:
    for filename in zf.namelist():
        if filename.endswith('.xml'):
            with io.TextIOWrapper(zf.open(filename), encoding='utf-8', errors='ignore') as f:
                for line in f.readlines():
                    if re.search("<", line) or len(line) < 5:
                        continue
                    else:
                        text.append(line)

with open('blog_data.txt', 'w', encoding='utf-8') as fw:
    for l in text:
        fw.write(l)

# the resulting TXT file is 766 MB

# Train SentencePiece model from Blog Corpus
spm.SentencePieceTrainer.train('--model_type=bpe --input=blog_data.txt --model_prefix=bpe --vocab_size=500 --normalization_rule_name=nmt_nfkc')

The last line runs indefinitely. Jupyter Notebook marks the cell as still running ([*]), but the kernel activity indicator at the top shows the kernel as idle. No files are generated in the directory, and no error is reported.

What could be the reason? Is train() still running, or is it stuck?

The blogs.zip comes from here

Sample data of blog_data.txt:

      Well, everyone got up and going this morning.  It's still raining, but that's okay with me.  Sort of suits my mood.  I could easily have stayed home in bed with my book and the cats.  This has been a lot of rain though!  People have wet basements, there are lakes where there should be golf courses and fields, everything is green, green, green.  But, it is supposed to be 26 degrees by Friday, so we'll be dealing with mosquitos next week.  I heard Winnipeg described as an "Old Testament" city on  urlLink CBC Radio One  last week and it sort of rings true.  Floods, infestations, etc., etc..

      My four-year old never stops talking.  She'll say "Mom?" and when I say "Yes?", she'll say "Ummm.... ummm... oh yeah.  Where do lady bugs hide in the rain?"  Anything to hear her own voice. Very, very exhausting.    Now I remember!  This is why I go to work!   *Sigh*

      Actually it's not raining yet, but I bought 15 tickets to the  urlLink Goldeyes  game for my Mom's birthday tonight, and it is supposed to rain.  Do they cancel baseball games because of rain?  Although the ballpark is beautiful, it ain't the  urlLink SkyDome .  We used to go to the Jays games occassionally when we lived in Toronto and really like taking the kids to the Goldeyes games now.  I don't know what  urlLink Blue Jays  tickets cost now, but I'm sure it's cheaper here in Winnipeg.  Oh, I just checked and it  definitely  is!

      Ha! Just set up my RSS feed - that is so easy!  Why doesn't everyone do it?  Enough for today.  The sun is shining and I should be outside planting my poor flowers (that have spent far too long in their pots) but I have 3 kindergartners and a preschooler who are refusing to go outside.  Little gameboy junkies...  I should talk!  Last post today, I promise.

taku910 commented 1 year ago

There is not enough information to determine the cause at this stage. In the meantime, try to isolate the problem by checking whether training works in a vanilla Linux/Python environment or with the C++ command-line tool (spm_train).
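One way to follow this suggestion is sketched below, assuming spm_train is installed and on the PATH. The helper names (write_head_sample, build_spm_train_command, run_with_log) are hypothetical and not part of sentencepiece; the flags are the ones from the notebook call above. The idea is to cut a small sample of the corpus, then run the trainer outside the notebook kernel with its output sent to a log file, so it is easier to tell a slow run from a stuck one.

```python
import subprocess
from itertools import islice


def write_head_sample(src_path, dst_path, max_lines):
    """Copy the first max_lines lines of src_path to dst_path; return the count."""
    count = 0
    with open(src_path, encoding="utf-8", errors="ignore") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, max_lines):
            dst.write(line)
            count += 1
    return count


def build_spm_train_command(input_path, model_prefix, vocab_size=500,
                            model_type="bpe",
                            normalization_rule_name="nmt_nfkc"):
    """Build the spm_train argument list with the same flags as the notebook call."""
    return [
        "spm_train",
        f"--model_type={model_type}",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        f"--normalization_rule_name={normalization_rule_name}",
    ]


def run_with_log(cmd, log_path, timeout=None):
    """Run cmd as a separate process, writing stdout and stderr to log_path.

    Returns the exit code. Raises subprocess.TimeoutExpired if the process
    does not finish within timeout seconds.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        completed = subprocess.run(
            cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout
        )
    return completed.returncode
```

For example, write_head_sample('blog_data.txt', 'blog_sample.txt', 100000) followed by run_with_log(build_spm_train_command('blog_sample.txt', 'bpe_sample'), 'spm_train.log', timeout=600) would show whether training completes on a subset. The sample size and timeout here are arbitrary illustration values.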

taku910 commented 1 year ago

If there are no further discussions, we will automatically close this bug on May 1.