Qiaoxl opened this issue 6 years ago
Hi, I ran into another question about Chinese-English here. When I wanted to train a Chinese-to-English model, I used `--problem=translate_enzh_wmt32k_rev` / `--problem=translate_enzh_wmt8k_rev` to generate data. However, I got an error telling me to specify the right problem.
Have you run into this?
@houyu0930: For t2t-datagen, don't use `_rev` in problem names. (This is not related to the topic of this issue, which is about OOM.)
@martinpopel OK, I had the wrong idea about this. Thank you for your answer, I get it now. (Sorry for posting this question here where it is off-topic; I will be more careful next time.)
On this topic: I actually generated data for `--problem=translate_enzh_wmt32k` yesterday, and no error occurred; everything looks fine to me. Are you having trouble generating the data? I don't know what you mean by "50G". In my case, the total size of the generated data is 33M. Sorry I can't be of more help. :(
@houyu0930 See translate_enzh.py: "This is far from being the real WMT17 task - only toyset here you need to register to get UN data and CWT data...". The default dataset is only 220k lines; training on it won't give a good result. The whole dataset includes the UN data and CWT, about 24,000k lines in total. Generating with the 220k lines takes at most about 12 GB of memory. Generating with the 24,000k lines, I have 50 GB of memory and it is still not enough.
@Qiaoxl: This is strange; t2t-datagen should take just a little memory even for very big data (the shuffling uses 100 data shards by default). Can you report your T2T (and TF and Python) version? Can you retry with the newest version? Can you try generating another small translation dataset?
@martinpopel T2T: 1.6.2, TF: 1.7.0, Python: 3.6.5. I haven't tried the newest version, but I have used t2t-datagen to generate the librispeech_clean_small dataset, and it went well with just a little memory. Maybe you can check some details in the log file gen_20180605_155220.log
Tagging with question for now, but if we find that there is indeed something wrong I'll change to bug.
Me too. T2T: 1.6.2, TF: 1.7.0, Python: 3.5.3. Log: output_gen.txt
Me too. T2T: 1.6.6, TF: 1.8.0, Python: 3.6.3.
@rsepassi Hi, when I use a huge Chinese dataset to generate data, it costs a lot of memory, but if I put the vocab files in the data folder it is OK. So I think the subword tokenizer may need some improvements; it costs a lot of memory. (A sketch of the workaround is below.)
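For reference, a minimal sketch of that workaround, assuming a pre-built vocab file already sits in the data dir (the path below is a placeholder, not necessarily the exact filename the problem expects):

```python
# Sketch of the workaround above: reuse an existing vocab file from the data
# dir instead of letting datagen rebuild it from the huge raw corpus.
import os
from tensor2tensor.data_generators import text_encoder

vocab_path = "t2t_data/vocab.translate_enzh_wmt32k.32768.zh"  # placeholder name

if os.path.exists(vocab_path):
  # Loading an existing subword vocab is cheap: it only reads the file.
  zh_encoder = text_encoder.SubwordTextEncoder(vocab_path)
  print("Loaded vocab with %d subwords" % zh_encoder.vocab_size)
else:
  # Building the vocab from ~24M unspaced Chinese lines is the expensive step.
  print("No vocab found at %s; datagen would build one itself." % vocab_path)
```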
I am facing the same issue with T2T 1.6.3, getting the same output as in the log posted by @Qiaoxl. Any help on this, @rsepassi?
Try calling `SubwordTextEncoder.build_from_generator` directly and pass `max_subtoken_length`. See the docstring there. We should find a number for Zh problems that controls memory but still produces good results.
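For anyone trying this, a rough sketch of what that call might look like (the corpus path and vocab filename are placeholders, and `max_subtoken_length=4` is only an illustration, not a recommended value):

```python
# Sketch: pre-build the Chinese subword vocab with a capped subtoken length,
# then store it where t2t-datagen will pick it up instead of rebuilding it.
from tensor2tensor.data_generators import text_encoder

def zh_line_generator(path):
  """Yield raw text lines from a (placeholder) training corpus."""
  with open(path, encoding="utf-8") as f:
    for line in f:
      yield line.strip()

zh_vocab = text_encoder.SubwordTextEncoder.build_from_generator(
    zh_line_generator("train.zh"),   # placeholder corpus path
    2**15,                           # target vocab size, matching the 32k problem
    max_subtoken_length=4)           # caps candidate subtoken length -> caps memory

# Filename is an assumption; match whatever the problem class actually expects.
zh_vocab.store_to_file("t2t_data/vocab.translate_enzh_wmt32k.32768.zh")
```

As far as I understand the subword builder, it considers every substring of each whitespace-delimited token up to `max_subtoken_length`, so unspaced Chinese text (where one "token" can be an entire sentence) blows up memory unless that length is capped.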
@rsepassi Can I use SentencePiece to generate the vocabulary and keep it in the t2t_data folder? In that case, can I simply run t2t-trainer, or are other changes needed? And if I pass some value for `max_subtoken_length` instead of None, is there an ideal value for Chinese?
I would also find it helpful if there were a suggested max_subtoken_length value. I haven't hit a memory issue yet, but t2t-datagen is taking a long time to run. How long did it take for you? @hpulfc @xuekun90 @robotzheng
It seems encode() in data_generators/tokenizer.py doesn't support Chinese: it cannot tokenize a Chinese sentence. The attached patch could make it support character-based tokenization.
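For illustration only (this is not the attached patch itself), the character-based idea amounts to splitting CJK characters into individual tokens before the subword step, so an unspaced Chinese sentence is not treated as one giant token. The CJK range below is an assumption and only covers the basic block:

```python
# Illustrative sketch: space out CJK characters so each becomes its own token.
import re

# Rough CJK range (basic block only); extend as needed for full coverage.
_CJK = re.compile(u"([\u4e00-\u9fff])")

def char_tokenize_zh(text):
  """Insert spaces around CJK characters, then split on whitespace."""
  spaced = _CJK.sub(r" \1 ", text)
  return spaced.split()

print(char_tokenize_zh(u"我爱机器翻译 model"))
# ['我', '爱', '机', '器', '翻', '译', 'model']
```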
> I would also find it helpful if there were a suggested max_subtoken_length value. I haven't hit a memory issue yet, but t2t-datagen is taking a long time to run. How long did it take for you? @hpulfc @xuekun90 @robotzheng
@echan00 did you ever find a suggested 'max_subtoken_length' value? The default of 200 seems to be very large.
Hello all, any updates on this? I also ran into massive memory use when trying to run t2t-datagen for translate_enzh_wmt32k.
https://github.com/tensorflow/tensor2tensor/issues/855#issuecomment-473559816

> It seems encode() in data_generators/tokenizer.py doesn't support Chinese: it cannot tokenize a Chinese sentence. The attached patch could make it support character-based tokenization.

It works for me. Thanks!
Has nobody gotten a MemoryError? Problem: translate_enzh_wmt32k. When generating data with the toyset (220k lines), it takes at most about 12 GB of memory. But with the whole dataset (about 24M lines), generating data takes a huge amount of memory and I get a MemoryError (50 GB is really not enough). Has anybody tried this? How could I fix it?