tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Question: out of memory when generating translate_enzh_wmt32k data set #855

Open Qiaoxl opened 6 years ago

Qiaoxl commented 6 years ago

Has nobody else hit a MemoryError? Problem: translate_enzh_wmt32k. When generating data with the toy set (220k lines), it takes at most about 12 GB of memory. But with the whole dataset (about 24M lines), data generation takes far more memory and I get a MemoryError (50 GB is not enough). Has anybody tried this? How can I fix it?

houyu0930 commented 6 years ago

Hi, I have a different question about Chinese-English here. To train a Chinese-to-English model, I used "--problem=translate_enzh_wmt32k_rev" / "--problem=translate_enzh_wmt8k_rev" to generate data, but I got an error telling me I should specify a valid "problem".

Have you run into this issue?

martinpopel commented 6 years ago

@houyu0930: For t2t-datagen, don't use _rev in problem names. (This is not related to the topic of this issue, which is about OOM.)

houyu0930 commented 6 years ago

@martinpopel OK, I had the wrong idea about this. Thank you for your answer, I get it now. (Sorry for posting an unrelated question here; I will be more careful next time.)

As for this topic: I actually generated data for "problem=translate_enzh_wmt32k" yesterday, and no error occurred; everything looks fine on my side. Are you having trouble generating the data? I don't know what you mean by "50G"; in my case, the total size of the generated data is 33M. Sorry I can't be of more help. :(

Qiaoxl commented 6 years ago

@houyu0930 See translate_enzh.py: "This is far from being the real WMT17 task - only toyset here you need to register to get UN data and CWT data..." The default dataset is only 220k lines; training on that won't give good results. The full dataset includes the UN and CWT data, about 24,000k lines in total. Generating with 220k lines takes at most about 12 GB of memory. Generating with 24,000k lines, I have 50 GB of memory and it is still not enough.

martinpopel commented 6 years ago

@Qiaoxl: This is strange; t2t-datagen should use only a little memory even for very big data (shuffling uses 100 data shards by default). Can you report your T2T (and TF and Python) versions? Can you retry with the newest version? Can you try generating another small translation dataset?

Qiaoxl commented 6 years ago

@martinpopel T2T: 1.6.2, TF: 1.7.0, Python: 3.6.5. I haven't tried the newest version, but I have used t2t-datagen to generate the librispeech_clean_small dataset, and it went well with only a little memory. Maybe you can check the details in the log file gen_20180605_155220.log

rsepassi commented 6 years ago

Tagging this as a question for now, but if we find that something is indeed wrong, I'll change it to a bug.

robotzheng commented 6 years ago

Me too. T2T: 1.6.2, TF: 1.7.0, Python: 3.5.3. output_gen.txt

xuekun90 commented 6 years ago

Me too. T2T: 1.6.6, TF: 1.8.0, Python: 3.6.3

hpulfc commented 6 years ago

@rsepassi Hi,

When I use a huge Chinese dataset to generate data, it uses a lot of memory, but if I put the vocab files in the data folder beforehand, it is fine (see the sketch below). So I think the subword tokenizer may need some improvements; it is what costs all the memory.
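A rough sketch of that workaround, based only on this comment: if the subword vocab files already exist in --data_dir, data generation should reuse them instead of rebuilding them from the full corpus (which is where the memory goes). The source_vocab_name/target_vocab_name attributes below are assumed names, so verify them against translate_enzh.py in your installed T2T version.

```python
# Hedged sketch: discover which vocab filenames the problem expects in
# --data_dir, then place pre-built vocab files there before running
# t2t-datagen. NOTE: source_vocab_name / target_vocab_name are assumed
# attribute names; check translate_enzh.py in your T2T version.
from tensor2tensor import problems  # noqa: F401 (registers all problems)
from tensor2tensor.utils import registry

problem = registry.problem("translate_enzh_wmt32k")
print(problem.source_vocab_name)  # expected filename of the English vocab
print(problem.target_vocab_name)  # expected filename of the Chinese vocab
```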

sugeeth14 commented 6 years ago

I'm also facing the same issue with T2T 1.6.3, and getting the same output as in the log posted by @Qiaoxl. Any help on this, @rsepassi?

rsepassi commented 6 years ago

Try calling SubwordTextEncoder.build_from_generator directly and passing max_subtoken_length; see the docstring there. We should find a value for Zh problems that keeps memory under control but still produces good results.
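For reference, a hedged sketch of what such a call might look like; the corpus path, output filename, and the value 4 for max_subtoken_length are placeholders, not recommended settings:

```python
# Build the Chinese subword vocab directly, bounding max_subtoken_length to
# cap the memory used while the vocabulary is constructed.
from tensor2tensor.data_generators import text_encoder

def zh_lines(path):
  # Yield raw training lines; build_from_generator counts tokens from these.
  with open(path, "r", encoding="utf-8") as f:
    for line in f:
      yield line.strip()

zh_encoder = text_encoder.SubwordTextEncoder.build_from_generator(
    zh_lines("train.zh"),   # placeholder path to the Chinese side of the corpus
    2**15,                  # target vocab size (~32k, as in the problem name)
    max_subtoken_length=4)  # placeholder value; see the docstring for trade-offs
zh_encoder.store_to_file("vocab.zh.subwords")  # placeholder output filename
```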

sugeeth14 commented 6 years ago

@rsepassi Can I use SentencePiece to generate the vocabulary and keep it in the t2t_data folder? In that case, can I simply run t2t_trainer, or are other changes needed? And if I pass some value for max_subtoken_length instead of "None", is there an ideal value for Chinese?

echan00 commented 6 years ago

I would also find it helpful if there were a suggested 'max_subtoken_length' value.

I haven't hit a memory issue yet, but t2t-datagen is taking a long time to run. How long did it take you? @hpulfc @xuekun90 @robotzheng

torshie commented 5 years ago

It seems that encode() in data_generators/tokenizer.py doesn't support Chinese: it cannot tokenize a Chinese sentence.

The attached patch adds support for character-based tokenization.

cjk.txt
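For readers who can't open the attachment, here is a minimal, self-contained sketch of the general idea (character-level splitting of CJK text), not the patch itself; the regex covers only the common CJK Unified Ideographs block, for illustration.

```python
# Rough illustration of character-based tokenization for CJK text: emit each
# CJK character as its own token so a spaceless Chinese sentence does not
# become one enormous "word".
import re

_CJK = re.compile(u"([\u4e00-\u9fff])")  # common CJK Unified Ideographs only

def cjk_aware_tokenize(text):
  tokens = []
  for piece in _CJK.split(text):
    if not piece:
      continue
    if _CJK.match(piece):
      tokens.append(piece)          # one token per CJK character
    else:
      tokens.extend(piece.split())  # whitespace-split the non-CJK runs
  return tokens

# cjk_aware_tokenize(u"我爱 machine translation")
# -> ['我', '爱', 'machine', 'translation']
```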

Santosh-Gupta commented 5 years ago

I would also find it helpful if there were a suggested 'max_subtoken_length' value.

I haven't hit a memory issue yet, but t2t-datagen is taking a long time to run. How long did it take you? @hpulfc @xuekun90 @robotzheng

@echan00 did you ever find a suggested 'max_subtoken_length' value? The default of 200 seems very large.

timxzz commented 5 years ago

Hello all, any updates on this? I also ran into massive memory use when trying to run t2t-datagen for translate_enzh_wmt32k.

qpzhao commented 4 years ago

https://github.com/tensorflow/tensor2tensor/issues/855#issuecomment-473559816

It seems that encode() in data_generators/tokenizer.py doesn't support Chinese: it cannot tokenize a Chinese sentence.

The attached patch adds support for character-based tokenization.

cjk.txt

It works for me. Thanks!