kefirski / pytorch_RVAE

Recurrent Variational Autoencoder that generates sequential data, implemented with PyTorch
MIT License

train.py memory problem #1

Open transfluxus opened 7 years ago

transfluxus commented 7 years ago

Is there a way to use a word embedding generated with something else (gensim, for example)? This implementation dies after a while on my relatively large dataset (with 32 GB of memory).

kefirski commented 7 years ago

What do you mean by "dies after a while"? There are no restrictions on the nature of the word embeddings: you just have to save them in the appropriate file and the Embedding module will pick them up.
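For reference, a minimal sketch (not the repo's own code) of how one might export gensim vectors into an `.npy` matrix that an Embedding layer can load. The `word_embeddings.npy` name comes from the file list later in this thread; the layout of `words_vocab.pkl` as an ordered list of words is an assumption, and the model path is hypothetical:

```python
import pickle

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")  # hypothetical path

# Assumption: words_vocab.pkl holds an ordered list of vocabulary words.
with open("words_vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

dim = model.wv.vector_size
embeddings = np.zeros((len(vocab), dim), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in model.wv:
        embeddings[i] = model.wv[word]
    else:
        # small random init for words gensim never saw
        embeddings[i] = np.random.normal(0.0, 0.1, dim)

np.save("word_embeddings.npy", embeddings)
```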

transfluxus commented 7 years ago

It says 'Killed' after 20 minutes at most.

transfluxus commented 7 years ago

Training outputs several files: characters_vocab.pkl, train_character_tensor.npy, train_word_tensor.npy, valid_word_tensor.npy, words_vocab.pkl, valid_character_tensor.npy, and word_embeddings.npy. Which of these do I need for the next steps?

xushenkun commented 7 years ago

I think "dies after a while" is because the seq_len is too long. I have encountered this sometimes and it's alright after I reduced the length of each corpus sentence.

transfluxus commented 7 years ago

Interesting. It's been a while, so I don't remember whether I used single sentences or whole documents as one "sentence". I guess I used sentences, so how would I chop them?

xushenkun commented 7 years ago

@transfluxus I used a Chinese corpus, and each sentence had to be under 300 words or it crashed. For an English corpus it should probably be under 1000 words. I just split the sentences wherever there was a comma or a full stop.
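For what it's worth, a minimal sketch of that splitting heuristic, assuming a whitespace-tokenised corpus (so it fits English text; Chinese would need its own tokeniser). The function name and `max_len` are made up for illustration:

```python
import re


def split_sentence(line, max_len=300):
    """Split a line at commas/full stops, then hard-cap each chunk's token count."""
    chunks = re.split(r"[,.，。]", line)  # Western and Chinese punctuation
    out = []
    for chunk in chunks:
        tokens = chunk.split()
        # slice any chunk that is still longer than max_len tokens
        for i in range(0, len(tokens), max_len):
            piece = tokens[i:i + max_len]
            if piece:
                out.append(" ".join(piece))
    return out
```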

transfluxus commented 7 years ago

I limited the sentence length to 100 and it still doesn't run through. Actually, train_word_embedding already fails. Loading the whole corpus and then creating multiple representations of it is not really practical once your corpus has real size (4.2 million sentences in my case). It's gotta be streamed.
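A minimal sketch of what streaming the embedding step could look like, using gensim 4's Word2Vec API with a lazy iterator so the 4.2M sentences never sit in memory at once (gensim also ships gensim.models.word2vec.LineSentence for exactly this). The file names here are hypothetical:

```python
from gensim.models import Word2Vec


class LineSentences:
    """Iterate over a corpus file lazily, one whitespace-tokenised sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()


sentences = LineSentences("corpus.txt")  # re-iterable, never fully in RAM
model = Word2Vec(sentences=sentences, vector_size=100, workers=4)
model.save("word2vec.model")
```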