nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

Input data file is only 24 MB, but uses about 90 GB of memory while converting to unicode in the load_20newsgroups.py step #12

Closed lincarlos closed 5 years ago

lincarlos commented 5 years ago

I replaced the 20 newsgroups data file with my own data file, which is about 24 MB. When I ran load_20newsgroups.py, it used about 90 GB of memory. Why does this implementation use so much memory?

nateraw commented 5 years ago

Wow, that's a lot of memory. Are you talking about in-memory usage or storage size? That seems really high.

I will say, the implementation I've built on spaCy is computationally expensive. If we could figure out a smarter way to map spaCy hashes to embedding matrix index values, that would greatly reduce memory usage and preprocessing time. This is definitely the bottleneck in the preprocessing. I've found no way to do this online, so I hacked together my own approach in the nlppipe.py file.
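For what it's worth, here is a minimal sketch of one way that mapping could work using spaCy 2.x's own vector table (this is not the code in nlppipe.py; the model name and variable names are just illustrative):

```python
# Sketch only: spaCy keeps a lexeme-hash -> row mapping for its pretrained vectors,
# so tokens can be turned into embedding-matrix indexes with plain dict lookups.
import spacy

nlp = spacy.load("en_core_web_lg")
embed_matrix = nlp.vocab.vectors.data    # numpy array, shape (n_vectors, embed_dim)
hash_to_row = nlp.vocab.vectors.key2row  # spaCy hash -> row index into embed_matrix

doc = nlp("a small example sentence")
idxs = [hash_to_row[tok.orth] for tok in doc if tok.orth in hash_to_row]
```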

lincarlos commented 5 years ago

> Wow, that's a lot of memory. Are you talking about in-memory usage or storage size? That seems really high.
>
> I will say, the implementation I've built on spaCy is computationally expensive. If we could figure out a smarter way to map spaCy hashes to embedding matrix index values, that would greatly reduce memory usage and preprocessing time. This is definitely the bottleneck in the preprocessing. I've found no way to do this online, so I hacked together my own approach in the nlppipe.py file.

I found that you initialize a matrix of size n*m, where n is the document count and m is a hyperparameter; that matrix costs too much memory... And I have a question about lda2vec: does the preprocessing step generate the LDA and word2vec results, which are then combined to train the lda2vec model at run time? I read the paper, but did not understand where the LDA and word2vec results come from.

lincarlos commented 5 years ago

Is there any Spark implementation of lda2vec?

nateraw commented 5 years ago

No, there is no Spark implementation. Not sure what you mean by m; m is the embedding size (as far as I understand from what you said), and there's no way to make it smaller. The thing that might affect lda2vec is the preprocessing. I don't know how to program in Spark, but I am open to learning it if it applies.

lincarlos commented 5 years ago

> No, there is no Spark implementation. Not sure what you mean by m; m is the embedding size (as far as I understand from what you said), and there's no way to make it smaller. The thing that might affect lda2vec is the preprocessing. I don't know how to program in Spark, but I am open to learning it if it applies.

Yeah, m is the embedding size. Does the preprocessing step generate the LDA and word2vec results, which are then combined to train the lda2vec model at run time? I read the paper, but did not understand where the LDA and word2vec results come from. Does the preprocessing step generate the w2v and raw LDA results?

dbl001 commented 5 years ago

Try reducing the batch size from 10,000 to 500 in the tokenizer() method of nlppipe.py, on line #184. E.g.:

```python
for row, doc in enumerate(self.nlp.pipe(self.texts, n_threads=self.num_threads, batch_size=500)):
```

nateraw commented 5 years ago

@dbl001 thank you!!! Totally forgot about that.

dbl001 commented 5 years ago

At batch_size=500, my test file of news stories consumes > 8 GB of DRAM.

```
$ ls -l stories.txt
-rw-r--r--  1 davidlaxer  staff  68793151 Nov  8 08:48 stories.txt
$ wc stories.txt
    5172 10169765 68793151 stories.txt
```

nateraw commented 5 years ago

@dbl001 Do you think we can (or should) make any changes to reduce the amount of memory consumed further? I can't tell if that amount (8GB) is an issue.

dbl001 commented 5 years ago

Investigating …

```
$ python load_stories.py
Using TensorFlow backend.
It took 13812.14863204956 seconds to run tokenizer method
```

dbl001 commented 5 years ago

Converting to skipgrams used ~18 GB of DRAM.

```
converting to skipgrams
step 0 of 5172
step 500 of 5172
step 1000 of 5172
step 1500 of 5172
step 2000 of 5172
step 2500 of 5172
step 3000 of 5172
step 3500 of 5172
step 4000 of 5172
step 4500 of 5172
step 5000 of 5172
```

schneeLee commented 5 years ago

@dbl001 @nateraw I changed the batch size to 500 in nlppipe.py on line #184 and got a "KeyError", so I changed the batch size to 500 on line #282 too, but I still got the same error. The error message is shown below:

```
<class 'KeyError'> nlppipe.py 199
Traceback (most recent call last):
  File "E:/python/Lda2vec-Tensorflow/tests/twenty_newsgroups/load_20newsgroups.py", line 21, in <module>
    utils.run_preprocessing(texts, data_dir, run_name, bad=bad, max_length=10000)
  File "E:\python\Lda2vec-Tensorflow\lda2vec\utils.py", line 149, in run_preprocessing
    token_type=token_type, vectors=vectors, merge=merge)
  File "E:\python\Lda2vec-Tensorflow\lda2vec\nlppipe.py", line 103, in __init__
    self.tokenize()
  File "E:\python\Lda2vec-Tensorflow\lda2vec\nlppipe.py", line 184, in tokenize
    for row, doc in enumerate(self.nlp.pipe(self.texts, n_threads=self.num_threads, batch_size=2000)):
  File "D:\Program Files\python\lib\site-packages\spacy\language.py", line 557, in pipe
    recentstrings.add(word.text)
  File "token.pyx", line 171, in spacy.tokens.token.Token.text.__get__
  File "token.pyx", line 677, in spacy.tokens.token.Token.orth_.__get__
  File "strings.pyx", line 116, in spacy.strings.StringStore.__getitem__
KeyError: 16483550085100257219
```

How can I fix it?

nateraw commented 5 years ago

@schneeLee the thing you need to change is the batch_size parameter of the nlp.pipe call within nlppipe.py. I need to update this so it's more easily changed.
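Something along these lines is what I mean (just a sketch; the parameter name and default are not in the current code):

```python
# Hypothetical: expose batch_size as an argument instead of hard-coding it
def tokenize(self, batch_size=500):
    for row, doc in enumerate(self.nlp.pipe(self.texts,
                                            n_threads=self.num_threads,
                                            batch_size=batch_size)):
        ...  # existing per-document processing stays the same
```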

dbl001 commented 5 years ago

You might also consider adjusting ‘write_every’ (e.g. - write_every=100) when invoking utils.run_preprocessing() if you have a large input file.
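For example, mirroring the run_preprocessing() call from the traceback above (the value 100 is just an illustration):

```python
# Flush cached skipgram rows to disk every 100 documents instead of every 10,000
utils.run_preprocessing(texts, data_dir, run_name,
                        bad=bad, max_length=10000,
                        write_every=100)
```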

nateraw commented 5 years ago

@dbl001 you know my code better than me! I wonder if there is a better way to write that section too. The "write_every" fix was a quick and dirty solution, I feel like.

dbl001 commented 5 years ago

Collecting all the data into a list and writing the entire list out at the end kept running out of memory, even after replacing the DataFrame with plain file I/O, e.g.:

```python
import csv

# newline='' is what csv.writer expects for text-mode files on Python 3
with open(file_out_path + "/skipgrams.txt", 'w', newline='') as myfile:
    wr = csv.writer(myfile, delimiter="\t", quoting=csv.QUOTE_ALL)
    wr.writerows(data)  # data: the full list of skipgram rows, written in one shot
```
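The alternative is to append to the file in chunks as the rows are generated. A rough sketch, assuming the skipgram rows can be produced per document (the chunk size and the generator name are made up):

```python
import csv

CHUNK_DOCS = 100  # illustrative; plays the same role as write_every

with open(file_out_path + "/skipgrams.txt", "w", newline="") as myfile:
    wr = csv.writer(myfile, delimiter="\t", quoting=csv.QUOTE_ALL)
    buffer = []
    for doc_idx, rows in enumerate(rows_per_document()):  # hypothetical generator
        buffer.extend(rows)
        if (doc_idx + 1) % CHUNK_DOCS == 0:
            wr.writerows(buffer)  # flush this chunk and free the memory
            buffer = []
    if buffer:
        wr.writerows(buffer)      # flush whatever is left
```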

nateraw commented 5 years ago

Looks good to me, I'll test when I get the chance. Feel free to submit a pull request if you want :)

dbl001 commented 5 years ago

The code above ran out of memory on a 5,172-line input file containing about 10 million words. So, the incremental write is probably best. Is creating a DataFrame to output the CSV less efficient than doing Python file I/O? Probably, but it may not make a noticeable difference.

```
$ wc data/stories.txt
    5172 10128815 68793151 data/stories.txt
```

I’m currently running line profiling on utils.run_preprocessing(). Stay tuned.

dbl001 commented 5 years ago

```
(ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$ cat profile.out
Timer unit: 1e-06 s

Total time: 87014.2 s
File: /Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/utils.py
Function: run_preprocessing at line 91

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    91                                           @profile
    92                                           def run_preprocessing(texts, data_dir, run_name, min_freq_threshold=10,
    93                                                                 max_length=100, bad=[], vectors="en_core_web_lg",
    94                                                                 num_threads=2, token_type="lemma", only_keep_alpha=False,
    95                                                                 write_every=10000, merge=False):
    96                                               """This function abstracts the rest of the preprocessing needed
    97                                               to run Lda2Vec in conjunction with the NlpPipeline
    98
    99                                               Parameters
   100                                               ----------
   101                                               texts : TYPE
   102                                                   Python list of text
   103                                               data_dir : TYPE
   104                                                   directory where your data is held
   105                                               run_name : TYPE
   106                                                   Name of sub-directory to be created that will hold preprocessed data
   107                                               min_freq_threshold : int, optional
   108                                                   If words occur less frequently than this threshold, then purge them from the docs
   109                                               max_length : int, optional
   110                                                   Length to pad/cut off sequences
   111                                               bad : list, optional
   112                                                   List or Set of words to filter out of dataset
   113                                               vectors : str, optional
   114                                                   Name of vectors to load from spacy (Ex. "en", "en_core_web_sm")
   115                                               num_threads : int, optional
   116                                                   Number of threads used in spacy pipeline
   117                                               token_type : str, optional
   118                                                   Type of tokens to keep (Options: "lemma", "lower", "orth")
   119                                               only_keep_alpha : bool, optional
   120                                                   Only keep alpha characters
   121                                               write_every : int, optional
   122                                                   Number of documents' data to store before writing cache to skipgrams file
   123                                               merge : bool, optional
   124                                                   Merge noun phrases or not
   125                                               """
   126
   127         1      15572.0  15572.0      0.0      def clean(line):
   128                                                   return ' '.join(w for w in line.split() if not any(t in w for t in bad))
   129
   130                                               # Location for preprocessed data to be stored
   131         1          8.0      8.0      0.0      file_out_path = data_dir + "/" + run_name
   132
   133         1       2904.0   2904.0      0.0      if not os.path.exists(file_out_path):
   134
   135                                                   # Make directory to save data in
   136         1        198.0    198.0      0.0          os.makedirs(file_out_path)
   137
   138                                               # Remove tokens with these substrings
   139         1          9.0      9.0      0.0      bad = set(bad)
   140
   141                                               # Preprocess data
   142
   143                                               # Convert to unicode (spaCy only works with unicode)
   144         1   74906607.0 74906607.0      0.1      texts = [str(clean(d)) for d in texts]
   145
   146                                               # Process the text, no file because we are passing in data directly
   147         1         11.0     11.0      0.0      SP = NlpPipeline(None, max_length, texts=texts,
   148         1          5.0      5.0      0.0                       num_threads=num_threads, only_keep_alpha=only_keep_alpha,
   149         1 12700477964.0 12700477964.0   14.6                   token_type=token_type, vectors=vectors, merge=merge)
   150
   151                                               # Computes the embed matrix along with other variables
   152         1 61100298107.0 61100298107.0   70.2      SP._compute_embed_matrix()
   153
   154         1       6620.0   6620.0      0.0      print("converting data to w2v indexes")
   155                                               # Convert data to word2vec indexes
   156         1  256696370.0 256696370.0    0.3      SP.convert_data_to_word2vec_indexes()
   157
   158         1       2465.0   2465.0      0.0      print("trimming 0's")
   159                                               # Trim zeros from idx data
   160         1  293081773.0 293081773.0    0.3      SP.trim_zeros_from_idx_data()
   161
   162                                               # This extracts the length of each document (needed for pyldaviz)
   163         1      79761.0  79761.0      0.0      doc_lengths = [len(x) for x in SP.idx_data]
   164
   165                                               # Find the cutoff idx
   166     33365     162720.0      4.9      0.0      for i, freq in enumerate(SP.freqs):
   167     33365     154817.0      4.6      0.0          if freq < min_freq_threshold:
   168         1          3.0      3.0      0.0              cutoff = i
   169         1          5.0      5.0      0.0              break
   170                                               # Then, cut off the embed matrix
   171         1         43.0     43.0      0.0      embed_matrix = SP.embed_matrix[:cutoff]
   172                                               # Also, replace all tokens below cutoff in idx_data
   173      5173      34349.0      6.6      0.0      for i in range(len(SP.idx_data)):
   174      5172    2035591.0    393.6      0.0          SP.idx_data[i][SP.idx_data[i] > cutoff - 1] = 0
   175                                               # Next, cut off the frequencies
   176         1       1562.0   1562.0      0.0      freqs = SP.freqs[:cutoff]
   177
   178         1         84.0     84.0      0.0      print("converting to skipgrams")
   179
   180         1          3.0      3.0      0.0      data = []
   181         1         31.0     31.0      0.0      num_examples = SP.idx_data.shape[0]
   182                                               # Sometimes docs can be less than the required amount for
   183                                               # the skipgram function. So, we must manually make a counter
   184                                               # instead of relying on the enumerated index (i)
   185         1          3.0      3.0      0.0      doc_id_counter = 0
   186                                               # Additionally, we will keep track of these lower level docs
   187                                               # and will purge them later
   188         1          3.0      3.0      0.0      purged_docs = []
   189      5173     679517.0    131.4      0.0      for i, t in enumerate(SP.idx_data):
   190      5172      30736.0      5.9      0.0          pairs, _ = skipgrams(t,
   191      5172      31826.0      6.2      0.0                               vocabulary_size=SP.vocab_size,
   192      5172      17912.0      3.5      0.0                               window_size=5,
   193      5172      15887.0      3.1      0.0                               shuffle=True,
   194      5172 5047109332.0 975852.5      5.8                               negative_samples=0)
   195                                                   # Pairs will be 0 if document is less than 2 indexes
   196      5172      76872.0     14.9      0.0          if len(pairs) > 2:
   197 109260975  444886451.0      4.1      0.5              for pair in pairs:
   198 109255804  401964381.0      3.7      0.5                  temp_data = pair
   199                                                           # Appends doc ID
   200 109255804  543955438.0      5.0      0.6                  temp_data.append(doc_id_counter)
   201                                                           # Appends document index
   202 109255804  422388218.0      3.9      0.5                  temp_data.append(i)
   203 109255804  436079510.0      4.0      0.5                  data.append(temp_data)
   204      5171      27482.0      5.3      0.0              doc_id_counter += 1
   205                                                   else:
   206         1          4.0      4.0      0.0              purged_docs.append(i)
   207      5172      22056.0      4.3      0.0          if i // write_every:
   208                                                       temp_df = pd.DataFrame(data)
   209                                                       temp_df.to_csv(file_out_path + "/skipgrams.txt", sep="\t", index=False, header=None, mode="a")
   210                                                       del temp_df
   211                                                       data = []
   212      5172      22860.0      4.4      0.0          if i % 500 == 0:
   213        11     355847.0  32349.7      0.0              print("step", i, "of", num_examples)
   214         1 4352734484.0 4352734484.0    5.0      temp_df = pd.DataFrame(data)
   215         1  925359541.0 925359541.0    1.1      temp_df.to_csv(file_out_path + "/skipgrams.txt", sep="\t", index=False, header=None, mode="a")
   216         1     494780.0 494780.0      0.0      del temp_df
   217
   218                                               # Save embed matrix
   219         1    8285844.0 8285844.0      0.0      np.save(file_out_path + "/embed_matrix", embed_matrix)
   220                                               # Save the doc lengths to be used later, also, purge those that didnt make it into skipgram function
   221         1     124899.0 124899.0      0.0      np.save(file_out_path + "/doc_lengths", np.delete(doc_lengths, np.array(purged_docs)))
   222                                               # Save frequencies to file
   223         1     197430.0 197430.0      0.0      np.save(file_out_path + "/freqs", freqs)
   224                                               # Save vocabulary dictionaries to file
   225         1        229.0    229.0      0.0      idx_to_word_out = open(file_out_path + "/" + "idx_to_word.pickle", "wb")
   226         1    1024778.0 1024778.0      0.0      pickle.dump(SP.idx_to_word, idx_to_word_out)
   227         1      76339.0  76339.0      0.0      idx_to_word_out.close()
   228         1        235.0    235.0      0.0      word_to_idx_out = open(file_out_path + "/" + "word_to_idx.pickle", "wb")
   229         1     202652.0 202652.0      0.0      pickle.dump(SP.word_to_idx, word_to_idx_out)
   230         1      92133.0  92133.0      0.0      word_to_idx_out.close()

(ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$
```

schneeLee commented 5 years ago

After I changed the batch_size parameter of the nlp.pipe function to 500, I still got the "KeyError" I mentioned before. But when I cut the 20_newsgroups corpus down to a smaller size, it works. Is there anything special in the corpus or in the code?

dbl001 commented 5 years ago

Is this self.vocabulary, computed in the function _compute_embed_matrix() in nlppipe.py, being used?

```python
# Append word onto unique vocabulary list
self.vocabulary = np.append(self.vocabulary, word)
```

It’s taking 98.5% of the time in _compute_embed_matrix(), e.g.:

```
   506    150836  68137415647.0 451731.8     98.5          self.vocabulary = np.append(self.vocabulary, word)
```

NumPy appends require a physical copy (because arrays use contiguous memory). If self.vocabulary needs to be a unique list, appending to a Python list is faster and more efficient; once the unique list is built, you can convert it to a NumPy array.
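A minimal sketch of that change (variable and loop names assumed, not the repo's code):

```python
import numpy as np

# Build the unique vocabulary with a set + Python list (amortized O(1) appends),
# then convert to a NumPy array once at the end. np.append copies the whole
# array on every call, which is what makes the current loop so slow.
vocab_list = []
seen = set()
for word in tokens:  # tokens: whatever iterable of token strings the pipeline produces
    if word not in seen:
        seen.add(word)
        vocab_list.append(word)
vocabulary = np.array(vocab_list)  # assign to self.vocabulary at the end
```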

schneeLee commented 5 years ago

@dbl001 thanks for the help. I forgot to add "texts" in the first line of the corpus. I have some other questions: how can I get the top N topics for each doc and the top N words for each topic after training is finished? And how can I predict the topics of unseen docs?

nateraw commented 5 years ago

Working on fixing this here

hassant4 commented 5 years ago

> @dbl001 @nateraw I changed the batch size to 500 in nlppipe.py on line #184 and got a "KeyError", so I changed the batch size to 500 on line #282 too, but I still got the same error.

Did you manage to find a solution? Same KeyError.

nateraw commented 5 years ago

Pushing new preprocessing today. Looking back, I have no idea why I made it so complicated. Much easier to understand version coming shortly.

nateraw commented 5 years ago

Updated. Closing this issue.