Closed lincarlos closed 5 years ago
Wow, that's a lot of memory. Are you talking about in-memory usage or storage size? That seems really high.
I will say, the implementation I've used in Spacy is computationally expensive. If we could figure out a smarter way to map Spacy hashes to embedding matrix index values, that would greatly reduce memory usage and preprocessing time. This is definitely the bottleneck in the preprocessing. I've found no way to do this online, so I hacked together my own way in the nlppipe.py file.
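One possible direction for that mapping (a sketch only, not tested against nlppipe.py): build the hash-to-row dictionary and the embedding matrix in a single pass over the vocabulary, so each token lookup afterward is O(1). The `vocab_items` list here is a hypothetical stand-in; with spaCy, these pairs would come from iterating `nlp.vocab` (`word.orth` is the hash, `word.vector` the row).

```python
import numpy as np

# Hypothetical stand-in for spaCy's (hash, vector) pairs; in practice these
# would come from nlp.vocab (word.orth is the hash, word.vector the vector).
vocab_items = [(8566208034543834098, np.ones(3)),
               (1292078113972184607, np.zeros(3))]

# Build the hash -> row-index map and the embedding matrix in one pass,
# instead of searching the matrix for each hash later.
hash_to_idx = {}
rows = []
for h, vec in vocab_items:
    hash_to_idx[h] = len(rows)
    rows.append(vec)
embed_matrix = np.stack(rows)

idx = hash_to_idx[8566208034543834098]  # O(1) hash -> matrix row lookup
```

The dictionary trades a little memory for constant-time lookups, which should help the preprocessing bottleneck described above.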
I found that you initialize the matrix with size n*m, where n is the document size and m is a hyperparameter; that matrix costs too much memory... And I have a question about lda2vec: does the preprocessing step generate the LDA and word2vec results, which are then combined to train the lda2vec model at run time? I read the paper, but did not understand where the LDA and word2vec results come from.
Is there any Spark implementation of lda2vec?
No, there is no Spark implementation. Not sure what you mean by m; if m is the embedding size (as far as I understand from what you said), there is no way to make it smaller. The thing that might affect lda2vec is preprocessing. I don't know how to program in Spark, but I am open to learning it if it applies.
Yeah, m is the embedding size. Is the preprocessing step generating the raw LDA and word2vec results, which are then combined to train the lda2vec model at run time? I read the paper, but did not understand where the LDA and word2vec results come from.
Try reducing batch size from 10,000 to 500 in nlppipe.py in the tokenizer() method on line #184:
E.g. -
for row, doc in enumerate(self.nlp.pipe(self.texts, n_threads=self.num_threads, batch_size=500)):
@dbl001 thank you!!! Totally forgot about that.
At batch_size=500, my test file of news stories consumes > 8 GB of DRAM.
$ ls -l stories.txt
-rw-r--r--  1 davidlaxer  staff  68793151 Nov  8 08:48 stories.txt
$ wc stories.txt
    5172 10169765 68793151 stories.txt
@dbl001 Do you think we can (or should) make any changes to reduce the amount of memory consumed further? I can't tell if that amount (8GB) is an issue.
Investigating …
$ python load_stories.py
Using TensorFlow backend.
It took 13812.14863204956 seconds to run tokenizer method
Converting to skip grams used ~18 GB of DRAM.

converting to skipgrams
step 0 of 5172
step 500 of 5172
step 1000 of 5172
step 1500 of 5172
step 2000 of 5172
step 2500 of 5172
step 3000 of 5172
step 3500 of 5172
step 4000 of 5172
step 4500 of 5172
step 5000 of 5172
@dbl001 @nateraw I changed the batch size to 500 in nlppipe.py on line #184 and got a "KeyError", so I changed the batch size to 500 on line #282 too, but I still got the same error. The error message is shown below:
<class 'KeyError'> nlppipe.py 199
Traceback (most recent call last):
File "E:/python/Lda2vec-Tensorflow/tests/twenty_newsgroups/load_20newsgroups.py", line 21, in
how can I fix it?
@schneeLee the line you need to change is the batch size parameter of the nlp.pipe function within nlppipe.py. I need to update this to be more easily changed.
You might also consider adjusting ‘write_every’ (e.g. - write_every=100) when invoking utils.run_preprocessing() if you have a large input file.
@dbl001 you know my code better than me! I wonder if there is a better way to write that section too. The "write_every" fix was a quick and dirty solution, I feel like.
Collecting all the data into a list and outputting the entire list kept running out of memory, even after replacing the data frame with direct file I/O, e.g.:

import csv

# Python 3: open in text mode with newline='' for the csv module
with open(file_out_path + "/skipgrams.txt", 'w', newline='') as myfile:
    wr = csv.writer(myfile, delimiter="\t", quoting=csv.QUOTE_ALL)
    wr.writerows(data)  # one row per skipgram entry
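An alternative that avoids accumulating the full `data` list at all (a sketch, untested against utils.py) is to stream each (target, context, doc_id, doc_idx) row straight to a csv writer as it is produced, so the skipgram list never sits in memory. `write_skipgrams` is a hypothetical helper:

```python
import csv
import os
import tempfile

# Hypothetical streaming writer: emit each row as it is generated instead of
# building one giant in-memory list and dumping it at the end.
def write_skipgrams(rows, path):
    with open(path, "w", newline="") as f:
        wr = csv.writer(f, delimiter="\t")
        for row in rows:
            wr.writerow(row)

path = os.path.join(tempfile.gettempdir(), "skipgrams.txt")
# A generator keeps peak memory at one row, whatever the corpus size.
pairs = ((t, c, 0, i) for i, (t, c) in enumerate([(1, 2), (2, 1), (3, 4)]))
write_skipgrams(pairs, path)
```

This removes the need for the `write_every` chunking entirely, since nothing accumulates between writes.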
Looks good to me, I'll test when I get the chance. Feel free to submit pull request if you want :)
The code below ran out of memory on a 5,000-line input file (about 10 million words).
So the incremental write is probably best. Is creating a data frame to output the CSV less efficient than doing Python file I/O?
Probably, but it may not make a noticeable difference.
$ wc data/stories.txt
    5172 10128815 68793151 data/stories.txt
I’m currently running line profiling on utils.run_preprocessing(). Stay tuned.
(ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$ cat profile.out Timer unit: 1e-06 s
Total time: 87014.2 s File: /Users/davidlaxer/Lda2vec-Tensorflow/lda2vec/utils.py Function: run_preprocessing at line 91
91 @profile
92 def run_preprocessing(texts, data_dir, run_name, min_freq_threshold=10,
93 max_length=100, bad=[], vectors="en_core_web_lg",
94 num_threads=2, token_type="lemma", only_keep_alpha=False,
95 write_every=10000, merge=False):
96 """This function abstracts the rest of the preprocessing needed
97 to run Lda2Vec in conjunction with the NlpPipeline
98
99 Parameters
100 ----------
101 texts : TYPE
102 Python list of text
103 data_dir : TYPE
104 directory where your data is held
105 run_name : TYPE
106 Name of sub-directory to be created that will hold preprocessed data
107 min_freq_threshold : int, optional
108 If words occur less frequently than this threshold, then purge them from the docs
109 max_length : int, optional
110 Length to pad/cut off sequences
111 bad : list, optional
112 List or Set of words to filter out of dataset
113 vectors : str, optional
114 Name of vectors to load from spacy (Ex. "en", "en_core_web_sm")
115 num_threads : int, optional
116 Number of threads used in spacy pipeline
117 token_type : str, optional
118 Type of tokens to keep (Options: "lemma", "lower", "orth")
119 only_keep_alpha : bool, optional
120 Only keep alpha characters
121 write_every : int, optional
122 Number of documents' data to store before writing cache to skipgrams file
123 merge : bool, optional
124 Merge noun phrases or not
125 """
126
127 1 15572.0 15572.0 0.0 def clean(line):
128 return ' '.join(w for w in line.split() if not any(t in w for t in bad))
129
130 # Location for preprocessed data to be stored
131 1 8.0 8.0 0.0 file_out_path = data_dir + "/" + run_name
132
133 1 2904.0 2904.0 0.0 if not os.path.exists(file_out_path):
134
135 # Make directory to save data in
136 1 198.0 198.0 0.0 os.makedirs(file_out_path)
137
138 # Remove tokens with these substrings
139 1 9.0 9.0 0.0 bad = set(bad)
140
141 # Preprocess data
142
143 # Convert to unicode (spaCy only works with unicode)
144 1 74906607.0 74906607.0 0.1 texts = [str(clean(d)) for d in texts]
145
146 # Process the text, no file because we are passing in data directly
147 1 11.0 11.0 0.0 SP = NlpPipeline(None, max_length, texts=texts,
148 1 5.0 5.0 0.0 num_threads=num_threads, only_keep_alpha=only_keep_alpha,
149 1 12700477964.0 12700477964.0 14.6 token_type=token_type, vectors=vectors, merge=merge)
150
151 # Computes the embed matrix along with other variables
152 1 61100298107.0 61100298107.0 70.2 SP._compute_embed_matrix()
153
154 1 6620.0 6620.0 0.0 print("converting data to w2v indexes")
155 # Convert data to word2vec indexes
156 1 256696370.0 256696370.0 0.3 SP.convert_data_to_word2vec_indexes()
157
158 1 2465.0 2465.0 0.0 print("trimming 0's")
159 # Trim zeros from idx data
160 1 293081773.0 293081773.0 0.3 SP.trim_zeros_from_idx_data()
161
162 # This extracts the length of each document (needed for pyldaviz)
163 1 79761.0 79761.0 0.0 doc_lengths = [len(x) for x in SP.idx_data]
164
165 # Find the cutoff idx
166 33365 162720.0 4.9 0.0 for i, freq in enumerate(SP.freqs):
167 33365 154817.0 4.6 0.0 if freq < min_freq_threshold:
168 1 3.0 3.0 0.0 cutoff = i
169 1 5.0 5.0 0.0 break
170 # Then, cut off the embed matrix
171 1 43.0 43.0 0.0 embed_matrix = SP.embed_matrix[:cutoff]
172 # Also, replace all tokens below cutoff in idx_data
173 5173 34349.0 6.6 0.0 for i in range(len(SP.idx_data)):
174 5172 2035591.0 393.6 0.0 SP.idx_data[i][SP.idx_data[i] > cutoff - 1] = 0
175 # Next, cut off the frequencies
176 1 1562.0 1562.0 0.0 freqs = SP.freqs[:cutoff]
177
178 1 84.0 84.0 0.0 print("converting to skipgrams")
179
180 1 3.0 3.0 0.0 data = []
181 1 31.0 31.0 0.0 num_examples = SP.idx_data.shape[0]
182 # Sometimes docs can be less than the required amount for
183 # the skipgram function. So, we must manually make a counter
184 # instead of relying on the enumerated index (i)
185 1 3.0 3.0 0.0 doc_id_counter = 0
186 # Additionally, we will keep track of these lower level docs
187 # and will purge them later
188 1 3.0 3.0 0.0 purged_docs = []
189 5173 679517.0 131.4 0.0 for i, t in enumerate(SP.idx_data):
190 5172 30736.0 5.9 0.0 pairs, _ = skipgrams(t,
191 5172 31826.0 6.2 0.0 vocabulary_size=SP.vocab_size,
192 5172 17912.0 3.5 0.0 window_size=5,
193 5172 15887.0 3.1 0.0 shuffle=True,
194 5172 5047109332.0 975852.5 5.8 negative_samples=0)
195 # Pairs will be 0 if document is less than 2 indexes
196 5172 76872.0 14.9 0.0 if len(pairs) > 2:
197 109260975 444886451.0 4.1 0.5 for pair in pairs:
198 109255804 401964381.0 3.7 0.5 temp_data = pair
199 # Appends doc ID
200 109255804 543955438.0 5.0 0.6 temp_data.append(doc_id_counter)
201 # Appends document index
202 109255804 422388218.0 3.9 0.5 temp_data.append(i)
203 109255804 436079510.0 4.0 0.5 data.append(temp_data)
204 5171 27482.0 5.3 0.0 doc_id_counter += 1
205 else:
206 1 4.0 4.0 0.0 purged_docs.append(i)
207 5172 22056.0 4.3 0.0 if i // write_every:
208 temp_df = pd.DataFrame(data)
209 temp_df.to_csv(file_out_path + "/skipgrams.txt", sep="\t", index=False, header=None, mode="a")
210 del temp_df
211 data = []
212 5172 22860.0 4.4 0.0 if i % 500 == 0:
213 11 355847.0 32349.7 0.0 print("step", i, "of", num_examples)
214 1 4352734484.0 4352734484.0 5.0 temp_df = pd.DataFrame(data)
215 1 925359541.0 925359541.0 1.1 temp_df.to_csv(file_out_path + "/skipgrams.txt", sep="\t", index=False, header=None, mode="a")
216 1 494780.0 494780.0 0.0 del temp_df
217
218 # Save embed matrix
219 1 8285844.0 8285844.0 0.0 np.save(file_out_path + "/embed_matrix", embed_matrix)
220 # Save the doc lengths to be used later, also, purge those that didnt make it into skipgram function
221 1 124899.0 124899.0 0.0 np.save(file_out_path + "/doc_lengths", np.delete(doc_lengths, np.array(purged_docs)))
222 # Save frequencies to file
223 1 197430.0 197430.0 0.0 np.save(file_out_path + "/freqs", freqs)
224 # Save vocabulary dictionaries to file
225 1 229.0 229.0 0.0 idx_to_word_out = open(file_out_path + "/" + "idx_to_word.pickle", "wb")
226 1 1024778.0 1024778.0 0.0 pickle.dump(SP.idx_to_word, idx_to_word_out)
227 1 76339.0 76339.0 0.0 idx_to_word_out.close()
228 1 235.0 235.0 0.0 word_to_idx_out = open(file_out_path + "/" + "word_to_idx.pickle", "wb")
229 1 202652.0 202652.0 0.0 pickle.dump(SP.word_to_idx, word_to_idx_out)
230 1 92133.0 92133.0 0.0 word_to_idx_out.close()
(ai) David-Laxers-MacBook-Pro:twenty_newsgroups davidlaxer$
After I changed the batch size parameter of the nlp.pipe function to 500, I still got the "KeyError" I mentioned before. But when I cut the 20_newsgroups corpus down to a smaller size, it works. Is there anything special in the corpus or in the code?
Is this self.vocabulary, computed in the function compute_embed_matrix() in nlppipe.py, being used?

self.vocabulary = np.append(self.vocabulary, word)

It’s taking 98.5% of the time in compute_embed_matrix(), e.g.:

506 150836 68137415647.0 451731.8 98.5 self.vocabulary = np.append(self.vocabulary, word)

numpy appends require a physical copy (because numpy arrays use contiguous memory). If self.vocabulary needs to be a unique list, appending to a Python list is faster and more efficient; once the unique list is built, you can convert it to a numpy array.
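A sketch of the list-based alternative described above (variable names are illustrative, not taken from nlppipe.py): np.append copies the whole array on every call, for O(n^2) total work, whereas Python list appends are amortized O(1), with a single conversion at the end.

```python
import numpy as np

words = ["token_%d" % i for i in range(10000)]  # stand-in token stream

# Accumulate unique tokens in a Python list (amortized O(1) per append),
# then convert to a numpy array once at the end.
vocab_list = []
seen = set()
for w in words:
    if w not in seen:
        seen.add(w)
        vocab_list.append(w)
vocabulary = np.array(vocab_list)
```

The `seen` set also replaces any membership scan over the array, so uniqueness checks stay O(1) as the vocabulary grows.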
@dbl001 thanks for the help. I forgot to add the "texts" in the first line of the corpus. And I have some other questions: how can I get the top N topics for each doc and the top N words for each topic after training is finished? And how can I predict the topics of unseen docs?
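For what it's worth, once you have the trained tensors, both questions reduce to array operations. This is a hedged sketch with random stand-in arrays (the real `doc_weights`, `topic_embed`, and `word_embed` would come from the trained model): softmax the doc-topic weights and argsort for top topics per doc; rank words by cosine similarity to each topic vector for top words per topic.

```python
import numpy as np

# Hypothetical shapes; in practice these come from the trained model.
n_docs, n_topics, n_words, dim = 4, 3, 100, 16
rng = np.random.default_rng(0)
doc_weights = rng.normal(size=(n_docs, n_topics))  # unnormalized doc-topic weights
topic_embed = rng.normal(size=(n_topics, dim))     # topic vectors
word_embed = rng.normal(size=(n_words, dim))       # word vectors

# Top N topics per doc: softmax the weights, then argsort descending.
N = 2
doc_topic = np.exp(doc_weights) / np.exp(doc_weights).sum(1, keepdims=True)
top_topics = np.argsort(-doc_topic, axis=1)[:, :N]

# Top words per topic: cosine similarity between topic and word vectors.
tn = topic_embed / np.linalg.norm(topic_embed, axis=1, keepdims=True)
wn = word_embed / np.linalg.norm(word_embed, axis=1, keepdims=True)
top_words = np.argsort(-(tn @ wn.T), axis=1)[:, :10]  # word indices per topic
```

Mapping the word indices back through idx_to_word.pickle would then give readable topic words. Unseen docs are harder, since lda2vec learns doc weights as trainable parameters; one common workaround is to freeze the word and topic embeddings and fit only the new doc's weight vector.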
Did you manage to find a solution? Same KeyError.
Pushing new preprocessing today. Looking back, I have no idea why I made it so complicated. Much easier to understand version coming shortly.
Updated. Closing this issue.
I replaced the 20newsgroups data file with my own data file, which is about 24 MB. When I ran load_20newsgroups.py, it used about 90 GB of memory. Why does this implementation use so much memory?