Closed ruoyu-qian closed 4 years ago
Yes! This is totally possible, but not super easy. I don't know the details, but here is some code from Lily Fierro, who was doing batch updates with corex_topic. You can see that when `i == 0` she does a first pass of training the model, and then in the `else` branch she takes a new batch (called `final` in this code) and updates the different parts of the model in sequence.
```python
import sys
sys.path.append('/opt/page-analysis/corex_topic-master/')
import corex_topic as ct
import numpy as np
import scipy.sparse as ss
import time
import os
import resource
import psutil

def load_sparse_csr(filedir, fl):
    loader = np.load(filedir + fl)
    full = ss.csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                         shape=loader['shape'])
    del loader
    return full

def check_dir(f):
    d = os.path.dirname(f)
    if not os.path.exists(d):
        os.makedirs(d)

if __name__ == "__main__":
    soft_mem_set = (1024 ** 3) * 40  # limit memory usage to 40 GB
    hard_mem_set = (1024 ** 3) * 45  # limit memory usage to 45 GB
    resource.setrlimit(resource.RLIMIT_AS, (soft_mem_set, hard_mem_set))
    filedir = 'data/corex_sparse_chunks_jlrecoded/'
    colfile = 'sparse-columns-final.npz'
    files = sorted(os.listdir(filedir))  # paths to chunks of the full sparse matrix
    files.remove(colfile)
    files = files[2:4]
    col_loader = np.load(filedir + colfile)
    col = col_loader['data']
    save_dir = 'data/corex_test_jlrecode_batchdebug/'
    check_dir(save_dir)
    for i, fd in enumerate(files):
        try:
            t1 = time.time()
            final = load_sparse_csr(filedir, fd)  # load this batch of data
            final *= 0.5
            # final.data = np.where(np.isnan(final.data), 0.5, final.data)  # turn NaN into 0.5
            print('sparse matrix load+transformation time: {}'.format(time.time() - t1))
            process = psutil.Process(os.getpid())
            mem = process.memory_info()[0] / float(2 ** 20)
            print('Memory usage after sparse matrix construction: {}'.format(mem))
        except MemoryError:
            print('final sparse matrix larger than soft memory limit')
            sys.exit(1)
        try:
            if i == 0:
                # First batch: fit the model from scratch.
                out = ct.Corex(n_hidden=200, verbose=True, max_iter=10)
                process = psutil.Process(os.getpid())
                mem = process.memory_info()[0] / float(2 ** 20)
                print('Memory usage after CorEx model instance created: {}'.format(mem))
                t = time.time()
                out.fit(final)
                print('layer 0 time: {}'.format(time.time() - t))
                process = psutil.Process(os.getpid())
                mem = process.memory_info()[0] / float(2 ** 20)
                print('Memory usage after CorEx model fitted: {}'.format(mem))
                out.save(save_dir + 'layer_0-init.dat')
            else:
                # Later batches: update the fitted model's parameters in sequence.
                t = time.time()
                p_y_given_x, _, log_z = out.calculate_latent(final, out.theta)
                out.update_tc(log_z)
                out.log_p_y = out.calculate_p_y(p_y_given_x)
                out.theta = out.calculate_theta(final, p_y_given_x, out.log_p_y)
                out.alpha = out.calculate_alpha(final, p_y_given_x, out.theta,
                                                out.log_p_y, out.tcs)
                process = psutil.Process(os.getpid())
                mem = process.memory_info()[0] / float(2 ** 20)
                print('Memory usage after CorEx model updated: {}'.format(mem))
                print('layer 0 update time: {}'.format(time.time() - t))
                out.save(save_dir + 'layer_0-update' + str(i) + '.dat')
        except MemoryError:
            print('CorEx run reaches soft memory limit')
        del final
```
@gregversteeg Thank you so much for the help! The code from Lily Fierro is very insightful. I built an anchored CorEx model `guided_model` with one initial dataset and then built a new document-term matrix `new_matrix` from a new batch of data. I made `new_matrix` contain exactly the same vocabulary as the original matrix. With these two, I updated the model using the following calls, as she did in the `else` statement:
```python
guided_model.update_word_parameters(new_matrix, None)
p_y_given_x, _, log_z = guided_model.calculate_latent(new_matrix, guided_model.theta)
guided_model.log_p_y = guided_model.calculate_p_y(p_y_given_x)
guided_model.theta = guided_model.calculate_theta(new_matrix, p_y_given_x, guided_model.log_p_y)
guided_model.alpha = guided_model.calculate_alpha(new_matrix,
                                                  p_y_given_x,
                                                  guided_model.theta,
                                                  guided_model.log_p_y,
                                                  guided_model.tcs)
```
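For anyone trying the same thing, keeping the columns of `new_matrix` aligned with the original vocabulary is the key step. Here is a minimal sketch of one way to do it with plain NumPy; the function name and the binary bag-of-words encoding are my own choices, not part of CorEx, and in practice `sklearn.feature_extraction.text.CountVectorizer(vocabulary=...)` does the same job:

```python
import numpy as np

def doc_term_matrix(docs, vocabulary):
    """Binary document-term matrix over a *fixed* vocabulary.

    Words outside `vocabulary` are dropped, so every batch produces
    columns in exactly the same order as the original matrix.
    """
    index = {w: j for j, w in enumerate(vocabulary)}
    X = np.zeros((len(docs), len(vocabulary)), dtype=np.int8)
    for i, doc in enumerate(docs):
        for w in doc.lower().split():
            j = index.get(w)
            if j is not None:
                X[i, j] = 1
    return X

vocab = ['topic', 'model', 'corpus']
X_new = doc_term_matrix(['New corpus for the topic model', 'model update'], vocab)
print(X_new)  # → [[1 1 1]
              #    [0 1 0]]
```

Any out-of-vocabulary words in the new batch are simply ignored, which is exactly why updating with new vocabulary needs extra work on the model side.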
I think it worked and did update the model, but as it stands it does not let me grow the vocabulary. I'm planning to look at the `initialize_parameters` function to see whether I can add new vocabulary to the model. Do you have any suggestions for updating the vocabulary?
Thanks a lot!
Updating the vocabulary is interesting; I hadn't thought of that before. Sorry, I don't have any concrete ideas. All of those parameters (theta and alpha, I guess) would have to be resized, with the entries for the new words given some random initial values.
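That resizing could be sketched roughly as below, using plain NumPy. The shapes here are hypothetical (`alpha` as `(n_hidden, n_words)` and words on the last axis of `theta`), as are the function name and initialization choices; the actual array layouts in your corex_topic version should be checked before applying anything like this to a real model:

```python
import numpy as np

def grow_vocabulary(alpha, theta, n_new_words, seed=0):
    """Append entries for new words along the last (word) axis.

    Hypothetical scheme: new words get zero alpha (not yet assigned to
    any topic) and small random theta values so later updates can move them.
    """
    rng = np.random.default_rng(seed)
    alpha_pad = np.zeros(alpha.shape[:-1] + (n_new_words,))
    theta_pad = rng.normal(scale=1e-3, size=theta.shape[:-1] + (n_new_words,))
    return (np.concatenate([alpha, alpha_pad], axis=-1),
            np.concatenate([theta, theta_pad], axis=-1))

# Toy example: 3 topics, 5 old words, 2 new words.
alpha = np.ones((3, 5))
theta = np.zeros((2, 3, 5))
alpha2, theta2 = grow_vocabulary(alpha, theta, 2)
print(alpha2.shape, theta2.shape)  # → (3, 7) (2, 3, 7)
```

Even with the arrays resized, the new columns of the document-term matrix would still need to line up with the appended parameter entries, and the update math itself is untouched here.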
Closing because of inactivity; an elegant update of the vocabulary would take quite some work in terms of the mathematics underlying CorEx.
Hi guys, thank you so much for developing and sharing the CorEx model. I've been working on an NLP project and have found the anchored model super helpful. I'm wondering whether it is possible to do batch processing or incremental modeling with CorEx. For example, if I have already built a model and a new batch of documents comes in with new vocabulary, is it possible to update the original model with the new data?
Thank you!