gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
626 stars 119 forks

Incremental Modeling #31

Closed ruoyu-qian closed 4 years ago

ruoyu-qian commented 4 years ago

Hi guys, thank you so much for developing and sharing the CorEx model. I've been working on an NLP project and have found the anchored model super helpful. I'm wondering whether it is possible to do batch processing or incremental modeling with CorEx? For example, if I have already built a model but a new batch of documents comes in with new vocabulary, is it possible to update the original model with the new data?

Thank you!

gregversteeg commented 4 years ago

Yes! This is totally possible, but not super easy. I don't know the details of how it works, but here is some code from Lily Fierro, who was doing batch updates using corex_topic. You can see that when i == 0 she does a first pass of training the model; then in the else branch she takes a new batch (called final in this code) and updates different parts of the model in sequence.

```python
import sys
sys.path.append('/opt/page-analysis/corex_topic-master/')
import corex_topic as ct
import numpy as np
import scipy.sparse as ss
import time
import os
import resource
import psutil

def load_sparse_csr(filedir, fl):
    """Load a scipy CSR matrix saved with np.savez."""
    loader = np.load(filedir + fl)
    full = ss.csr_matrix((loader['data'], loader['indices'], loader['indptr']), shape=loader['shape'])
    del loader
    return full

def check_dir(f):
    """Create the directory for f if it does not exist yet."""
    d = os.path.dirname(f)
    if not os.path.exists(d):
        os.makedirs(d)

if __name__ == "__main__":
    soft_mem_set = (1024**3) * 40  # soft limit: 40 GB
    hard_mem_set = (1024**3) * 45  # hard limit: 45 GB
    resource.setrlimit(resource.RLIMIT_AS, (soft_mem_set, hard_mem_set))
    filedir = 'data/corex_sparse_chunks_jlrecoded/'
    colfile = 'sparse-columns-final.npz'
    files = sorted(os.listdir(filedir))  # chunks of the full sparse matrix
    files.remove(colfile)
    files = files[2:4]  # restrict to two chunks for this debug run
    col_loader = np.load(filedir + colfile)
    col = col_loader['data']
    save_dir = 'data/corex_test_jlrecode_batchdebug/'
    check_dir(save_dir)
    for i, fd in enumerate(files):
        try:
            t1 = time.time()
            final = load_sparse_csr(filedir, fd)  # load one batch
            final *= 0.5
            # final.data = np.where(np.isnan(final.data), 0.5, final.data)  # turn NaN into 0.5
            print('sparse matrix load+transformation time: {}'.format(time.time() - t1))
            process = psutil.Process(os.getpid())
            mem = process.memory_info()[0] / float(2 ** 20)  # resident set size in MB
            print('Memory usage after sparse matrix construction: {}'.format(mem))
        except MemoryError:
            print('final sparse matrix larger than soft memory limit')
            sys.exit(1)
        try:
            if i == 0:
                # First batch: fit the model from scratch.
                out = ct.Corex(n_hidden=200, verbose=True, max_iter=10)
                process = psutil.Process(os.getpid())
                mem = process.memory_info()[0] / float(2 ** 20)
                print('Memory usage after CorEx model instance created: {}'.format(mem))
                t = time.time()
                out.fit(final)
                print('layer 0 time: {}'.format(time.time() - t))
                process = psutil.Process(os.getpid())
                mem = process.memory_info()[0] / float(2 ** 20)
                print('Memory usage after CorEx model fitted: {}'.format(mem))
                out.save(save_dir + 'layer_0-init.dat')
            else:
                # Later batches: update the fitted model's parameters in place.
                t = time.time()
                p_y_given_x, _, log_z = out.calculate_latent(final, out.theta)
                out.update_tc(log_z)
                out.log_p_y = out.calculate_p_y(p_y_given_x)
                out.theta = out.calculate_theta(final, p_y_given_x, out.log_p_y)
                out.alpha = out.calculate_alpha(final, p_y_given_x, out.theta, out.log_p_y, out.tcs)
                process = psutil.Process(os.getpid())
                mem = process.memory_info()[0] / float(2 ** 20)
                print('Memory usage after CorEx model updated: {}'.format(mem))
                print('layer 0 update time: {}'.format(time.time() - t))
                out.save(save_dir + 'layer_0-update' + str(i) + '.dat')
        except MemoryError:
            print('CorEx run reaches soft memory limit')
        del final  # free the batch before loading the next one
```

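By the way, save() in this code just pickles the whole Corex object (worth double-checking in your copy of corex_topic), so a saved layer can be restored later with pickle and then updated with further batches. A stand-in sketch of the round trip, using a plain dict in place of a real fitted model:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted Corex model; since save() pickles the whole object,
# restoring it is a plain pickle.load on the .dat file.
model = {'theta': [0.1, 0.9], 'alpha': [[1.0, 0.0]], 'tcs': [0.5]}

path = os.path.join(tempfile.mkdtemp(), 'layer_0-init.dat')
with open(path, 'wb') as f:
    pickle.dump(model, f)       # roughly what out.save(...) does

with open(path, 'rb') as f:
    restored = pickle.load(f)   # ready for further batch updates
```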
ruoyu-qian commented 4 years ago

@gregversteeg Thank you so much for the help! The code from Lily Fierro is very insightful. I built an anchored CorEx model, guided_model, with one initial dataset and then built a new document-term matrix, new_matrix, from a new batch of data. I made new_matrix contain exactly the same vocabulary as the original matrix. With these two, I updated the model using the following calls, as she did in the "else" statement:

```python
guided_model.update_word_parameters(new_matrix, None)
p_y_given_x, _, log_z = guided_model.calculate_latent(new_matrix, guided_model.theta)
guided_model.log_p_y = guided_model.calculate_p_y(p_y_given_x)
guided_model.theta = guided_model.calculate_theta(new_matrix, p_y_given_x, guided_model.log_p_y)
guided_model.alpha = guided_model.calculate_alpha(new_matrix,
                                                  p_y_given_x,
                                                  guided_model.theta,
                                                  guided_model.log_p_y,
                                                  guided_model.tcs)
```
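For what it's worth, I kept new_matrix on exactly the original vocabulary by fixing the vectorizer's vocabulary from the first batch. A minimal sketch with scikit-learn's CountVectorizer (the documents here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpora -- the real data is the document batches above.
original_docs = ["the cat sat", "the dog ran"]
new_docs = ["the cat ran fast"]  # "fast" is not in the original vocabulary

# Fit on the first batch once, then freeze the vocabulary...
vectorizer = CountVectorizer()
original_matrix = vectorizer.fit_transform(original_docs)

# ...and reuse it for every later batch, so column i always means the
# same word. Out-of-vocabulary terms ("fast") are silently dropped.
fixed = CountVectorizer(vocabulary=vectorizer.vocabulary_)
new_matrix = fixed.transform(new_docs)
```

The silent dropping of unseen terms is exactly why I'm now looking at how to grow the vocabulary instead.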

I think it worked and did update the model, but as it stands it does not allow me to grow the vocabulary. I'm planning to look at the initialize_parameters function to see whether I can add new vocabulary to the model.

Do you have any suggestions for updating the vocabulary?

Thanks a lot!

gregversteeg commented 4 years ago

Updating the vocabulary is interesting; I haven't thought about that before. Sorry, I don't have a worked-out answer: all of the word-indexed parameters (theta and alpha, at least) would have to be resized and given some random initial values for the new words, I guess.
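Roughly the kind of thing I mean, as a numpy sketch. The shapes below are made up, so check how theta and alpha are actually laid out in your version of corex_topic before trying this on a real fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes, just to show the resizing idea: every word-indexed axis
# of the fitted parameters has to be padded for the new columns.
n_hidden, n_words, n_new = 10, 50, 5

alpha = rng.random((n_hidden, n_words))     # stand-in for model.alpha
theta = rng.random((2, n_words, n_hidden))  # stand-in; axis 1 indexes words

# Existing entries are kept; new words get small random initial values.
alpha_ext = np.hstack([alpha, 0.5 * rng.random((n_hidden, n_new))])
theta_ext = np.concatenate([theta, rng.random((2, n_new, n_hidden))], axis=1)
```

The new batch's document-term matrix would then need the extra word columns appended as well (e.g. with scipy.sparse.hstack) so its width matches the resized parameters.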

ryanjgallagher commented 4 years ago

Closing because of inactivity. An elegant update of the vocabulary would also take quite some work in terms of the mathematics underlying CorEx.