A possible source of the exception is the matrix generated by the scikit-learn class CountVectorizer. Its methods are called in the text.py method get_bigram_matrix():
```python
def get_bigram_matrix(titles):
    vect = CountVectorizer(min_df=5, ngram_range=(2, 2), analyzer='word').fit(titles)
    feature_names = np.array(vect.get_feature_names())
    X_v = vect.transform(titles)
    return (feature_names, X_v)
```
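For context, here is a minimal standalone sketch (toy data, not from the repo) of what this produces: the sparse matrix X_v has one row per title and one column per distinct bigram, so its dense size grows with both the number of profiles and the bigram vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# toy titles, not from the repo; min_df lowered to 1 so the tiny sample keeps features
titles = ["senior data engineer", "data engineer remote", "senior data analyst"]

vect = CountVectorizer(min_df=1, ngram_range=(2, 2), analyzer='word').fit(titles)
feature_names = np.array(vect.get_feature_names_out())  # get_feature_names() on older scikit-learn
X_v = vect.transform(titles)  # scipy sparse matrix: one row per title, one column per bigram

print(feature_names)  # ['data analyst' 'data engineer' 'engineer remote' 'senior data']
print(X_v.shape)      # (3, 4)
```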
get_bigram_matrix() is called from the match.py method score_profile_title():
```python
def score_profile_title():
    global profiles
    # 01 load job profiles from sql
    load_job_profiles()
    # 02 get bigram matrix from profiles
    tx.push_tag_gsheets_to_sql(skip=['title'])
    profiles = tx.add_clean_deranked_titles(profiles)
    feature_names, X_v = tx.get_bigram_matrix(profiles.deranked_title)
    title_tags = tx.tag_sheets['title']['data'].copy()
    feature_scores = pd.merge(pd.DataFrame({'tag': feature_names}),
                              title_tags[['tag', 'score']], on='tag', how='left')['score'].values
    feature_scores = [0 if pd.isna(x) else x for x in feature_scores]
    title_scores = tx.np.array(tx.np.matmul(X_v.todense(),
                               tx.np.array(feature_scores).transpose()))[0]
    profiles['title_score'] = title_scores
```
One possible workaround is to reduce the size of the matrix by building it from a subset of the profiles table. Profiles that have already been scored can be dropped from the set still to be scored, and the rows kept for training the classifier can be trimmed to the most recent records or randomly sampled to keep the training matrix small, as sketched below.
```python
profiles = tx.add_clean_deranked_titles(profiles)
subset = ....
feature_names, X_v = tx.get_bigram_matrix(subset.deranked_title)
```
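A rough sketch of how that subset could be built, in the same context as the snippet above. The title_score and post_date column names and the 90-day window are assumptions for illustration, not the repo's actual schema:

```python
import pandas as pd

profiles = tx.add_clean_deranked_titles(profiles)

# hypothetical: rows without a score still need to be scored
unscored = profiles[profiles['title_score'].isna()]
# hypothetical: keep only recent rows for training (or .sample(n=...) for a random subset)
cutoff = pd.Timestamp.now() - pd.Timedelta(days=90)
recent = profiles[profiles['post_date'] >= cutoff]

subset = profiles.loc[unscored.index.union(recent.index)]
feature_names, X_v = tx.get_bigram_matrix(subset.deranked_title)
```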
Exception message:
```
Traceback (most recent call last):
  File "report.py", line 170, in <module>
    autorun()
  File "report.py", line 162, in autorun
    update_job_profiles(limit)
  File "report.py", line 113, in update_job_profiles
    screen_jobs()
  File "report.py", line 106, in screen_jobs
    match.screen_jobs()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 86, in screen_jobs
    update_matches()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 108, in update_matches
    score_positions()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 126, in score_positions
    score_profile_title()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 152, in score_profile_title
    tx.np.array(feature_scores).transpose()))[0]
MemoryError: Unable to allocate 318. MiB for an array with shape (23861, 1749) and data type float64
```
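The reported shape is consistent with the dense matrix produced by X_v.todense() in score_profile_title(): 23,861 profile rows × 1,749 bigram features × 8 bytes per float64 ≈ 318 MiB, so the allocation grows with both the number of profiles and the size of the bigram vocabulary.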
The exception does not appear when screen_jobs() is run as a separate process from update_job_profiles().
The problem wasn't fully resolved. Added a 90-day filter to limit the size of the matrix.
Problem running update_job_profiles():
The memory (MB) limit is reached. I haven't isolated the specific line of code that throws the exception. It goes away if the number of jobs searched at a time is reduced, e.g. from 400 to 100.
Theory 1) It has to do with the large matrix built from all of the past records. This could probably be avoided by limiting it to a subset, such as the most recent records.