taylorhickem / jobsearch

reports on job openings using web parsing

MB limit exception #45

Closed taylorhickem closed 3 years ago

taylorhickem commented 3 years ago

Problem running update_job_profiles()

Error: MB limit reached. I haven't isolated the specific line of code that throws the exception. The error goes away if the number of jobs searched at a time is reduced, e.g. from 400 to 100.

Theory 1: it has to do with the big matrix that uses all of the past records. It could probably be avoided by limiting the matrix to a subset, such as the most recent records.

taylorhickem commented 3 years ago

A possible source of the exception is the matrix generated by the scikit-learn class CountVectorizer and its methods:

These methods are called in text.py, in get_bigram_matrix():

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def get_bigram_matrix(titles):
    # fit a bigram (2-word) vocabulary, keeping only bigrams seen in at least 5 titles
    vect = CountVectorizer(min_df=5, ngram_range=(2, 2), analyzer='word').fit(titles)
    # note: get_feature_names() was removed in scikit-learn 1.2 in favour of get_feature_names_out()
    feature_names = np.array(vect.get_feature_names())
    X_v = vect.transform(titles)  # sparse count matrix: one row per title, one column per bigram
    return (feature_names, X_v)
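
A quick usage sketch with made-up titles (min_df=5 means a bigram must appear in at least 5 titles to be kept):

titles = ['data engineer'] * 5 + ['data analyst'] * 5
feature_names, X_v = get_bigram_matrix(titles)
print(feature_names)  # ['data analyst' 'data engineer']
print(X_v.shape)      # (10, 2)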

which is called from score_profile_title() in match.py:

def score_profile_title():
    global profiles
    #01 load job profiles from sql
    load_job_profiles()
    #02 refresh tag sheets and clean/de-rank the titles
    tx.push_tag_gsheets_to_sql(skip=['title'])
    profiles = tx.add_clean_deranked_titles(profiles)
    #03 get bigram matrix from the cleaned titles
    feature_names, X_v = tx.get_bigram_matrix(profiles.deranked_title)
    #04 look up a score for each bigram from the 'title' tag sheet; unmatched bigrams score 0
    title_tags = tx.tag_sheets['title']['data'].copy()
    feature_scores = pd.merge(pd.DataFrame({'tag': feature_names}),
                              title_tags[['tag', 'score']], on='tag', how='left')['score'].values
    feature_scores = [0 if pd.isna(x) else x for x in feature_scores]
    #05 score the titles; X_v.todense() materializes the full (titles x bigrams)
    #   float64 matrix, which is where the MemoryError below is raised
    title_scores = tx.np.array(tx.np.matmul(X_v.todense(),
                                            tx.np.array(feature_scores).transpose()))[0]
    profiles['title_score'] = title_scores
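
A possible fix, not part of the repo code above, is to keep X_v sparse: scipy sparse matrices support matrix-vector products directly, so the dense matrix never needs to be materialized:

    # sketch: sparse matrix-vector product, avoids X_v.todense()
    scores_vec = tx.np.asarray(feature_scores, dtype=float)  # shape (n_bigrams,)
    title_scores = X_v.dot(scores_vec)                       # 1-D ndarray, shape (n_titles,)
    profiles['title_score'] = title_scores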

One possible workaround is to reduce the size of the matrix by building it from a subset of the profiles. Profiles which have already been scored can be dropped from the test set to be scored, and the profiles kept to train the classifier can be trimmed to the most recent records or randomly sampled to reduce the size of the training matrix. One possible subset rule is sketched after the fragment below.

    profiles = tx.add_clean_deranked_titles(profiles)
    subset = ....
    feature_names, X_v = tx.get_bigram_matrix(subset.deranked_title)
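
A minimal sketch of one subset rule, assuming title_score is NaN for rows not yet scored; the column usage and the sample size of 5000 are illustrative assumptions, not repo code:

    # hypothetical: keep all unscored profiles (the rows that still need a score)
    unscored = profiles[profiles['title_score'].isna()]
    # plus a random sample of already-scored profiles to fit the vectorizer on
    scored = profiles[profiles['title_score'].notna()]
    subset = pd.concat([unscored, scored.sample(n=min(5000, len(scored)), random_state=0)])
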
taylorhickem commented 3 years ago

Exception message:

Traceback (most recent call last):
  File "report.py", line 170, in <module>
    autorun()
  File "report.py", line 162, in autorun
    update_job_profiles(limit)
  File "report.py", line 113, in update_job_profiles
    screen_jobs()
  File "report.py", line 106, in screen_jobs
    match.screen_jobs()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 86, in screen_jobs
    update_matches()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 108, in update_matches
    score_positions()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 126, in score_positions
    score_profile_title()
  File "C:\..\Python36\Lib\jobsearch\match.py", line 152, in score_profile_title
    tx.np.array(feature_scores).transpose()))[0]
MemoryError: Unable to allocate 318. MiB for an array with shape (23861, 1749) and data type float64
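
For scale, the allocation numpy reports matches the dense float64 matrix exactly:

rows, cols = 23861, 1749       # shape from the MemoryError above
mib = rows * cols * 8 / 2**20  # float64 = 8 bytes per element
print(round(mib, 1))           # 318.4 MiB
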
taylorhickem commented 3 years ago

The exception does not occur when screen_jobs() is run as a separate process from update_job_profiles().
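
That suggests memory accumulates across the combined run. A minimal sketch of the separate-process idea using the standard library's multiprocessing (the wrapper function is hypothetical):

import multiprocessing as mp

def run_screen_jobs():
    import match
    match.screen_jobs()  # hypothetical wrapper: runs only the screening step

if __name__ == '__main__':
    p = mp.Process(target=run_screen_jobs)
    p.start()
    p.join()  # memory held by screen_jobs is released when the child process exits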

taylorhickem commented 3 years ago

The problem wasn't fully resolved. Added a 90-day filter to limit the size of the matrix.
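
A sketch of what the 90-day filter could look like, assuming the profiles table has a posting-date column; the column name 'date_posted' is a placeholder, not the repo's actual schema:

    # hypothetical: keep only profiles posted within the last 90 days
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=90)
    recent = profiles[pd.to_datetime(profiles['date_posted']) >= cutoff]
    feature_names, X_v = tx.get_bigram_matrix(recent.deranked_title)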