pauleaster / jobsearch

MIT License

Match entire words and clear scraper status when successfully completed #20

Closed pauleaster closed 8 months ago

pauleaster commented 1 year ago

When searching for rust, the current code also matches trust, because it tests for substrings. In JobScraper.in_valid_link(), replace

    valid = all(term in soup_str for term in search_terms)

with

    valid = all(re.search(fr'\b{re.escape(term)}\b', soup_str) for term in search_terms)

Then create a Python migration script that re-parses the job_html of every row with valid=True for each search_term and sets valid to false where the row fails the modified test. Place this script in db_migrations as yyyymmdd_hhmmss_match_entire_words.py. Back up the existing psql database before running the script.
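The difference between the substring check and the word-boundary check can be sketched as below. This is a minimal illustration, not the project's code: the function names `matches_substring` and `matches_whole_words` and the sample strings are hypothetical, and only the two `all(...)` expressions come from the issue.

```python
import re


def matches_substring(text, search_terms):
    # Original behaviour: a term counts as found if it appears anywhere,
    # including inside a longer word.
    return all(term in text for term in search_terms)


def matches_whole_words(text, search_terms):
    # Proposed behaviour: each term must be delimited by word boundaries.
    # re.escape keeps characters such as '+' or '.' from being read as regex syntax.
    return all(
        re.search(rf"\b{re.escape(term)}\b", text) is not None
        for term in search_terms
    )


sample = "we value trust and write python daily"  # hypothetical job text
print(matches_substring(sample, ["rust", "python"]))    # True: "rust" is found inside "trust"
print(matches_whole_words(sample, ["rust", "python"]))  # False: "rust" is not a whole word here
print(matches_whole_words("rust and python", ["rust", "python"]))  # True
```

One caveat with `\b`: it only matches between a word character and a non-word character, so a term that ends in punctuation, such as `c++`, will not match when followed by a space. Terms like that may need a different delimiter rule.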

pauleaster commented 1 year ago

I have decided not to proceed with this at this point in time.

pauleaster commented 8 months ago

Reopened the issue. I decided not to apply the change retroactively to the database; matching will be correct going forward. The check now extracts the visible text, so HTML tags are ignored, and searches for the search_term phrase as a whole instead of for individual words.

    # Extract visible text from the soup object
    visible_text = soup.get_text(separator=' ', strip=True).lower()

    # Prepare regex pattern for exact phrase match with word boundaries
    pattern = r'\b{}\b'.format(re.escape(search_term.lower()))
    valid = bool(re.search(pattern, visible_text))
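
To show why matching against visible text differs from matching against the raw HTML string, here is a self-contained sketch. It is an assumption-laden stand-in: it uses the standard library's `html.parser.HTMLParser` in place of BeautifulSoup's `get_text()` so that it runs without third-party packages, and the names `VisibleTextExtractor`, `visible_text_of`, `phrase_in_visible_text`, and the sample page are all hypothetical. Its handling of `<script>` and `<style>` is an approximation and may not match what `get_text()` returns.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text_of(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower()


def phrase_in_visible_text(html, search_term):
    # Same pattern as in the comment above: whole phrase, word boundaries.
    pattern = r"\b{}\b".format(re.escape(search_term.lower()))
    return bool(re.search(pattern, visible_text_of(html)))


# Hypothetical page: "rust developer" occurs only inside a class attribute.
page = '<div class="rust developer"><p>Senior Python Developer wanted</p></div>'

print("rust developer" in page.lower())                  # True: raw HTML contains the attribute value
print(phrase_in_visible_text(page, "rust developer"))    # False: not in the visible text
print(phrase_in_visible_text(page, "python developer"))  # True: phrase appears in order
print(phrase_in_visible_text(page, "developer python"))  # False: words present, phrase is not
```

Searching for the phrase rather than for each word means word order and adjacency now matter, which is stricter than the earlier `all(...)` check over individual terms.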

pauleaster commented 8 months ago

Also empty the scraper state after completion.