pauleaster closed this issue 8 months ago
I have decided not to proceed with this at this point in time.
Reopened the issue. I decided not to apply the fix retroactively to the database, but matching will work correctly going forward. The check now runs on the extracted visible text, so HTML tags are ignored, and it searches for search_term as a whole phrase instead of as individual words.
import re

# Extract visible text from the soup object
visible_text = soup.get_text(separator=' ', strip=True).lower()
# Prepare regex pattern for exact phrase match with word boundaries
pattern = r'\b{}\b'.format(re.escape(search_term.lower()))
valid = bool(re.search(pattern, visible_text))
Also empty the scraper state after completion.
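As a rough illustration of the phrase match described in this comment, the sketch below wraps the same pattern in a helper. The name phrase_matches is hypothetical, and visible_text is assumed to be the string already produced by soup.get_text(separator=' ', strip=True), so the example runs without BeautifulSoup.

```python
import re


def phrase_matches(search_term: str, visible_text: str) -> bool:
    """Return True if search_term occurs as a whole phrase in visible_text.

    Hypothetical helper; visible_text stands in for the output of
    soup.get_text(separator=' ', strip=True).
    """
    pattern = r'\b{}\b'.format(re.escape(search_term.lower()))
    return bool(re.search(pattern, visible_text.lower()))


# The phrase is present as written
print(phrase_matches("rust developer", "Senior Rust Developer wanted"))  # True
# "trust" no longer counts as "rust"
print(phrase_matches("rust developer", "Trust our developer team"))      # False
# Both words are present, but not as a phrase
print(phrase_matches("rust developer", "Rust and Go developer"))         # False
```

One caveat worth keeping in mind: \b only matches between a word character and a non-word character, so a term that ends in punctuation, such as c++, would not match with this pattern even when it is present in the text.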
When searching for "rust", the present code will also match "trust". Replace:
valid = all(term in soup_str for term in search_terms)
with:
valid = all(re.search(fr'\b{re.escape(term)}\b', soup_str) for term in search_terms)
in JobScraper.in_valid_link().
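The difference between the two checks can be seen in a small standalone sketch. The two function names are hypothetical wrappers around the expressions quoted above, and soup_str is assumed to be a plain string.

```python
import re


def all_terms_substring(search_terms, soup_str):
    # Present behaviour: plain substring test, so "rust" is found inside "trust"
    return all(term in soup_str for term in search_terms)


def all_terms_whole_word(search_terms, soup_str):
    # Proposed behaviour: each term must be delimited by word boundaries
    return all(
        re.search(fr'\b{re.escape(term)}\b', soup_str) for term in search_terms
    )


text = "we value trust and teamwork"
print(all_terms_substring(["rust"], text))   # True  (false positive)
print(all_terms_whole_word(["rust"], text))  # False
print(all_terms_whole_word(["rust", "developer"], "rust developer role"))  # True
```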
Then we need to create a Python migration script that will parse the job_html of every valid=True row for each search_term and change valid to false when the row fails the modified test. Place this script in db_migrations as yyyymmdd_hhmmss_match_entire_words.py. Back up the existing psql database before running this script.
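A minimal sketch of the re-validation step in such a migration might look like the following. Everything about the data layout here is an assumption: rows are taken to be (job_id, search_term, job_html) tuples already selected where valid=True, search_term is assumed to split on whitespace into the individual terms, and matching is assumed to be case-insensitive. The database query and the update itself are left out, since they depend on the project's schema and database layer.

```python
import re


def term_matches(term: str, text: str) -> bool:
    # Whole-word test, the same pattern as the modified check
    return re.search(fr'\b{re.escape(term)}\b', text) is not None


def ids_failing_new_test(rows):
    """Return the job ids whose job_html no longer passes the whole-word test.

    rows: iterable of (job_id, search_term, job_html) tuples for rows that
    are currently valid=True. This layout is hypothetical.
    """
    failing = []
    for job_id, search_term, job_html in rows:
        terms = search_term.lower().split()  # assumption: terms are space-separated
        html = job_html.lower()              # assumption: case-insensitive match
        if not all(term_matches(term, html) for term in terms):
            failing.append(job_id)
    return failing


rows = [
    (1, "rust", "<p>Rust developer</p>"),
    (2, "rust", "<p>We value trust</p>"),
]
print(ids_failing_new_test(rows))  # [2]
```

The migration script would then set valid to false for the returned ids, after the database backup has been taken.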