Significant changes made to the codebase with this commit:
cleanjob.py - adds several more regexp_replace calls to clean the data.
intakejob.py - migrates the JSON collation from pandas, which required extensive schema matching, to pyspark, which simplifies the code significantly.
keywordjob.py - get_hotwords is expanded and cleaned of fragments that remain after intakejob. A dupe_word_filter is created but not yet in use, as I wasn't happy with the results so far. TBA.
pipe_utils.py - now uses w3lib.html.replace_entities to clean up HTML entities / UTF replacements.
reportjob.py - experiments with the 100 most common keywords instead of 20.
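
For reference, the reportjob.py change amounts to raising the cutoff on a "most common keywords" ranking. The sketch below is only an illustration of that idea, not the code in reportjob.py: top_keywords is a hypothetical helper, and it assumes the keywords arrive as a flat list of strings rather than a Spark DataFrame.

```python
from collections import Counter


def top_keywords(keywords, n=100):
    """Return the n most common keywords as (keyword, count) pairs.

    Hypothetical helper for illustration; n=100 mirrors the new cutoff,
    where the previous cutoff would have been n=20.
    """
    return Counter(keywords).most_common(n)


# Made-up sample data, not taken from the real pipeline.
sample = ["spark", "python", "spark", "sql", "python", "spark"]
print(top_keywords(sample, n=2))  # [('spark', 3), ('python', 2)]
```

Keeping the cutoff as a parameter makes it easy to switch between 20 and 100 while comparing reports.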