thp / urlwatch

Watch (parts of) webpages and get notified when something changes via e-mail, on your phone or via other means. Highly configurable.
https://thp.io/2008/urlwatch/

html2text doesn't remove JS? #708

Closed devicenull closed 2 years ago

devicenull commented 2 years ago

I have the following config:

name: "fire"
url: "https://www.pfd2.org/meeting-minutes/2022-meeting-minutes/"
filter:
  - html2text:
  - strip:

However, when I get email reports, they include a large blob of JavaScript:

- var fileaway_filetype_groups = {'adobe' : ['abf', 'aep', 'afm', 'ai', 'as', 'eps', 'fla', 'flv', 'fm', 'indd', 'pdd', 'pdf', 'pmd', 'ppj', 'prc', 'ps', 'psb', 'psd', 'swf'], 'application' : ['bat', 'dll', 'exe', 'msi'], 'audio' : ['aac', 'aif', 'aifc', 'aiff', 'amr', 'ape', 'au', 'bwf', 'flac', 'iff', 'gsm', 'la', 'm4a', 'm4b', 'm4p', 'mid', 'mp2', 'mp3', 'mpc', 'ogg', 'ots', 'ram', 'raw', 'rex', 'rx2', 'spx', 'swa', 'tta', 'vox', 'wav', 'wma', 'wv'], 'compression' : ['7z', 'a', 'ace', 'afa', 'ar', 'bz2', 'cab', 'cfs', 'cpio', 'cpt', 'dar', 'dd', 'dmg', 'gz', 'lz', 'lzma', 'lzo', 'mar', 'rar', 'rz', 's7z', 'sda', 'sfark', 'shar', 'tar', 'tgz', 'xz', 'z', 'zip', 'zipx', 'zz'], 'css' : ['css', 'less', 'sass', 'scss'], 'image' : ['bmp', 'dds', 'exif', 'gif', 'hdp', 'hdr', 'iff', 'jfif', 'jpeg', 'jpg', 'jxr', 'pam', 'pbm', 'pfm', 'pgm', 'png', 'pnm', 'ppm', 'raw', 'rgbe', 'tga', 'thm', 'tif', 'tiff', 'webp', 'wdp', 'yuv'], 'msdoc' : ['doc', 'docm', 'docx', 'dot', 'dotx'], 'msexcel' : ['xls', 'xlsm', 'xlsb', 'xlsx', 'xlt', 'xltm', 'xltx', 'xlw'], 'openoffice' : ['dbf', 'dbf4', 'odp', 'ods', 'odt', 'stc', 'sti', 'stw', 'sxc', 'sxi', 'sxw'], 'powerpoint' : ['pot', 'potm', 'potx', 'pps', 'ppt', 'pptm', 'pptx', 'pub'], 'script' : ['asp', 'cfm', 'cgi', 'clas', 'class', 'cpp', 'htm', 'html', 'java', 'js', 'php', 'pl', 'py', 'rb', 'shtm', 'shtml', 'xhtm', 'xhtml', 'xml', 'yml'], 'text' : ['123', 'csv', 'log', 'psw', 'rtf', 'sql', 'txt', 'uof', 'uot', 'wk1', 'wks', 'wpd', 'wps'], 'video' : ['avi', 'divx', 'mov', 'm4p', 'm4v', 'mkv', 'mp4', 'mpeg', 'mpg', 'ogv', 'qt', 'rm', 'rmvb', 'vob', 'webm', 'wmv']}; var ssfa_filetype_icons = {'adobe' : '!', 'application' : 'T', 'audio' : 'C', 'compression' : ''', 'css' : '(', 'image' : '1', 'msdoc' : '#', 'msexcel' : '$', 'openoffice' : '"', 'powerpoint' : '&', 'script' : '%', 'text' : '.', 'video' : 'W', 'unknown' : ')'} img#wpstats{display:none}
+ var fileaway_filetype_groups = {'adobe' : ['abf', 'aep', 'afm', 'ai', 'as', 'eps', 'fla', 'flv', 'fm', 'indd', 'pdd', 'pdf', 'pmd', 'ppj', 'prc', 'ps', 'psb', 'psd', 'swf'], 'application' : ['bat', 'dll', 'exe', 'msi'], 'audio' : ['aac', 'aif', 'aifc', 'aiff', 'amr', 'ape', 'au', 'bwf', 'flac', 'iff', 'gsm', 'la', 'm4a', 'm4b', 'm4p', 'mid', 'mp2', 'mp3', 'mpc', 'ogg', 'ots', 'ram', 'raw', 'rex', 'rx2', 'spx', 'swa', 'tta', 'vox', 'wav', 'wma', 'wv'], 'compression' : ['7z', 'a', 'ace', 'afa', 'ar', 'bz2', 'cab', 'cfs', 'cpio', 'cpt', 'dar', 'dd', 'dmg', 'gz', 'lz', 'lzma', 'lzo', 'mar', 'rar', 'rz', 's7z', 'sda', 'sfark', 'shar', 'tar', 'tgz', 'xz', 'z', 'zip', 'zipx', 'zz'], 'css' : ['css', 'less', 'sass', 'scss'], 'image' : ['bmp', 'dds', 'exif', 'gif', 'hdp', 'hdr', 'iff', 'jfif', 'jpeg', 'jpg', 'jxr', 'pam', 'pbm', 'pfm', 'pgm', 'png', 'pnm', 'ppm', 'raw', 'rgbe', 'tga', 'thm', 'tif', 'tiff', 'webp', 'wdp', 'yuv'], 'msdoc' : ['doc', 'docm', 'docx', 'dot', 'dotx'], 'msexcel' : ['xls', 'xlsm', 'xlsb', 'xlsx', 'xlt', 'xltm', 'xltx', 'xlw'], 'openoffice' : ['dbf', 'dbf4', 'odp', 'ods', 'odt', 'stc', 'sti', 'stw', 'sxc', 'sxi', 'sxw'], 'powerpoint' : ['pot', 'potm', 'potx', 'pps', 'ppt', 'pptm', 'pptx', 'pub'], 'script' : ['asp', 'cfm', 'cgi', 'clas', 'class', 'cpp', 'htm', 'html', 'java', 'js', 'php', 'pl', 'py', 'rb', 'shtm', 'shtml', 'xhtm', 'xhtml', 'xml', 'yml'], 'text' : ['123', 'csv', 'log', 'psw', 'rtf', 'sql', 'txt', 'uof', 'uot', 'wk1', 'wks', 'wpd', 'wps'], 'video' : ['avi', 'divx', 'mov', 'm4p', 'm4v', 'mkv', 'mp4', 'mpeg', 'mpg', 'ogv', 'qt', 'rm', 'rmvb', 'vob', 'webm', 'wmv']}; var ssfa_filetype_icons = {'adobe' : '!', 'application' : 'T', 'audio' : 'C', 'compression' : ''', 'css' : '(', 'image' : 'b', 'msdoc' : '#', 'msexcel' : '$', 'openoffice' : '"', 'powerpoint' : '&', 'script' : '%', 'text' : '.', 'video' : 'W', 'unknown' : ')'} img#wpstats{display:none}

Apparently this list changes fairly regularly - any suggestions other than using the grepi filter?

thp commented 2 years ago

If you are only interested in e.g. when a new PDF on that page gets added, maybe a CSS filter like this could do the trick?

filter:
    - css: '#ssfa-table-1010 span.ssfa-filename'
    - html2text:

If you want to have the whole table contents as text:

filter:
    - css: '#ssfa-table-1010'
    - html2text:
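
For reference, here is a sketch of how the second variant might look when merged into the job from the original report. The name, URL and trailing strip filter are copied from that report, and the selector is the one suggested above. It assumes the page keeps the table id `#ssfa-table-1010`. The idea is that the css filter narrows the document to the table before html2text runs, so the inline script elsewhere on the page should never reach the text conversion.

```yaml
# Sketch only: original job plus the suggested css filter.
name: "fire"
url: "https://www.pfd2.org/meeting-minutes/2022-meeting-minutes/"
filter:
  # Keep only the file table (assumes this id stays stable on the page).
  - css: '#ssfa-table-1010'
  # Convert the remaining HTML fragment to plain text.
  - html2text:
  # Trim leading and trailing whitespace, as in the original config.
  - strip:
```

If the table id changes, the selector would stop matching and need updating.
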

thp commented 2 years ago

@devicenull Any updates? Did the answer above fix your issue?

devicenull commented 2 years ago

Yeah, sorry, I forgot to come back and close this.