ncbo / ncbo_cron

Jobs that run on a regular basis in the NCBO infrastructure

add a safety check for the pull mechanism to ignore HTML pages #14

Open alexskr opened 6 years ago

alexskr commented 6 years ago

Sometimes users enter an incorrect pull URL, which causes ncbo_cron to pull HTML pages and create a large number of bad submissions. Ideally, the script should quickly determine whether the pulled file is an HTML document.
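A minimal sketch of what such a quick check could look like, written in Python for illustration only; it is not ncbo_cron's actual code. The function name `looks_like_html` and the `sniff_bytes` parameter are hypothetical. It assumes that inspecting only the first kilobyte of the pulled file is enough to spot an HTML page.

```python
import re

# Markers that normally appear at the very start of an HTML page.
# Assumption: an ontology file (OWL, OBO, ...) will not contain these
# within its first kilobyte.
_HTML_MARKERS = re.compile(rb"<!doctype\s+html|<html[\s>]", re.IGNORECASE)


def looks_like_html(data: bytes, sniff_bytes: int = 1024) -> bool:
    """Return True if the start of `data` looks like an HTML document."""
    head = data[:sniff_bytes]
    # Drop a UTF-8 byte order mark, then any leading whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip()
    return _HTML_MARKERS.search(head) is not None
```

A caller would read the first bytes of the downloaded file and skip creating a submission when `looks_like_html` returns True. Checking the content rather than the `Content-Type` header is a deliberate choice here, since error pages are often served with misleading headers.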

alexskr commented 5 years ago

Might as well add a comprehensive file type detection/verification mechanism to filter out files such as images, HTML, and JS.
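One possible shape for a broader check, again as an illustrative Python sketch rather than anything taken from ncbo_cron: compare the leading bytes of the file against well-known file signatures. The name `sniff_unwanted_type` and the choice of signatures are assumptions. JavaScript has no fixed signature, so it is not covered here and would need a different heuristic.

```python
from typing import Optional

# Well-known leading byte signatures for formats that are never ontologies.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
]


def sniff_unwanted_type(head: bytes) -> Optional[str]:
    """Return a label for a recognized unwanted file type, or None."""
    # Binary formats: match the signature at offset 0.
    for magic, label in _SIGNATURES:
        if head.startswith(magic):
            return label
    # HTML: tolerate leading whitespace and any letter case.
    text = head.lstrip().lower()
    if text.startswith((b"<!doctype html", b"<html")):
        return "html"
    return None
```

A return value of None only means that nothing on the deny list matched; it does not prove the file is a valid ontology, so any parser validation would still have to run afterwards.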