hrbrmstr / carbondater

📆 Estimate the Age of Web Resources

Error in data.frame: 'mtag' and 'mval' lengths are different #1

Open ChrisMuir opened 6 years ago

ChrisMuir commented 6 years ago

Hi Bob,

Edit to add: I know this is a super new package, and I didn't expect everything to work perfectly. This issue is just to give you a heads-up on the error!

Just checking this pkg out, and I'm running into an error. I tried running carbondater::carbondate() on this site: http://www.sdfda.gov.cn/art/2017/12/21/art_3715_190173.html (it's a page from the government website of Shandong province in China). The function fails with:

Error in data.frame(mtag = rvest::html_nodes(x, xpath = ".//meta[@http-equiv or @itemprop or @name or @property]/\n             @*[name()='http-equiv' or name()='itemprop' or name()='name' or name()='property']") %>%  : 
  arguments imply differing number of rows: 25, 21

After looking into it some, the cause is that the page content contains four meta tags that have a name attribute but do NOT have a content attribute. So within the function get_earliest_pubdate(), data.frame() is being passed 25 mtag values but only 21 mval values.

Here's some minimal code to reproduce the length disparity:

library(magrittr)  # provides the %>% pipe used below

uri <- "http://www.sdfda.gov.cn/art/2017/12/21/art_3715_190173.html"
x <- carbondater:::safe_GET(uri, httr::user_agent(carbondater:::.ua))
x <- suppressMessages(httr::content(x))

mtag <- rvest::html_nodes(
  x,
  xpath=".//meta[@http-equiv or @itemprop or @name or @property]/
  @*[name()='http-equiv' or name()='itemprop' or name()='name' or name()='property']"
) %>% rvest::html_text() %>% tolower()

mval <- rvest::html_nodes(
  x,
  xpath=".//meta[@http-equiv or @itemprop or @name or @property]/@content"
) %>% rvest::html_text() %>% tolower()

cat(length(mtag), length(mval))
#> 25 21
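One way to keep the two vectors the same length, sketched here and not taken from the package: select each meta node once and read its attributes with xml2::xml_attr(), which returns NA when an attribute is missing. The inline HTML is a made-up stand-in for the real page, and pick_first() is a hypothetical helper name.

```r
library(xml2)

# Hypothetical stand-in for the real page: two of the four <meta> tags lack `content`.
html <- '<html><head>
<meta name="pubDate" content="2017-12-21 08:43:50">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body>
<meta name="ContentStart">
<p>body text</p>
<meta name="ContentEnd">
</body></html>'
doc <- read_html(html)

# Select each <meta> node once, rather than selecting attribute nodes separately.
nodes <- xml_find_all(doc, ".//meta[@http-equiv or @itemprop or @name or @property]")

# Hypothetical helper: for each node, take the first of `keys` that is present.
pick_first <- function(nodes, keys) {
  out <- rep(NA_character_, length(nodes))
  for (k in keys) {
    v <- xml_attr(nodes, k)           # NA where the attribute is absent
    out <- ifelse(is.na(out), v, out)
  }
  out
}

mtag <- tolower(pick_first(nodes, c("http-equiv", "itemprop", "name", "property")))
mval <- tolower(xml_attr(nodes, "content"))  # one entry per node, NA if missing

# Lengths always match now; rows without a content value can be dropped afterwards.
meta <- data.frame(mtag = mtag, mval = mval, stringsAsFactors = FALSE)
meta <- meta[!is.na(meta$mval), ]

cat(length(mtag), length(mval), nrow(meta), "\n")
```

Because both vectors are built from the same node set, a tag without content becomes an NA row instead of shifting every later value out of alignment.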

And here's the page content, printed as a char vector:

uri <- "http://www.sdfda.gov.cn/art/2017/12/21/art_3715_190173.html"
x <- carbondater:::safe_GET(uri, httr::user_agent(carbondater:::.ua))
x <- httr::content(x)
as.character(x)
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<html>\n<head>\n<title>关于食品中菌落总数的风险解读</title>\n<meta name=\"Keywords\" content=\"内容管理、内容管理发布(CMS)系统、信息发布、新闻采编发系统、知识管理、知识门户、政府门户、教育门户、企业门户、竞争情报系统、抓取系统、信息采集、信息雷达系统、电子政务、电子政务解决方案、办公系统、OA、网站办公系统\">\n<meta name=\"Generator\" content=\"大汉版通\">\n<meta name=\"Author\" content=\"大汉网络\">\n<meta name=\"Maketime\" content=\"2017-12-21 08:47:04\">\n<meta name=\"subsite\" content=\"山东省食品药品监督管理局\">\n<meta name=\"channel\" content=\"质量安全公告\">\n<meta name=\"category\" content=\"\">\n<meta name=\"author\" content=\"\">\n<meta name=\"pubDate\" content=\"2017-12-21 08:43:50\">\n<meta name=\"source\" content=\"综合处\">\n<meta name=\"language\" content=\"中文\">\n<meta name=\"location\" content=\"\">\n<meta name=\"department\" content=\"\">\n<meta name=\"title\" content=\"关于食品中菌落总数的风险解读\">\n<meta name=\"description\" content=\"关于食品中菌落总数的风险解读 菌落总数是指在一定培养条件下(如需氧情况、营养条件、酸碱度、培养温度等),每克(每毫克)检验样品所生长出来的菌落数。菌落总数测定是用来判定食品被细菌污染的程度及卫生质量,以便对被检样品作出适当的卫生评价。食用菌落总数超标的食品,可能会引起急性中毒、呕吐、腹泻等症状,危害人体健康安全。\">\n<meta name=\"guid\" content=\"20170190173\">\n<meta name=\"effectiveTime\" content=\"0\">\n<meta name=\"keyword\" content=\"食品 生产\">\n<meta name=\"level\" content=\"0\">\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n<link href=\"/script/page.css\" type=\"text/css\" rel=\"stylesheet\">\n<script language=\"javascript\" src=\"/module/jslib/jquery/jquery.js\"></script><meta name=\"pageSize\" content=\"1\">\n<link href=\"/images/261/sdsyjj_2015wzy.css\" rel=\"stylesheet\" type=\"text/css\">\n<style type=\"text/css\">\r\n    .bgall{\r\n\tbackground:url(/images/261/spyjj_1_01.jpg);\r\n\tbackground-repeat:repeat-x;\r\n\tbackground-position:center top;\r\n\t}\r\n\t.bgal2{\r\n\tbackground:url(/images/261/5-22db_02.jpg);\r\n\tbackground-repeat:no-repeat;\r\n\tbackground-position:center top;\r\n\theight:246px;\r\n\t}\r\n\t\r\n\r\n</style>\n</head>\n<body bgcolor=\"#FFFFFF\" 
leftmargin=\"0\" topmargin=\"0\" marginwidth=\"0\" marginheight=\"0\" background=\"/images/261/spyjj_11_02.jpg\">\r\n<div class=\"bgall\">\r\n<table width=\"1000\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\"><tr>\n<td height=\"31\" width=\"120\" style=\"color:#FFF; font-size:12px;\"><script language=\"javascript\" src=\"/script/0/1506011009337837.js\"></script></td>\r\n    <td width=\"130\" style=\"color:#FFF; font-size:12px;\" align=\"right\"><script language=\"javascript\" src=\"/script/0/1506010949045558.js\"></script></td>\r\n    <td width=\"200\"></td>\r\n    \r\n    <td width=\"550\" align=\"center\" style=\"color:#FFF;\"><script language=\"javascript\" src=\"/script/0/1506011010349918.js\"></script></td>\r\n  </tr></table>\n<table width=\"1000\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\"><tr>\n<td height=\"140\" align=\"center\"><script language=\"javascript\" src=\"/script/0/1506011457123840.js\"></script></td>\r\n  </tr></table>\n<table width=\"1000\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\"><tr>\n<td height=\"36\" width=\"738\" style=\"background:url(/images/261/zy0601_dqwzbg.jpg); background-repeat: repeat-x; background-position:center;\">\r\n    <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" height=\"100%\"><tr>\n<td width=\"36\" align=\"center\" valign=\"middle\"><img src=\"/images/261/zy0601_dqbg1.jpg\" height=\"19\" width=\"21\"></td>\r\n\t\t<td width=\"80\" align=\"left\" valign=\"middle\" class=\"font16dqwz\">当前位置</td>\r\n\t\t<td width=\"14\" align=\"left\" valign=\"middle\"><img src=\"/images/261/zy0601_dqbg03.jpg\" height=\"14\" width=\"3\"></td>\r\n        <td width=\"608\" class=\"font16\"><table border=\"0\" align=\"left\" cellpadding=\"0\" cellspacing=\"0\"><tr>\n<td><a href=\"/index.html\" class=\"font16\">首页</a></td>\n<td><table width=\"100%\" cellspacing=\"0\" cellpadding=\"0\" border=\"0\"><tr>\n<td> &gt; <a href=\"/col/col3556/index.html\" 
class=\"font16\">公众服务</a>\n</td>\n<td> &gt; <a href=\"/col/col3714/index.html\" class=\"font16\">公告通告</a>\n</td>\n<td> &gt; <a href=\"/col/col3715/index.html\" class=\"font16\">质量安全公告</a>\n</td>\n</tr></table></td>\n</tr></table></td>\r\n      </tr></table>\n</td>\r\n    <td width=\"10\"></td>\r\n    <td width=\"252\" align=\"center\"><script language=\"javascript\" src=\"/script/0/1506011331306679.js\"></script></td>\r\n  </tr></table>\n<table width=\"1000\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\" bgcolor=\"#FFFFFF\" style=\"margin-top:10px;\"><tr>\n<td height=\"600\" align=\"left\" valign=\"top\">\n<script language=\"javascript\">function doZoom(size){document.getElementById('zoom').style.fontSize=size+'px';}</script><table width=\"870\" align=\"center\" style=\"margin-top: 10px;\">\n<tr><td align=\"center\" class=\"title\">关于食品中菌落总数的风险解读<br>\n</td></tr>\n<tr><td>\n<table border=\"0\" align=\"center\" width=\"90%\"><tr>\n<td width=\"226\" align=\"center\">发布日期:2017-12-21</td>\n<td align=\"center\" width=\"306\">发布单位:综合处</td>\r\n<td width=\"232\">阅读次数:<script language=\"javascript\" src=\"/module/visitcount/articlehits.jsp?colid=3715&amp;artid=190173\">\n </script>\n</td>\n</tr></table>\n<table width=\"100%\" border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\" style=\"border-bottom: dotted 1px #dddddd;\"><tr>\n<td height=\"11\" align=\"left\" valign=\"middle\"></td>\r\n  </tr></table>\n<br>\n</td></tr>\n<tr><td class=\"bt_content\"><div id=\"zoom\">\n<meta name=\"ContentStart\">\n<!--ZJEG_RSS.content.begin--><meta name=\"ContentStart\">\n<p align=\"center\"><span style=\"font-size: 29px\"><span style=\"font-family: 宋体\">关于食品中菌落总数的风险解读</span></span></p>\r\n<p>     <span style=\"font-family: 宋体\"><span style=\"font-size: 21px\">菌落总数是指在一定培养条件下(如需氧情况、营养条件、酸碱度、培养温度等),每克(每毫克)检验样品所生长出来的菌落数。菌落总数测定是用来判定食品被细菌污染的程度及卫生质量,以便对被检样品作出适当的卫生评价。</span></span></p>\r\n<p><span style=\"font-family: 宋体\"><span style=\"font-size: 21px\">  
食用菌落总数超标的食品,可能会引起急性中毒、呕吐、腹泻等症状,危害人体健康安全。</span></span></p>\r\n<p><span style=\"font-family: 宋体\"><span style=\"font-size: 21px\">  菌落总数超标,原因可能是食品从原辅料到运输、贮存、加工成成品以及销售等各个环节受到了微生物污染,如:生产环境卫生状况不良,生产设备连续使用,不经常清洗、消毒,容易产生微生物滞留和滋生,造成食品污染;生产操作人员不按生产要求进行操作、对生产设备清洗不干净或消毒不严、加工过程中生熟不分;食品储存和运输中没有按照食品相应的条件(如冷链)进行储运等,都会导致菌落总数超标。</span></span></p>\r\n<p align=\"center\"> </p>\n<meta name=\"ContentEnd\">\n<!--ZJEG_RSS.content.end--><meta name=\"ContentEnd\">\n</div></td></tr>\n<tr><td class=\"bt_content\" align=\"left\" height=\"10\" style=\"padding-left:60px;\">\n<br><br>\n</td></tr>\n</table>\n<br><table width=\"870\" border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\" style=\"border-bottom: dotted 1px #CCCCCC;\"><tr>\n<td height=\"3\" align=\"left\" valign=\"middle\"></td>\r\n  </tr></table>\n<table width=\"870\" border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\"><tr>\n<td height=\"20\" align=\"left\" valign=\"middle\"></td>\r\n  </tr></table>\n<table width=\"870\" border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\"><tr>\n<td height=\"40\" colspan=\"2\" align=\"left\" valign=\"middle\"><script language=\"javascript\" src=\"/module/changepage/gettitle.jsp?appid=1&amp;showtip=0&amp;titlelimit=60&amp;webid=1&amp;cataid=3715&amp;catatype=2&amp;position=prev&amp;infoid=190173\"></script></td>\r\n  </tr></table>\n<table width=\"870\" border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\"><tr>\n<td height=\"10\" align=\"left\" valign=\"middle\"></td>\r\n  </tr></table>\n<table width=\"870\" border=\"0\" align=\"center\" cellpadding=\"0\" cellspacing=\"0\"><tr>\n<td height=\"40\" colspan=\"2\" align=\"left\" valign=\"middle\"><script language=\"javascript\" src=\"/module/changepage/gettitle.jsp?appid=1&amp;showtip=0&amp;titlelimit=60&amp;webid=1&amp;cataid=3715&amp;catatype=2&amp;position=next&amp;infoid=190173\"></script></td>\r\n  </tr></table>\n<table width=\"870\" border=\"0\" align=\"center\" cellpadding=\"0\" 
cellspacing=\"0\"><tr>\n<td height=\"10\" align=\"left\" valign=\"middle\"></td>\r\n  </tr></table>\n</td>\r\n  </tr></table>\n</div>\r\n  <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\" style=\"background-color:#328fef;\"><tr>\n<td height=\"30\" align=\"center\"><script language=\"javascript\" src=\"/script/0/1506011339227803.js\"></script></td>\r\n  </tr></table>\n<table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" align=\"center\" style=\"background-color:#FFFFFF;\"><tr>\n<td align=\"center\"><script language=\"javascript\" src=\"/script/0/1506011500112505.js\"></script></td>\r\n  </tr></table>\n</body>\n</html><html><a href=\"http://www.hanweb.com\" style=\"display:none\">Produced By 大汉网络 大汉版通发布系统</a></html>\n

The problematic tags are <meta name="ContentStart"> and <meta name="ContentEnd">. Each appears twice, which accounts for the four missing content values.

I have no idea how rare this edge case is, but I figured I'd give you a heads-up. Let me know if you have any questions, or if there's anything else I can do to help.
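For anyone wanting to check another page for the same edge case, here is a small sketch that lists meta tags carrying one of the four key attributes but no content attribute. The HTML snippet is a hypothetical reduction of the page above, not the real document.

```r
library(xml2)

# Hypothetical reduction of the page: ContentStart/ContentEnd have no `content`.
html <- '<html><head>
<meta name="Maketime" content="2017-12-21 08:47:04">
</head><body><div id="zoom">
<meta name="ContentStart">
<p>text</p>
<meta name="ContentEnd">
</div></body></html>'
doc <- read_html(html)

# XPath: has one of the key attributes AND lacks @content.
no_content <- xml_find_all(
  doc,
  ".//meta[(@http-equiv or @itemprop or @name or @property) and not(@content)]"
)

# Names of the offending tags.
print(xml_attr(no_content, "name"))
```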

ChrisMuir commented 6 years ago

Oh, and forgot sessionInfo():

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16       prettyunits_1.0.2  assertthat_0.2.0   carbondater_0.1.0  R6_2.2.2           magrittr_1.5       RApiDatetime_0.0.3 httr_1.3.1        
 [9] stringi_1.1.7      curl_3.2           xml2_1.2.0         urltools_1.7.0     tools_3.5.0        triebeard_0.3.0    anytime_0.3.0      yaml_2.1.19       
[17] compiler_3.5.0     rvest_0.3.2

hrbrmstr commented 6 years ago

Wow! That's a rly rly rly helpful bug report!

Aye, I still need to give credit to the Python code that inspired this (tho it's hard to do so since they violate ToS on a lot of sites in their module).

It occurred to me late today that there are abt 4 other places where some serious exception handling needs to take place, so this rly helps triage one of them.

ty!
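As a rough illustration of the kind of exception handling mentioned above (a sketch only, not the package's code), the data.frame() call could be wrapped in tryCatch() so that a length mismatch produces a warning and an empty result instead of aborting. safe_meta_frame() is a hypothetical name.

```r
# Hypothetical wrapper: build the tag/value frame, falling back to an
# empty frame (with a warning) if data.frame() fails.
safe_meta_frame <- function(mtag, mval) {
  tryCatch(
    data.frame(mtag = mtag, mval = mval, stringsAsFactors = FALSE),
    error = function(e) {
      warning("could not pair meta tags: ", conditionMessage(e), call. = FALSE)
      data.frame(mtag = character(0), mval = character(0),
                 stringsAsFactors = FALSE)
    }
  )
}

# Matching lengths: returns a two-row data frame.
ok <- safe_meta_frame(c("pubdate", "author"), c("2017-12-21", "someone"))

# Mismatched lengths (3 vs 2): warns and returns zero rows.
bad <- suppressWarnings(safe_meta_frame(c("a", "b", "c"), c("x", "y")))
```

Note that data.frame() silently recycles the shorter vector when one length is a multiple of the other, so a wrapper like this only catches mismatches such as 25 vs 21; aligning the vectors at extraction time is the more robust fix.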

ChrisMuir commented 6 years ago

Sure thing, happy to help! Very cool package, this will come in handy for work stuffs. I'll have to check out the Python version.