Open ChrisMuir opened 6 years ago
Oh, and forgot sessionInfo():
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 prettyunits_1.0.2 assertthat_0.2.0 carbondater_0.1.0 R6_2.2.2 magrittr_1.5 RApiDatetime_0.0.3 httr_1.3.1
[9] stringi_1.1.7 curl_3.2 xml2_1.2.0 urltools_1.7.0 tools_3.5.0 triebeard_0.3.0 anytime_0.3.0 yaml_2.1.19
[17] compiler_3.5.0 rvest_0.3.2
Wow! That's a rly rly rly helpful bug report!
Aye, I still need to give credit to the python code that inspired this (tho it's hard to do so since they violate ToS on a lot of sites in their module).
It occurred to me late today that there are abt 4 other places where some severe exception handling needs to take place, so this rly helps triage one of them.
Sure thing, happy to help! Very cool package, this will come in handy for work stuffs. I'll have to check out the Python version.
Hi Bob,
Edit to add: I know this is a super new package and didn't expect everything to work perfectly, this issue is just to give you a heads up on the error!
Just checking this pkg out, and am running into an error. I tried running carbondater::carbondate() on this site: http://www.sdfda.gov.cn/art/2017/12/21/art_3715_190173.html (it's a page from the gov website of the Shandong province in China). The function fails with:
After looking into it some, what's causing the issue is that, within the site content, there are four meta tags that have a name attribute but do NOT have a content attribute. So within function get_earliest_pubdate(), data.frame() is being passed 25 mtag values but only 21 mval values.
Here's some minimal code to reproduce the lengths disparity:
And here's the page content, printed as a char vector:
The problematic tags are <meta name="ContentStart"> and <meta name="ContentEnd">; they each appear twice.
I have no idea how rare this edge case is, but I figured I'd give you a heads up. Let me know if you have any questions, or if there's anything else I can do to help.
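For anyone hitting the same thing, here is a rough sketch of how a name/content length mismatch like this can arise, plus one possible guard. It is not the package's actual code: the HTML string is a made-up stand-in for the page, and mval_short, mismatch and meta_df are hypothetical names. It assumes xml2 is installed.

```r
library(xml2)

# Hypothetical stand-in for the page: five named meta tags, two lacking content
html <- '<html><head>
  <meta name="author" content="someone">
  <meta name="ContentStart">
  <meta name="pubdate" content="2017-12-21">
  <meta name="ContentEnd">
  <meta name="keywords" content="a,b">
</head><body></body></html>'

doc <- read_html(html)
meta_nodes <- xml_find_all(doc, "//meta[@name]")

# One name per tag
mtag <- xml_attr(meta_nodes, "name")

# Selecting the content attribute nodes directly skips tags that have none,
# so this vector ends up shorter than mtag
mval_short <- xml_text(xml_find_all(doc, "//meta[@name]/@content"))

# data.frame() errors when column lengths differ and cannot be recycled
mismatch <- tryCatch(
  data.frame(mtag = mtag, mval = mval_short, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)

# Possible guard: xml_attr() yields NA for a missing attribute, which keeps
# both vectors the same length; rows without content can then be dropped
mval <- xml_attr(meta_nodes, "content")
meta_df <- data.frame(mtag = mtag, mval = mval, stringsAsFactors = FALSE)
meta_df <- meta_df[!is.na(meta_df$mval), ]
```

Pulling both attributes from the same node set is just one option; filtering the nodes up front with an XPath such as //meta[@name and @content] would likely work as well.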