Closed Yusuf-Demiryurek closed 8 years ago
Hi, the snippet returns a corpus with 100 elements (I don't know why you are trying to access element 150).
> library(tm)
> library(tm.plugin.webmining)
> googlenews <- WebCorpus(GoogleNewsSource("Microsoft"))
> googlenews
# <<WebCorpus>>
# Metadata: corpus specific: 3, document level (indexed): 0
# Content: documents: 100
However, due to content download problems, not all items contain content:
> out <- sapply(googlenews, function(x) nchar(content(x)))
> names(out) <- NULL
> out
[1] 10000 2411 1509 3332 6718 2176 3246 462 4254 5126 2586 1856
[13] 5758 2092 4092 3225 5060 968 5390 2573 3145 3088 2690 3750
[25] 2116 3588 1294 3516 1581 935 0 1260 1499 4715 1509 1565
[37] 1850 3069 2249 2068 3195 2082 5527 8293 5041 2365 14638 5207
[49] 2773 690 702 1293 2663 5861 1713 10006 6215 1112 1049 5883
[61] 0 1625 3561 2240 3999 9205 3078 1835 1682 2537 0 767
[73] 2149 3616 767 0 13036 767 6297 0 4429 8500 685 3819
[85] 0 0 8411 778 3957 5063 0 2800 1188 2522 4622 3135
[97] 767 0 7040 0
So this is not a bug, but you could try tuning the download parameters (chunk sizes, etc.).
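If it helps, the character counts above can also be used to locate the empty documents and drop them before further processing. This is only a minimal sketch: it uses a small toy corpus built with `VCorpus(VectorSource(...))` as a stand-in for the downloaded `googlenews` object, and the names `docs`, `n_chars`, `empty_idx` and `nonempty` are made up for illustration.

```r
library(tm)

# Toy stand-in for the downloaded corpus (assumption: two documents
# with text, two whose download returned nothing).
docs <- c("Microsoft announces results", "", "Another article body", "")
corpus <- VCorpus(VectorSource(docs))

# Character count per document, same idea as the sapply() call above.
n_chars <- sapply(corpus, function(x) sum(nchar(content(x))))
names(n_chars) <- NULL

# Positions of the documents that ended up without any text.
empty_idx <- which(n_chars == 0)

# Keep only the documents that actually have content.
nonempty <- corpus[n_chars > 0]
length(nonempty)
```

With the real corpus, the same logical subsetting should apply to `googlenews` directly, assuming it behaves like a regular tm corpus under `[`.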
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-1 NLP_0.1-7 tm.plugin.webmining_1.3
loaded via a namespace (and not attached):
[1] parallel_3.2.2 tools_3.2.2 RCurl_1.95-4.7 slam_0.1-32
[5] RJSONIO_1.3-0 rJava_0.9-6 boilerpipeR_1.3.1 bitops_1.0-6
[9] XML_3.98-1.2
Hi,
When I use WebCorpus, for example:
lv.googlenews <- WebCorpus(GoogleNewsSource("Microsoft"))
there is no content in any of the articles:
[[150]]
<>
Metadata: 7
Content: chars: 0
Thanks