mannau / tm.plugin.webmining

Retrieve structured, textual data from various web sources.

No content in webcorpus #13

Status: Closed (Yusuf-Demiryurek closed this issue 8 years ago)

Yusuf-Demiryurek commented 8 years ago

Hi,

When I use WebCorpus, for example:

lv.googlenews <- WebCorpus(GoogleNewsSource("Microsoft"))

every article comes back with no content:

[[150]]
<>
Metadata: 7
Content: chars: 0

Thanks

mannau commented 8 years ago

Hi, the snippet returns a corpus with 100 elements (I don't know why you are trying to access element 150).

> library(tm)
> library(tm.plugin.webmining)
> googlenews <- WebCorpus(GoogleNewsSource("Microsoft"))
> googlenews
# <<WebCorpus>>
# Metadata:  corpus specific: 3, document level (indexed): 0
# Content:  documents: 100

However, some content downloads fail, so not every item contains content:

> out <- sapply(googlenews, function(x) nchar(content(x)))
> names(out) <- NULL
> out
  [1] 10000  2411  1509  3332  6718  2176  3246   462  4254  5126  2586  1856
 [13]  5758  2092  4092  3225  5060   968  5390  2573  3145  3088  2690  3750
 [25]  2116  3588  1294  3516  1581   935     0  1260  1499  4715  1509  1565
 [37]  1850  3069  2249  2068  3195  2082  5527  8293  5041  2365 14638  5207
 [49]  2773   690   702  1293  2663  5861  1713 10006  6215  1112  1049  5883
 [61]     0  1625  3561  2240  3999  9205  3078  1835  1682  2537     0   767
 [73]  2149  3616   767     0 13036   767  6297     0  4429  8500   685  3819
 [85]     0     0  8411   778  3957  5063     0  2800  1188  2522  4622  3135
 [97]   767     0  7040     0
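
To make the output above easier to read, here is a minimal sketch that locates the empty documents. It uses only base R and a copy of the character counts printed above, so it runs without a network connection. The name `empty_idx` is introduced here for illustration.

```r
# Character counts per document, copied from the output above.
out <- c(
  10000, 2411, 1509, 3332, 6718, 2176, 3246,  462, 4254, 5126,  2586, 1856,
   5758, 2092, 4092, 3225, 5060,  968, 5390, 2573, 3145, 3088,  2690, 3750,
   2116, 3588, 1294, 3516, 1581,  935,    0, 1260, 1499, 4715,  1509, 1565,
   1850, 3069, 2249, 2068, 3195, 2082, 5527, 8293, 5041, 2365, 14638, 5207,
   2773,  690,  702, 1293, 2663, 5861, 1713, 10006, 6215, 1112, 1049, 5883,
      0, 1625, 3561, 2240, 3999, 9205, 3078, 1835, 1682, 2537,     0,  767,
   2149, 3616,  767,    0, 13036, 767, 6297,    0, 4429, 8500,   685, 3819,
      0,    0, 8411,  778, 3957, 5063,    0, 2800, 1188, 2522,  4622, 3135,
    767,    0, 7040,    0
)

# Positions of documents whose content has zero characters.
empty_idx <- which(out == 0)

length(out)       # 100 documents in total
empty_idx         # 31 61 71 76 80 85 86 91 98 100
length(empty_idx) # 10 documents are empty

# Assumption: the corpus supports logical subsetting like other tm corpora,
# in which case the non-empty documents could be kept with:
# googlenews_nonempty <- googlenews[out > 0]
```

In this run, 10 of the 100 documents are empty and the other 90 were downloaded with content.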

So this is not a bug, but you could try tuning the download parameters, such as the chunk size.
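
The thread does not show the package's own download parameter names, so check the tm.plugin.webmining documentation for those. As a generic illustration of handling empty items, here is a rough sketch of retrying failed downloads in base R. Everything in it is hypothetical and not part of the package: `retry_empty`, `fake_fetch`, and the example.com URLs are stand-ins.

```r
# Hypothetical helper (not part of tm.plugin.webmining):
# re-fetch every text that is still empty, up to max_tries rounds.
retry_empty <- function(texts, urls, fetch_fun, max_tries = 3, sleep_time = 0) {
  for (attempt in seq_len(max_tries)) {
    empty <- which(nchar(texts) == 0)
    if (length(empty) == 0) break          # nothing left to retry
    for (i in empty) {
      # A failed download leaves the element empty instead of stopping the loop.
      texts[i] <- tryCatch(fetch_fun(urls[i]), error = function(e) "")
    }
    if (sleep_time > 0) Sys.sleep(sleep_time)  # optional pause between rounds
  }
  texts
}

# Stand-in downloader for demonstration only: one URL always fails.
fake_fetch <- function(url) {
  if (url == "http://example.com/broken") stop("download failed")
  paste("content of", url)
}

texts <- c("already downloaded", "", "")
urls  <- c("http://example.com/a", "http://example.com/b", "http://example.com/broken")

result <- retry_empty(texts, urls, fake_fetch)
nchar(result) > 0   # TRUE TRUE FALSE
```

The second item is recovered on retry, while the third stays empty because its download keeps failing, which matches the situation above where some items never receive content.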

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-1                NLP_0.1-7               tm.plugin.webmining_1.3

loaded via a namespace (and not attached):
[1] parallel_3.2.2    tools_3.2.2       RCurl_1.95-4.7    slam_0.1-32      
[5] RJSONIO_1.3-0     rJava_0.9-6       boilerpipeR_1.3.1 bitops_1.0-6     
[9] XML_3.98-1.2