pbiecek / archivist

A set of tools for datasets and plots archiving
http://pbiecek.github.io/archivist/

Web Scraping with R #326

Closed chitemerere closed 6 years ago

chitemerere commented 6 years ago

I am trying to scrape a table that spans multiple pages from the web with R, using the following code:

library(XML)
library(RCurl)
library(plyr)
curlVersion()$features
curlVersion()$protocol

fetchAllData <- function(page) {
  temp <- paste0("https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-", page, "-hs-code.html")
  data <- readHTMLTable(temp, stringsAsFactors = FALSE)
  data <- readHTMLTable(temp)
  frMW <- data.frame(data)
}

fetchAll <- ldply(1:4, fetchAllData, .progress="text")

View(fetchAll)

I get the following error message:

Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
  Results must be all atomic, or all data frames
In addition: Warning messages:
1: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-1-hs-code.html'
2: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-1-hs-code.html'
3: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-2-hs-code.html'
4: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-2-hs-code.html'
5: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-3-hs-code.html'
6: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-3-hs-code.html'
7: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-4-hs-code.html'
8: XML content does not seem to be XML: 'https://www.zauba.com/export-trimethoprim/fp-zimbabwe/p-4-hs-code.html'

Please assist

Regards

pbiecek commented 6 years ago

This is not related to archivist. You may be interested in the harvest package.
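
For readers who land here with the same error, a rough sketch of one possible approach follows. It assumes the package meant above is rvest (an assumption, the thread does not confirm it). The warnings in the question are likely caused by the XML package being unable to fetch https URLs itself, so the URL string gets treated as document content. The helper name first_table and the inline sample table are made up for illustration and are not from the original post.

```r
library(rvest)

# Hypothetical helper: return the first <table> of a parsed page as a
# data frame, or NULL when the page contains no table.
first_table <- function(html) {
  tables <- html_table(html_elements(html, "table"))
  if (length(tables) == 0) return(NULL)
  as.data.frame(tables[[1]])
}

# Inline sample page standing in for one page of results (made-up data),
# so the sketch runs without network access.
sample_page <- '<html><body><table>
  <tr><th>Date</th><th>Quantity</th></tr>
  <tr><td>2016-01-01</td><td>10</td></tr>
  <tr><td>2016-01-02</td><td>20</td></tr>
</table></body></html>'

# Parse each "page", extract its table, drop pages without one,
# and stack the remaining data frames row-wise.
pages <- list(read_html(sample_page), read_html(sample_page))
parsed <- lapply(pages, first_table)
combined <- do.call(rbind, Filter(Negate(is.null), parsed))
```

Against the real site, one would presumably replace the inline sample with read_html() called on each page URL built by paste0() as in the question. Whether the site still serves those pages or permits scraping is not something this sketch can confirm.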