Closed koheiw closed 6 years ago
The code in my private repo looks like this:
#' extract texts and meta data from Nexis HTML files
#'
#' This extracts headings, body texts and meta data (date, byline, length,
#' section, edition) from items in HTML files downloaded by the scraper.
#' @param path either a path to an HTML file or a directory that contains HTML files
#' @param paragraph_separator a character to separate paragraphs in body texts.
#' @param language_date a character to specify language-dependent date format.
#' @param raw_date return date of publication without parsing if \code{TRUE}.
#' @export
#' @examples
#' irt <- import_nexis('tests/html/irish-times_1995-06-12_0001.html')
#' afp <- import_nexis('tests/html/afp_2013-03-12_0501.html')
#' gur <- import_nexis('tests/html/guardian_1986-01-01_0001.html')
#' sun <- import_nexis('tests/html/sun_2000-11-01_0001.html')
#' spg <- import_nexis('tests/html/spiegel_2012-02-01_0001.html', language_date = 'german')
#' all <- import_nexis('tests/html', raw_date = TRUE)
import_nexis <- function(path, paragraph_separator = '|', language_date = c('english', 'german'), raw_date = FALSE){
}
Is there any special marker for the Nexis HTML format?
We have a similar problem with Twitter JSON and generic JSON. Either we detect the type from metadata or distinguishing markers in the files, or we add an option called `source` that can take values appropriate to the type of file (with a lot of checks to prohibit invalid combinations).
So it would be
readtext("your-nexis-file.html", source = "nexis")
but then the signature would be a bit inconsistent, such as:
readtext(x, ..., source = c("auto", "twitter", "nexis", allothervalues))
where specifying the value for `source` manually would override the auto-type detection based on filename extensions.
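A minimal sketch of how that override could work, assuming a hypothetical helper called resolve_source() (it is not part of readtext, and the set of allowed extensions per source is only illustrative). With source = "auto" it falls back to the filename extension; with an explicit source it checks that the combination is valid and then uses that source.

```r
# Hypothetical helper, NOT part of readtext: decides which importer to use.
resolve_source <- function(file, source = c("auto", "twitter", "nexis")) {
  source <- match.arg(source)
  ext <- tolower(tools::file_ext(file))
  if (source == "auto") {
    # No explicit source given: fall back to the filename extension.
    return(ext)
  }
  # Explicit source: prohibit invalid source/extension combinations.
  # The extensions listed here are assumptions made for illustration.
  allowed <- list(twitter = "json", nexis = c("html", "htm"))
  if (!ext %in% allowed[[source]]) {
    stop("source = '", source, "' cannot be used with a .", ext, " file")
  }
  source
}

auto_type  <- resolve_source("your-nexis-file.html")
nexis_type <- resolve_source("your-nexis-file.html", source = "nexis")
```

Here auto_type is decided by the extension alone, while nexis_type comes from the manual value, which is the override described above.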
They are messy legacy HTML files without reliable markers.
London_Times2018-02-06_14-04.zip
The `source` option seems to work, passing other arguments via `...` to the underlying importer.
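Roughly, the forwarding could look like the sketch below. Both functions are hypothetical: import_nexis_stub() only stands in for the real importer, whose body is not shown in this thread, and read_with_source() is an illustrative wrapper rather than the actual readtext() implementation.

```r
# Stand-in for the real importer; it just echoes the arguments it receives.
import_nexis_stub <- function(path, paragraph_separator = "|", raw_date = FALSE) {
  list(path = path, paragraph_separator = paragraph_separator, raw_date = raw_date)
}

# Illustrative wrapper: anything in ... is handed on to the importer untouched.
read_with_source <- function(file, ..., source = c("auto", "nexis")) {
  source <- match.arg(source)
  if (source == "nexis") {
    return(import_nexis_stub(file, ...))
  }
  stop("automatic detection is not implemented in this sketch")
}

res <- read_with_source("your-nexis-file.html", raw_date = TRUE, source = "nexis")
```

Because source comes after ... in the signature, it has to be given by name, which matches the call form readtext("your-nexis-file.html", source = "nexis").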
The Nexis database is a popular source of texts. People can download news articles in HTML format, but they need a tool to import the texts. There is a plugin for tm, so why does readtext not have the same function? https://cran.r-project.org/web/packages/tm.plugin.lexisnexis/index.html
I have code to offer, but I am not sure how to tell `readtext()` that the HTML files are from Nexis.