quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Add support for Nexis HTML #115

Closed koheiw closed 6 years ago

koheiw commented 6 years ago

The Nexis database is a popular source of texts. People can download news articles in HTML format, but they need a tool to import the texts. There is a plugin for tm, so why does readtext not have the same function? https://cran.r-project.org/web/packages/tm.plugin.lexisnexis/index.html

I have code to offer, but I am not sure how to tell readtext() that the HTML files are from Nexis.

koheiw commented 6 years ago

The code in my private repo looks like this:

#' Extract texts and metadata from Nexis HTML files
#'
#' This extracts headings, body texts and metadata (date, byline, length,
#' section, edition) from items in HTML files downloaded by the scraper.
#' @param path either a path to an HTML file or a directory that contains HTML files
#' @param paragraph_separator a character to separate paragraphs in body texts.
#' @param language_date a character to specify the language-dependent date format.
#' @param raw_date return date of publication without parsing if \code{TRUE}.
#' @export
#' @examples
#' irt <- import_nexis('tests/html/irish-times_1995-06-12_0001.html')
#' afp <- import_nexis('tests/html/afp_2013-03-12_0501.html')
#' gur <- import_nexis('tests/html/guardian_1986-01-01_0001.html')
#' sun <- import_nexis('tests/html/sun_2000-11-01_0001.html')
#' spg <- import_nexis('tests/html/spiegel_2012-02-01_0001.html', language_date = 'german')
#' all <- import_nexis('tests/html', raw_date = TRUE)
import_nexis <- function(path, paragraph_separator = '|', language_date = c('english', 'german'), raw_date = FALSE){
}

https://github.com/koheiw/Nexis/blob/master/R/importer.R

kbenoit commented 6 years ago

Is there any special marker for the Nexis HTML format?

We have a similar problem with Twitter JSON and generic JSON. Either we detect the type from metadata or distinguishing markers in the files, or we add an option called source that can take values appropriate to the type of file (with a lot of checks to prohibit invalid combinations).

So it would be

readtext("your-nexis-file.html", source = "nexis")

but then the signature would be a bit inconsistent, such as:

readtext(x, ..., source = c("auto", "twitter", "nexis", allothervalues))

where specifying the value for source manually would override the auto-type detection based on filename extensions.
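
A minimal sketch of how a source argument could override extension-based detection, including one check against invalid combinations. The helper name detect_source and the table of allowed extensions are hypothetical and are not readtext's actual internals:

```r
# Hypothetical helper: resolve which importer to use for a file.
# An explicit `source` overrides detection from the filename extension.
detect_source <- function(file, source = c("auto", "twitter", "nexis")) {
  source <- match.arg(source)
  ext <- tolower(tools::file_ext(file))
  if (source == "auto") {
    # No override given: fall back to the filename extension
    return(ext)
  }
  # Assumed mapping of source types to the extensions they accept
  valid_ext <- list(twitter = "json", nexis = c("html", "htm"))
  if (!ext %in% valid_ext[[source]]) {
    stop("source = '", source, "' cannot be used with a .", ext, " file")
  }
  source
}

detect_source("your-nexis-file.html")                    # "html"
detect_source("your-nexis-file.html", source = "nexis")  # "nexis"
```

Because source comes after ... in the proposed signature, callers would always have to name it in full, which keeps it from being confused with a file path.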

koheiw commented 6 years ago

They are messy legacy HTML files without reliable markers (example attached: London_Times2018-02-06_14-04.zip). The source option seems to work, passing other arguments via ... to the underlying importer.
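
A rough sketch of that forwarding, assuming a stand-in importer. Both import_nexis_stub and read_any are hypothetical names used only for illustration; the stub merely records its arguments, whereas the real importer parses the HTML:

```r
# Hypothetical stand-in for the Nexis importer: it only records the
# arguments it receives instead of parsing any HTML.
import_nexis_stub <- function(path, paragraph_separator = "|", raw_date = FALSE) {
  list(path = path,
       paragraph_separator = paragraph_separator,
       raw_date = raw_date)
}

# Hypothetical front end: `source` selects the importer and everything
# in `...` is passed through to it unchanged.
read_any <- function(file, ..., source = c("auto", "nexis")) {
  source <- match.arg(source)
  if (source == "nexis") {
    return(import_nexis_stub(file, ...))
  }
  stop("only the 'nexis' branch is sketched here")
}

res <- read_any("your-nexis-file.html", source = "nexis", raw_date = TRUE)
res$raw_date  # TRUE, forwarded through `...`
```

Importer-specific options such as paragraph_separator or raw_date then need no entry in the front end's own signature.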