ropensci / c14bazAAR

R Package - Download and Prepare C14 Dates from Different Source Databases
https://docs.ropensci.org/c14bazAAR
GNU General Public License v2.0

IntChron parser #115

Open joeroe opened 4 years ago

joeroe commented 4 years ago

IntChron <https://intchron.org/> is listed in #2 as not being included because it requires web scraping. However, after spending some time playing with it, I wonder if this might be revisited.

Essentially, IntChron seems to do the same thing as c14bazAAR (systematically compile dates from existing databases), but with a web-based API. An IntChron parser would be more complicated than the existing parsers because, as far as I can tell, there is no way to extract the entire database as a single file. But it should still be possible to get it without resorting to web scraping. The key is that every HTML page on IntChron can also be accessed in csv, json, or txt format. This includes the "index" pages that eventually lead you to individual date records. I think it could be worth the extra complexity for IntChron because it does seem to include a lot of dates (for example the entire ORAU database) and it's backed by the Oxford C14 Lab, so it's likely to grow over time.

I can think of a few ways you could approach this, depending on how much flexibility you want to give the user. At the simplest, one could implement a multi-stage parser in c14bazAAR:

  1. Retrieve the list of "hosts" (https://intchron.org/host.csv)
  2. Retrieve the list of records-by-country for each host (e.g. https://intchron.org/oxa/record.csv)
  3. Retrieve the list of sites for each country (e.g. https://intchron.org/record/oxa/Jordan.csv)
  4. Retrieve the list of dates for each site (e.g. https://intchron.org/record/oxa/Jordan/Dhuweila.csv)
  5. Parse and collate the dates (actually quite easy because the IntChron format is similar to c14bazAAR's)
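
A minimal sketch of how the multi-stage retrieval above could be structured in base R. The helper names `intchron_csv_url()` and `crawl()` are hypothetical, and `fetch` is an assumed placeholder for whatever function downloads and parses a page; the real index pages would need their own parsing to extract child URLs. The example walks an offline stand-in rather than the live site.

```r
# Build the .csv URL for a page, following the pattern of the example URLs above.
intchron_csv_url <- function(..., base = "https://intchron.org") {
  paste0(paste(c(base, ...), collapse = "/"), ".csv")
}

# Walk the index tree depth-first and row-bind the date tables at the leaves.
# `fetch` is a placeholder: it must return a data frame for a site page and a
# character vector of child URLs for an index page.
crawl <- function(url, fetch) {
  res <- fetch(url)
  if (is.data.frame(res)) {
    return(res)
  }
  do.call(rbind, lapply(res, crawl, fetch = fetch))
}

# Offline stand-in for the real pages, for illustration only.
pages <- list(
  host  = c("siteA", "siteB"),
  siteA = data.frame(labcode = "X-1", stringsAsFactors = FALSE),
  siteB = data.frame(labcode = "X-2", stringsAsFactors = FALSE)
)
dates <- crawl("host", function(url) pages[[url]])
```

Passing `fetch` in as an argument keeps the tree-walking logic separate from the HTTP and parsing details, so it can be exercised without network access.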

On the other end of the spectrum, one could write an R interface to IntChron as its own package, which c14bazAAR could then use as a dependency to retrieve either the entire database or a user-specified subset. That could be worthwhile if the IntChron standard does become widely used, but as things stand I'm not sure that it's worth the extra effort.

I'd be happy to put some work into this, but I thought I would first raise the issue and ask whether you think it is something that fits into c14bazAAR, and what the best approach to doing it might be.

nevrome commented 4 years ago

Very cool! - I was not aware of this option.

This does sound like a case for a separate package, because the data is not as monolithic as for most of the other "databases" (tables) in c14bazAAR. But writing a parser that simply collects everything may be a good first step in that direction, as you can ignore the user input for now and nail down the tree merge algorithm first.

A PR would be very welcome! ORAU is extremely juicy.

joeroe commented 4 years ago

@nevrome That was my thinking too. I have a rough parser at joeroe/c14bazAAR/tree/intchron. It does seem to be worth it – crawling the full database returns over 11,000 dates, most of which are new for c14bazAAR:

```r
intchron <- get_intchron("https://intchron.org/host")
# Or to save time:
# load("playground/intchron-cache-20201009.Rd")
length(unique(intchron$labcode))
#> [1] 11613

all <- get_c14data("all")
sum(!intchron$labcode %in% all$labnr)
#> [1] 9882
```

But it's extremely slow. Getting the whole database took about an hour on my fast university connection, because we have to make roughly 2000 separate HTTP requests.

So I'm thinking that splitting this off to its own package is a good idea after all. That way you could provide functions for getting subsets of the full IntChron database (e.g. by host/source, by country) and encourage the user to use that granularity in the c14bazAAR parser. Some sort of caching might also help.
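
As a rough illustration of the caching idea, here is a sketch of an in-memory cache that wraps any function taking a URL, so each URL is only requested once per R session. The name `with_cache()` is hypothetical, and the `slow_fetch()` stand-in only counts calls instead of making real HTTP requests.

```r
# Wrap a fetch function so that repeated calls with the same URL reuse the
# stored result instead of calling `fetch` again.
with_cache <- function(fetch) {
  cache <- new.env(parent = emptyenv())
  function(url) {
    if (!exists(url, envir = cache, inherits = FALSE)) {
      assign(url, fetch(url), envir = cache)
    }
    get(url, envir = cache, inherits = FALSE)
  }
}

# Stand-in for a slow download; counts how often it is actually called.
n_calls <- 0
slow_fetch <- function(url) {
  n_calls <<- n_calls + 1
  paste("contents of", url)
}

cached_fetch <- with_cache(slow_fetch)
first  <- cached_fetch("https://intchron.org/host.csv")
second <- cached_fetch("https://intchron.org/host.csv")  # served from the cache
```

This cache lives only as long as the session; persisting results to disk would be a separate design decision.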

nevrome commented 4 years ago

Alright - thanks for testing - excellent work! Downloading the whole thing is not feasible then, and a separate package for specific queries is clearly the way to go.

Maybe one solution to ensure the interoperability with c14bazAAR would be to use the c14_date_list data format for this new package?

joeroe commented 4 years ago

I've split the basic API interaction and querying off into its own package: joeroe/rintchron. I'll rewrite the parser on my intchron branch to use it instead. I also managed to get the time taken to retrieve the whole database down to 7 minutes (joeroe/rintchron#3), so I think we're close to it being viable to use as a normal c14bazAAR database, especially if there are separate parsers for ORAU, NCRF, etc.

nevrome commented 4 years ago

Great job! So we could go through intchron to get the data from different individual databases? We could write a parser function get_orau() which calls rintchron::intchron()?
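
A sketch of what the column-mapping part of such a parser might look like. The helper name `intchron_to_c14()` is hypothetical. On the IntChron side only `labcode` appears in this thread, so `r_date` and `r_date_sigma` are assumed column names that may need adjusting. The `get_orau()` wrapper is left commented out because the argument passed to `rintchron::intchron()` is also an assumption.

```r
# Rename IntChron columns to c14bazAAR variable names; other columns are kept.
# Assumption: `r_date` and `r_date_sigma` are the IntChron names for the
# uncalibrated age and its standard deviation.
intchron_to_c14 <- function(x,
                            map = c(labcode = "labnr",
                                    r_date = "c14age",
                                    r_date_sigma = "c14std")) {
  hit <- names(x) %in% names(map)
  names(x)[hit] <- unname(map[names(x)[hit]])
  x
}

# Illustrative input, not real data.
example <- data.frame(labcode = "X-1", r_date = 5000, r_date_sigma = 40,
                      stringsAsFactors = FALSE)
converted <- intchron_to_c14(example)

# Hypothetical wrapper; the "oxa" argument is an assumed way to select a host.
# get_orau <- function() {
#   c14bazAAR::as.c14_date_list(intchron_to_c14(rintchron::intchron("oxa")))
# }
```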

joeroe commented 4 years ago

I think that's the way to go, yeah.