poseidon-framework / community-archive

The Poseidon Community Archive (PCA)
https://www.poseidon-adna.org/#/archive_overview

Missing Date_Types in 12 packages #97

Open stschiff opened 2 years ago

stschiff commented 2 years ago

Lots of packages contain missing Date_Types in the Janno file. In my view, we should be able to fill a lot of those easily:

a. If there are entries in the C14-type columns, set Date_Type to C14.
b. If there are entries in the calibrated columns, but not in the C14 columns, set Date_Type to contextual.
c. If the samples are modern, set Date_Type to modern.
d. If the sample is ancient but has no date at all, keep n/a for now. We should still fill those soon, at least with a contextual range, which should always be possible from a look into the paper.
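As a rough illustration of rules a. to d., here is a minimal Python sketch. It assumes each Janno row is available as a dict keyed by column name, that missing values are encoded as `n/a` or an empty string, and that the caller supplies a hypothetical `is_modern` flag, since nothing in the thread says how modern samples could be detected from the file itself. The name `suggest_date_type` is made up for this sketch and is not part of trident or poseidonR. Only `Date_Type`, `Date_C14_Uncal_BP` and `Date_BC_AD_Median` are consulted; a real fill-in would probably look at the other C14 and calibrated columns as well.

```python
MISSING = {"", "n/a"}  # assumed encodings of a missing value


def has_value(row, column):
    """True if the column holds something other than a missing marker."""
    value = row.get(column)
    return value is not None and str(value).strip() not in MISSING


def suggest_date_type(row, is_modern=False):
    """Suggest a Date_Type for one row, checking rules a. to d. in order."""
    if has_value(row, "Date_Type"):
        return row["Date_Type"]  # already filled, leave untouched
    if has_value(row, "Date_C14_Uncal_BP"):
        return "C14"  # rule a.
    if has_value(row, "Date_BC_AD_Median"):
        return "contextual"  # rule b.
    if is_modern:  # hypothetical flag, decided by the caller
        return "modern"  # rule c.
    return "n/a"  # rule d.


row = {
    "Poseidon_ID": "sample1",
    "Date_Type": "n/a",
    "Date_C14_Uncal_BP": "n/a",
    "Date_BC_AD_Median": "-1200",
}
print(suggest_date_type(row))  # contextual
```

The rules are checked in the order they are listed, so a row that has both an uncalibrated C14 value and a calibrated median comes out as C14.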

published_data % trident list --individuals -d . -j Date_Type --raw | awk '$4 == "n/a"' | cut -f1 | sort | uniq -c
   5 2020_Brunel_France
   1 2020_Cassidy_IrishDynastic
  12 2020_Furtwaengler_Switzerland
  20 2020_Nakatsuka_SouthPatagonia
  30 2020_Ning_China
   1 2020_Wang_subSaharanAfrica
  24 2020_Yang_China
  40 2021_Kilinc_northeastAsia
 826 2021_PattersonNature
  18 2021_Saag_EastEuropean
  22 2021_SaupeCurrBiol
 383 2021_Wang_EastAsia

nevrome commented 2 years ago

I'm slowly crawling out of my hole and thought I'd quickly take a peek at this. Dana and I concluded back then for #25 that there is unfortunately a lot of d. in the mix. This might have changed now, so let's see. c. is trivial (although I think there is no automatic way to find these samples, right?), so let's check a. and b.

a. should be an impossible state of the system, so it would surprise me if it exists:

https://github.com/poseidon-framework/poseidon-hs/blob/6be96d0a933b564cfa017471aedaa30a32a7ebd0/src/Poseidon/Janno.hs#L820-L831

I checked anyway:

library(magrittr) # provides the %>% pipe used below
janno <- poseidonR::read_janno("~/agora/published_data/")

### If there are entries in the C14-type columns, put Date_Type to C14.

janno_with_actual_C14_dates <- janno %>% dplyr::filter(
  # drop rows whose date is missing, i.e. for which one of the following applies
  !purrr::map_lgl(Date_C14_Uncal_BP, \(x) {
    is.null(x) ||           # date is NULL
      if (length(x) == 1) { # if there is exactly one date value
        is.na(x)            # date is NA
      } else {
        FALSE
      }
  })
)

janno_with_actual_C14_dates %>% nrow # 3606
janno %>% dplyr::filter(Date_Type == "C14") %>% nrow # 3607
janno_with_actual_C14_dates %>%
  dplyr::filter(is.na(Date_Type) | Date_Type != "C14") %>% nrow() # 0

So I think such a sample does indeed not exist. b. is a lot more likely.

### If there are entries in the calibrated columns, but not in the C14-columns, put Date_Type to contextual.

janno_with_result_dates <- janno %>% dplyr::filter(
  !is.na(Date_BC_AD_Median)
)

janno_potentially_contextual <- dplyr::anti_join(
  janno_with_result_dates,
  janno_with_actual_C14_dates,
  by = "Poseidon_ID"
)

janno_potentially_contextual %>%
  dplyr::filter(is.na(Date_Type) | Date_Type != "contextual") %>%
  nrow # 840 

OK! So we could automatically fill these 840 (826 of them from 2021_PattersonNature) with contextual. I fear this will often be factually incorrect, but it makes our DB consistent. We should also make sure that b. is caught by the validation and cannot emerge again in the future.
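To make the idea of catching b. concrete, here is a minimal Python sketch of such a check. It is not how trident validates packages. It only assumes that rows are dicts keyed by Janno column name and that missing values appear as `n/a` or an empty string. The function name `find_dated_rows_without_type` is hypothetical.

```python
MISSING = {"", "n/a"}  # assumed encodings of a missing value


def is_missing(value):
    """True if the value is absent or one of the assumed missing markers."""
    return value is None or str(value).strip() in MISSING


def find_dated_rows_without_type(rows):
    """Return the Poseidon_IDs of rows with a calibrated date but no Date_Type."""
    return [
        row.get("Poseidon_ID")
        for row in rows
        if not is_missing(row.get("Date_BC_AD_Median"))
        and is_missing(row.get("Date_Type"))
    ]


rows = [
    {"Poseidon_ID": "A", "Date_BC_AD_Median": "-1200", "Date_Type": "n/a"},
    {"Poseidon_ID": "B", "Date_BC_AD_Median": "-1200", "Date_Type": "contextual"},
    {"Poseidon_ID": "C", "Date_BC_AD_Median": "n/a", "Date_Type": "n/a"},
]
print(find_dated_rows_without_type(rows))  # ['A']
```

A validation step could fail whenever this list is non-empty, which would stop new rows in state b. from entering the archive.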

Btw. my brain is still pretty mushy so take this with a grain of salt.

stschiff commented 2 years ago

OK, good catch that 826 of the entries with calibrated dates but missing Date_Type are from Patterson. I think we should then open a separate issue to fill in the uncalibrated dates for these, as I think they must have C14-dated most if not all of their samples.