Different `dataset_id` could link to the same dataset #59

Open hongyuanjia opened 1 year ago

hongyuanjia commented 1 year ago

`dataset_id` cannot be used as a unique identifier for a dataset, because it is specific to a data node. This causes no problems for `esgf_query()`, but it does produce duplicated entries in the results of `init_cmip6_index()` when `replica` is set to `TRUE`. `dataset_pid` should be used as the unique dataset identifier when building the index. In the example below, the query returns four rows that all share a single `dataset_pid`.

q <- epwshiftr::esgf_query(
    activity = "ScenarioMIP",
    variable = "tas",
    frequency = "day",
    experiment = "ssp585",
    source = "AWI-CM-1-1-MR",
    variant = "r1i1p1f1",
    replica = TRUE,
    latest = TRUE,
    resolution = "100 km",
    limit = 10000L,
    data_node = NULL
)

q[, .(dataset_id, dataset_pid)]
#>                                                                                        dataset_id
#> 1:|
#> 2:|
#> 3:|
#> 4:|
#>                                          dataset_pid
#> 1: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
#> 2: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
#> 3: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8
#> 4: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8

unique(q[, -c("dataset_id", "data_node")])
#>    mip_era activity_drs institution_id     source_id experiment_id member_id
#> 1:   CMIP6  ScenarioMIP            AWI AWI-CM-1-1-MR        ssp585  r1i1p1f1
#>    table_id frequency grid_label  version nominal_resolution variable_id
#> 1:      day       day         gn 20190529             100 km         tas
#>              variable_long_name variable_units
#> 1: Near-Surface Air Temperature              K
#>                                          dataset_pid
#> 1: hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8

Created on 2022-09-19 with reprex v2.0.2
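
A minimal sketch of the proposed fix, assuming the index can be treated as a plain data frame with one row per data-node copy. The `dataset_id` and `data_node` values and the second PID are made-up placeholders, not real ESGF output, and this is not the actual `init_cmip6_index()` code.

```r
# Toy index: one dataset replicated on three data nodes, plus one other dataset.
# All dataset_id / data_node values and the second PID are hypothetical.
idx <- data.frame(
    dataset_id = c(
        "ds-A|node1.example.org", "ds-A|node2.example.org",
        "ds-A|node3.example.org", "ds-B|node1.example.org"
    ),
    data_node = c(
        "node1.example.org", "node2.example.org",
        "node3.example.org", "node1.example.org"
    ),
    dataset_pid = c(
        rep("hdl:21.14100/a336f13f-a4d3-3b57-a45a-8f27f0ba01b8", 3L),
        "hdl:00.00000/hypothetical-pid"
    ),
    stringsAsFactors = FALSE
)

# dataset_id differs per data node, so it cannot reveal the replicas
anyDuplicated(idx$dataset_id)
#> [1] 0

# Keep only the first row seen for each dataset_pid
idx_unique <- idx[!duplicated(idx$dataset_pid), ]
nrow(idx_unique)
#> [1] 2
```

Since `q` above is a data.table, `unique(q, by = "dataset_pid")` should do the same thing there. Which replica to keep (first row, preferred data node, etc.) is a separate choice that this sketch does not address.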

hongyuanjia commented 1 year ago

Ref: [Identifiers](Returned Metadata Fields)