ropenscilabs / deposits

R Client for access to multiple data repository services
https://docs.ropensci.org/deposits/

Marker metadata for tracking #43

Open noamross opened 1 year ago

noamross commented 1 year ago

Is it possible for us to include in the metadata some information that would allow us to search for records created by deposits? Perhaps with some opt-in mechanism like "Would you like to add the keyword deposits-client to make it possible for us to track records created by deposits" at deposition time? Ideally it's something in a minimally obtrusive or visible metadata field, but something we could find via the various search APIs across repositories.

mpadge commented 1 year ago

Yep, indeed. Here it is in action: https://github.com/ropenscilabs/deposits/blob/3b32f25b53ee9901e0d6d3878763e395669a2dd4/R/client-main.R#L296 We just need to settle on the precise keyword to be used for deposits rather than frictionless; "deposits" is sufficiently unambiguous, but perhaps not sufficiently informative? I imagine the procedure will be automatic, rather than opt-in. The data are still exposed to, and controllable by, users, so in my current view we only need to demonstrate the possibility of manually removing the keyword.

noamross commented 1 year ago

Cool. While I'd love to have high-coverage data, I'd want to be maximally transparent and opt-in with this.

Yes, "deposits" is probably not a great keyword. (This makes me wonder if another name would be better for the package.) That said, if there is already a "frictionlessdata" tag, perhaps we could place our marker only inside datapackage.json, and query that file in repositories with that tag. It would be more intensive but doable, and would avoid tag clutter that users might not want.
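As a rough illustration of the first step of that idea (finding records that carry the "frictionlessdata" tag), here is a minimal sketch that only builds a search URL and does not send any request. The function name `zenodo_keyword_query` is hypothetical, and both the endpoint `https://zenodo.org/api/records` and the `keywords:` field name in the query are assumptions that should be checked against the Zenodo API documentation.

```r
# Sketch only: builds a search URL; no request is sent.
# Assumptions (not from this thread): the endpoint below and the
# "keywords:" field name used in the query string.
zenodo_keyword_query <- function (keyword,
                                  base_url = "https://zenodo.org/api/records",
                                  size = 25L) {
    stopifnot (is.character (keyword), length (keyword) == 1L, nzchar (keyword))
    # Quote the keyword so it is matched as a phrase, then percent-encode it
    q <- utils::URLencode (paste0 ("keywords:\"", keyword, "\""), reserved = TRUE)
    paste0 (base_url, "?q=", q, "&size=", as.integer (size))
}

url <- zenodo_keyword_query ("frictionlessdata")
print (url)
```

Each record returned by such a query would then need its datapackage.json downloaded and inspected for the marker, which is the more intensive part.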

noamross commented 1 year ago

That would actually be a great example for a tutorial!

mpadge commented 1 year ago

The necessary precursor issue of keywords #36 is now done. Output copied here to demonstrate functionality needed for this issue. Keywords always have to be defined in "subject", not "description". The following code illustrates the new functionality, starting with what happens when "keywords" are defined in the wrong field:

```r
library (deposits)
packageVersion ("deposits")
#> [1] '0.1.0.53'
metadata <- list (
    title = "New Title",
    abstract = "This is the abstract",
    creator = list (list (name = "A. Person"), list (name = "B. Person")),
    description = paste0 (
        "This is the description\n\n",
        "## keywords\none, two\nthree\n\n## version\n1.0"
    )
)
cli <- depositsClient$new (service = "zenodo", metadata = metadata, sandbox = TRUE)
#> Error: Metadata source for [keywords] should be [subject] and not [description]
cli <- depositsClient$new (service = "figshare", metadata = metadata)
#> Error: Metadata source for [keywords] should be [subject] and not [description]
```

The error message for both services is sufficiently informative to know what to do next:

```r
metadata$description <- "This is the description\n\n## version\n1.0"
metadata$subject <- "## keywords\none, two\nthree"
cli <- depositsClient$new (service = "zenodo", metadata = metadata, sandbox = TRUE)
cli$deposit_new ()
#> ID of new deposit : 1177062
cli$hostdata$metadata$keywords
#> [[1]]
#> [1] "one"
#> 
#> [[2]]
#> [1] "two"
#> 
#> [[3]]
#> [1] "three"

cli <- depositsClient$new (service = "figshare", metadata = metadata)
cli$deposit_new ()
#> Files for private Figshare deposits can only be downloaded manually; no metadata can be retrieved for this deposit.
#> ID of new deposit : 22348531
cli$hostdata$tags
#> [1] "one"   "two"   "three"
```

Created on 2023-03-28 with reprex v2.0.2

Keywords are thus appropriately translated into service-specific terms, with the services themselves then returning their own representations. This issue then just needs optional or automatic insertion of a deposits-specific keyword, potentially alongside the "frictionlessdata" keyword illustrated in this Zenodo search query.
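One way the optional insertion could look is sketched below. This is not code from the package: `add_marker_keyword` and `keywords_to_subject` are hypothetical helpers, the marker "deposits-client" is only the keyword floated at the top of this issue, and the "## keywords" layout of the subject string simply follows the reprex above.

```r
# Hypothetical helper: append a marker keyword only when the user opts in.
add_marker_keyword <- function (keywords,
                                marker = "deposits-client",
                                opt_in = FALSE) {
    stopifnot (is.character (keywords), is.logical (opt_in), length (opt_in) == 1L)
    if (!opt_in) {
        return (keywords) # opted out: keywords are left untouched
    }
    unique (c (keywords, marker)) # unique() avoids adding the marker twice
}

# Hypothetical helper: rebuild a "subject" string in the layout used above.
keywords_to_subject <- function (keywords) {
    paste0 ("## keywords\n", paste (keywords, collapse = ", "))
}

kw <- add_marker_keyword (c ("one", "two", "three"), opt_in = TRUE)
subject <- keywords_to_subject (kw)
cat (subject, "\n")
```

Because the marker ends up as an ordinary keyword, users could still delete it by hand before depositing.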

@peterdesmet Can you comment on any "official" frictionless positions on the use of such keywords? Is "frictionlessdata" supported or encouraged, or just something you personally use? (Seems to be the latter from the Zenodo records.) Do you have any advice or recommendations for us to extend upon your own usage to flag our own as a direct extension of frictionless? Any advice or input would be really appreciated :+1: :smile:

peterdesmet commented 1 year ago

What keyword to use (frictionlessdata vs frictionless) was recently brought up in the Frictionless Slack, but I don't think it was conclusive. I have referenced this issue there and I'm tagging Community Manager @sapetti9 here. 😄

Regarding:

> Do you have any advice or recommendations for us to extend upon your own usage to flag our own as a direct extension of frictionless? Any advice or input would be really appreciated

Can you clarify your use case? Is it "what keywords to automatically assign to a deposit in Zenodo/... that was created with the deposits package?"

mpadge commented 1 year ago

> Can you clarify your use case? Is it "what keywords to automatically assign to a deposit in Zenodo/... that was created with the deposits package?"

Yes, that is precisely what I meant. We are intending to have a (likely optional, but possibly default) keyword that we can use to identify all deposits created via this package. And those will also likely include an additional keyword to align with your current "frictionlessdata" usage. So ultimately two keywords.

peterdesmet commented 1 year ago

I think the proper way to do it would be to assign a related identifier with relationType=IsCompiledBy. This is defined in the DataCite Metadata Schema as "indicates B is used to compile or create A". I think this applies here.

In any case, I tried it out for one of the animal tracking datasets I published with the movepub R package: https://doi.org/10.5281/zenodo.5653311 Here's how it looks:

[Screenshot, 2023-03-28 18:06]

mpadge commented 1 year ago

That's a great idea! deposits builds on a DCMI metadata structure, which includes a few terms where that might fit. Zenodo also has a "related_identifiers" field which allows the compiled option, along with the ability to construct custom search queries on any field. So that should work for Zenodo, and by extension for Dryad, which we'll soon expand to. (We currently support figshare too, but full functionality there is not so important.)
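A minimal sketch of what such a related-identifier entry might look like as an R list, and how one could test for it in a record's metadata, follows. The helpers `compiled_by` and `has_compiled_by_marker` are hypothetical, the `identifier`/`relation` element names and the "isCompiledBy" value are assumptions about Zenodo's "related_identifiers" format, and using the package documentation URL as the identifier is only one possible choice.

```r
# Assumed identifier for the marker: the package documentation URL.
deposits_url <- "https://docs.ropensci.org/deposits/"

# Hypothetical helper: one related-identifier entry. The element names and
# the "isCompiledBy" value are assumptions about Zenodo's format.
compiled_by <- function (identifier = deposits_url) {
    list (identifier = identifier, relation = "isCompiledBy")
}

# Hypothetical helper: does a list of related identifiers contain the marker?
has_compiled_by_marker <- function (related_identifiers,
                                    identifier = deposits_url) {
    any (vapply (related_identifiers, function (i) {
        identical (i$relation, "isCompiledBy") &&
            identical (i$identifier, identifier)
    }, logical (1L)))
}

related <- list (
    list (identifier = "https://example.org/some-paper", relation = "isCitedBy"),
    compiled_by ()
)
has_compiled_by_marker (related)
```

How such an entry would be passed through the deposits client's DCMI metadata is left open here.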