rd-alliance / FAIR-data-maturity-model-WG

https://www.rd-alliance.org/group/fair-data-maturity-model-wg/case-statement/fair-data-maturity-model-wg-case-statement

Indicators prioritisation for Findability #30

Closed: bahimc closed this issue 4 years ago

bahimc commented 5 years ago

Dear members of the FAIR Data Maturity Model Working Group,

As a next step towards a common set of core assessment criteria, we started to explore the prioritisation of the indicators derived from the contributions to this very repository. As a result, all the indicators have been ranked according to three degrees of priority.

In the following Google spreadsheet, you can see i) the indicators in their final version and ii) their priorities.


Please have a look at the spreadsheet and let us know your thoughts, suggestions and concerns with regard to the ranking we have attempted.

Thanks in advance for your contribution!

gepeng86 commented 5 years ago

Is it necessary to have two individual indicators from the machine-assessment perspective? If not, perhaps F1-01D and F1-02D can be combined as one: "Data is identified by a universally unique and persistent identifier"

makxdekkers commented 5 years ago

@gepeng86 It might be sufficient to have a combined indicator. We'll be happy to collapse the two to one, if the WG agrees.

However, there might be situations where data is identified by a unique identifier, e.g. an HTTP URI under the domain of an institution, without an explicit persistence policy for those identifiers. With the two separate indicators, the evaluation would then flag the issue and recommend that the institution develop and publish a persistence policy.

kitchenprinzessin3880 commented 5 years ago

@makxdekkers With regard to F1, why are the metrics repeated for data and metadata? I am not sure that, in practice, a separate identifier is assigned to data and its metadata. I would be interested to see real examples of this.

makxdekkers commented 5 years ago

@kitchenprinzessin3880 The duplication of indicators for F1 for data and metadata was proposed by members of the WG in the third meeting (report). Maybe a member of the WG can provide real examples of metadata records having universally unique, persistent identifiers?

Please note that we proposed the metadata-related indicators to be "Recommended", which could translate to "test if relevant".

kitchenprinzessin3880 commented 5 years ago

Please note that we proposed the metadata-related indicators to be "Recommended", which could translate to "test if relevant".

I think this is fine. Thank you :)

gepeng86 commented 5 years ago

@makxdekkers I see your point. I am fine with whatever the WG will agree on - a combined indicator or two separate indicators.

keithjeffery commented 5 years ago
  1. I believe persistence and uniqueness are different properties. If something has a unique ID and is persistent for a limited time, that may still be useful for research.
  2. Certainly EPOS is designed with UUPIDs (Universally Unique Persistent IDs) for metadata records and for data records.
bahimc commented 5 years ago

On 19/08/2019, Mark D Wilkinson commented:

Could the indicators for uniqueness and persistence of identifiers be combined in a single indicator?

Uniqueness and persistence are not synonymous. http://cnn.com is unique, in that it cannot mean anything other than the CNN homepage. It is not persistent, however, because it is not pointing at the same content every time. Persistence is (in part) about the re-use of a GUID to point to another record... as happens all the time with web pages. (Persistence also involves the longevity of the identifier, but that's not part of F(indability).)

Why are there indicators concerning universally unique, persistent identifiers for metadata? Are there any examples where metadata (e.g. a metadata ‘record’) has its own identifier?

Every DOI on earth!

Read his comment here: https://www.rd-alliance.org/group/fair-data-maturity-model-wg/post/discussion-items-fair-data-maturity-model-19-august-2019

makxdekkers commented 5 years ago

@markwilkinson By "Every DOI on earth", do you mean that every DOI is the identifier of the metadata about a data resource, rather than the identifier of the data resource?

markwilkinson commented 5 years ago

Well, in my domain (biosciences) DOIs almost always resolve to a landing page, which generally contains metadata, and a link to the data, which has its own identifier. Moreover, and perhaps more importantly, if you do content-negotiation on the http://doi.org/xxx address, CrossRef or DataCite will intercept your HTTP call, and send you metadata directly! You never get to the source provider.

So, for those two reasons, in my personal experience, every DOI I have ever encountered has been the identifier of a metadata record, not a data record.
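For illustration, a minimal sketch of that content negotiation (the DOI below is a hypothetical placeholder; the Accept media type is the CSL JSON one that CrossRef and DataCite are generally understood to serve, so treat the exact type as an assumption to verify):

```python
import requests

def doi_metadata(doi: str) -> dict:
    """Resolve a DOI while asking for machine-readable metadata.

    With this Accept header, CrossRef or DataCite intercept the call and
    return the metadata record directly; the publisher's landing page is
    never reached.
    """
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

# Hypothetical DOI; without the Accept header, the same request would
# simply redirect to the landing page.
print(doi_metadata("10.1234/example"))
```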

makxdekkers commented 5 years ago

@markwilkinson If the DOI is the persistent identifier of the metadata, does the data also have a persistent identifier? If so, what kind of persistent identifier?

keithjeffery commented 5 years ago

Mark is exposing the 'DOI problem': the DOI/DataCite etc. system has a philosophy and architecture in which humans read web pages (landing pages) and decide what to do; i.e. they read the landing page and decide, on the basis of the metadata (i.e. contextualisation: the relevance and quality of the digital asset described), whether to click the URL leading to the digital asset, with the implicit action of downloading. The good news is that there is usually some metadata on the landing page describing the digital asset, and that within any one domain (but usually not across domains) the metadata schema of the landing page is consistent. Practices vary, but (to answer Makx) usually the DOI points to the landing page, and the ID of the digital asset is simply the URL (a property/attribute/field within the landing page) used as a URI.

Most major research infrastructures have a (sometimes more than one) catalog of digital assets (datasets, data products, services, software, workflows, sensor networks, lab equipment, computers, persons (as experts or managers or...), organisations (as owners of equipment or funders or...)). Query/selection on the metadata records in the catalog (i.e. finding metadata records describing assets that have certain values in properties/attributes/fields) is used to construct (progressively more autonomically) a workflow to satisfy the end-user request.

There are several problems with the DOI approach in the context described above. The major ones are:

1. It precludes autonomic access to assets: despite some efforts, there is no guaranteed consistent formal syntax and semantics for landing pages, so it is difficult to automate finding and using the URL of the digital asset described.
2. Although it is possible (depending on local domain standards) to harvest metadata from landing pages into a canonical metadata catalog, it is not easy and is usually error-prone due to the lack of formal syntax and declared semantics (and the lack of referential and functional integrity) in the landing page metadata.
3. With ever-increasing dataset size and network latency, download is progressively being deprecated in favour of (a) reducing dataset size locally to the actual records required for the purpose, and (b) moving software to the data rather than data to the software (with implications for permissions, computing resource funding, security, privacy etc.).
4. With workflows including multiple datasets and services, especially across several RIs (as, for example, in the environmental science domain), optimisation of the locality of (subsets of) data, software services and computing resources for each processing step becomes critically important. This optimisation requires consistent, rich metadata records for the middleware to use in reasoning to construct an optimal workflow.

Various people have tried various mechanisms to either harvest landing page metadata into a catalog or have the catalog record (with appropriate 'special' attributes) point to the URL within the landing page that points to the digital asset.

There are moves towards the idea of end-users activating services which provide access to data. This is to overcome some of the problems described above - not least resources, permissions, security and privacy. It will be interesting to see how the DOI community evolves in such a scenario.

I realise this is off topic for this indicator, but I wanted to share my experience of the 'DOI problem' as it relates to Findability (and also A, I, R). There is much valuable metadata in DOI landing pages and we need to find a way to integrate properly with 'mainstream' catalogs.

makxdekkers commented 5 years ago

@keithjeffery Would it mean that the usual way of assigning and resolving DOIs, as described by @markwilkinson, in which the DOI in effect identifies the landing page, not the metadata or the data/digital resource, is incompatible with the proposed indicators for F1 and F3?

If so, we need to look again at these indicators, because it would not make sense for the indicators to reject a very common way of publishing data as not FAIR.

keithjeffery commented 5 years ago

I agree fully that we need to find a way to include (as some kind of FAIR) the large amount of assets currently under DOI schemes. As I tried to indicate, the problem arises mainly for autonomic processing of the metadata and assets. My evaluation of DOI-based systems against the indicators at the top of this page is as follows:

- F1-01M: pass
- F1-02M: pass
- F1-01D: fail (URL not necessarily persistent; if a DOI rather than a URL, pass)
- F1-02D: fail (URL may not be unique; if a DOI rather than a URL, pass)
- F2-01M: may pass or fail depending on the community standard and whether harvested to a catalog to be (re-)used with other metadata - here we start to 'spill over' into A, I, R
- F2-02M: fail (in the examples I have seen)
- F3-01M: fail - it is (usually) an address (URL) for the digital asset, not an identifier
- F4-01M: pass or fail. I have seen some examples of harvesting of landing pages but I am not at all sure it is general - perhaps @markwilkinson can advise?
- F4-02M: pass or fail (usually, from what I have seen) depending on the harvesting mechanism and the portal schema
- F4-03M: pass or fail (usually, from what I have seen) depending on the harvesting mechanism and the repository schema

Part of the problem is that, within the DOI community, there are different schemas for the metadata on the landing page (perhaps analogous to application profiles), although for many communities using DOIs there seems to be consensus around using at least some of the Dublin Core schema. I am sure that harvesting of metadata from landing pages to catalogs/repositories can be done on a case-by-case or community-by-community basis (but perhaps not generally) and that the harvesting process may well entail measures to make the metadata more formal and consistent for autonomic processing.

I am not sure if the above helps or not, but maybe we can - as a WG - come to some agreed conclusions that allow the DOI-described metadata and assets to be FAIR. One possibility would be to distinguish - within each of F, A, I and R - autonomic processing from manual processing?

makxdekkers commented 5 years ago

@keithjeffery Based on your analysis, we may need to downgrade F1-01D, F1-02D and F3-01M from mandatory to recommended; otherwise most of the (meta)data in currently common DOI-based publishing approaches would have to be declared not-FAIR. Making these indicators recommended would signal that it would be more FAIR to have persistent identifiers for metadata and for data, not in terms of pass/fail, but as a suggestion for future improvement.

keithjeffery commented 5 years ago

@makxdekkers I believe they could be mandatory for DOI-based systems that utilised one of the mechanisms for ensuring that the UUPID of the asset was both unique and persistent (not a URL), was associated with a URL to access the asset, and that both were consistently identified (as a property/attribute/field) within the metadata stored on the landing page.

Alternatively (and here I'd like to hear @markwilkinson's views), mechanisms have been suggested in Harvey et al., 2015, J. Cheminform., 7:37, DOI 10.1186/s13321-015-0081-7.

The essential underlying problem is one of philosophy: DOI systems were designed for publications where (in general) processing is by a human, and so the 'read landing page and decide' mechanism is appropriate. Using DOI systems for datasets precludes general autonomic access.

makxdekkers commented 5 years ago

@keithjeffery @markwilkinson Would you agree with an approach that acknowledges that the DOI/DataCite/CrossRef-based publication approach may not be exactly what the FAIR principles envisage for the future, but that it is a useful step towards that future? I was thinking that maybe the indicators could be formulated in such a way as to recognise that the DOI really has the intention to identify the data/digital resource, but that the resolution is through a landing page. It would mean that resources published through DataCite and CrossRef would pass F1-01D and F1-02D (because the resource is identified, albeit indirectly, by a DOI) as well as F3 (because the DOI is included in the metadata). Such an approach would avoid declaring a lot of existing data not-FAIR when its publishers follow recommended best practice.

rwwh commented 5 years ago

This is a very important discussion, showing the weakness of any yes/no assessment of FAIRness. IMO current DOI is indeed much more FAIR than a URI, but we should not rest on our laurels and instead continue working on improving future FAIRness.

I don't know who said it first: "FAIR is a journey, not a destination".

nsjuty commented 5 years ago

For F1-01M & 02M it is worth noting that quite often metadata is buried in with the data, so it would be accessed with the same identifier. This may be embedded as schema.org markup or use another metadata schema.
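A sketch of one way metadata can be "buried in with the data": schema.org JSON-LD embedded in the resource's own landing page, so that one identifier serves both (all names, URLs and the DOI below are hypothetical):

```python
import json

# Hypothetical discovery metadata for a dataset, expressed as schema.org.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://doi.org/10.1234/example",
    "name": "Example observation series",
    "identifier": "https://doi.org/10.1234/example",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://repository.example.org/data/series.csv",
    },
}

# Embedded in the landing page, this block is what crawlers and harvesters
# pick up; metadata and data are then reached through the same identifier.
html_snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset_metadata, indent=2)
    + "\n</script>"
)
print(html_snippet)
```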

nsjuty commented 5 years ago

We need to be careful of over-specifying where metadata is harvested from. I'm sure crawlers have no interest in how we have designated a specific resource. We need to allow for availability through boutique resources, community or institutional catalogues, and large-scale generic catalogues or portals. Perhaps the wording of F4-01M, 02M & 03M could be improved, or the three even condensed into one indicator?

makxdekkers commented 5 years ago

@nsjuty

For F1-01M & 02M it is worth noting that quite often metadata is buried in with the data, so it would be accessed with the same identifier. This may be embedded as schema.org markup or use another metadata schema.

In that case, would the same identifier satisfy both the F1 indicators for metadata and for data?

makxdekkers commented 5 years ago

@nsjuty

We need to be careful of over-specifying where metadata is harvested from. I'm sure crawlers have no interest in how we have designated a specific resource. We need to allow for availability through boutique resources, community or institutional catalogues, and large-scale generic catalogues or portals. Perhaps the wording of F4-01M, 02M & 03M could be improved, or the three even condensed into one indicator?

It would be possible to collapse the three indicators to one, for example, based on text I suggested at https://github.com/RDA-FAIR/FAIR-data-maturity-model-WG/issues/18#issuecomment-522314010:

Metadata is offered/published/exposed in such a way that it can be harvested and indexed.

And leave it unspecified how and by whom the harvesting and indexing is done?
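For illustration, one established way of offering metadata "in such a way that it can be harvested" is an OAI-PMH endpoint; a minimal harvesting sketch (the endpoint URL is hypothetical, while the verb, metadata prefix and namespace are standard OAI-PMH):

```python
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "https://repository.example.org/oai"  # hypothetical endpoint

# Standard OAI-PMH request: list all records in Dublin Core format.
response = requests.get(
    OAI_ENDPOINT,
    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"},
    timeout=10,
)
response.raise_for_status()

# Print the OAI identifier of each harvested record.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
root = ET.fromstring(response.text)
for record in root.iter(f"{OAI_NS}record"):
    identifier = record.find(f"./{OAI_NS}header/{OAI_NS}identifier")
    if identifier is not None:
        print(identifier.text)
```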

kitchenprinzessin3880 commented 5 years ago

@nsjuty

We need to be careful of over-specifying where metadata is harvested from. I'm sure crawlers have no interest in how we have designated a specific resource. We need to allow for availability through boutique resources, community or institutional catalogues, and large-scale generic catalogues or portals. Perhaps the wording of F4-01M, 02M & 03M could be improved, or the three even condensed into one indicator?

It would be possible to collapse the three indicators to one, for example, based on text I suggested at #18 (comment):

Metadata is offered/published/exposed in such a way that it can be harvested and indexed.

And leave it unspecified how and by whom the harvesting and indexing is done?

@makxdekkers @nsjuty Collapsing the three indicators to one: +1 (this is simple and straightforward, and allows for other data discovery solutions which we may not be aware of). Further, we could provide some examples of harvesting/indexing mechanisms if required.

hervelh commented 5 years ago

General comment across FAIR. It seems clear across the FAIR indicators that there are dependencies on “discipline-domain”, “cross-domain” and “community” context. I suggest that these will in turn impact whether tests against the indicators can be human-mediated vs machine-mediated. Should the indicators themselves seek to be as generic as possible rather than address human vs machine mediation at this stage? I.e. Generic > Contextual > Human/Machine-specific as the three concentric rings of FAIR evaluation? E.g. Generic (metadata is provided Y/N), Contextual (DDI metadata supports Social Science data), Human/Machine (DDI.xml is machine-harvestable).

hervelh commented 5 years ago

Some of the comments seem to indicate that we have a number of different ‘object models’ in our heads with regard to the data/metadata split. I think we need to guard against assumptions that our own local experience is universally applicable.

E.g. my own PID and object discovery metadata (in DataCite and other resource discovery systems) will likely resolve to a (complex) object which contains a mixture of additional metadata (richer and more discipline-specific than the purely discovery metadata), prose documentation files and data files.

The DataCite (or another PID provider) metadata submitted when I mint my DOI includes a resolution target. The target could be a landing page, a digital object at varying levels of complexity or all my data in a single, tidy data file.

Do most data lifecycle actors simply consider all of these to be part of the (distributed) “object” identified by its PID?

DataCite/PID service providers' local systems are bound to have their own unique/persistent ID for a metadata record, but won’t necessarily expose it. That metadata ‘object’ and its identifier exist, but won’t necessarily form part of most curators’ mental ‘object model’. Differentiation between data and metadata across a vast range of objects is non-trivial for humans as well as machines.

Metadata standards like DDI may support both contextual metadata and the data points themselves in the same file. That said, any object whose DOI is minted through a stable PID system like DataCite can relatively trivially guarantee that the ID is unique and ‘supports’ persistence. However, whether persistence is actually assured over time depends on the practices of the data steward (repository etc.).

* Uniqueness and persistence are certainly different properties. Using (e.g.) DataCite ensures uniqueness computationally. Eternal persistence cannot be guaranteed for any object, but “A2: metadata are accessible, even when the data are no longer available” clarifies that metadata persistence beyond the life or accessibility of the object is a clear goal.

* Re: the comment asking if we should “recognise that the DOI really has the intention to identify the data/digital resource but that the resolution is through a landing page.”

Yes, because: a. the comment already contains the assumption that ‘data’ and ‘digital resource’ are not synonymous, and b. for any digital object which requires an auth/auth process, the PID resolution will be to some intermediate (landing) target. This will always create some level of barrier to “data” access, including for indicator assessment.

* Re: “This is a very important discussion, showing the weakness of any yes/no assessment of FAIRness.”

Agreed; the challenge is to move forward with developing indicators and associated assessments without penalising metadata/data/objects (or their stewards) which don’t meet initial assumptions.

* “F1 F1-02D Data is identified by a universally unique identifier”, whereas the principle uses “globally”. Unless we’re seeking to revise the principles, should the language remain the same?

* “F4 F4-01M Metadata or landing page is harvested by general search engine”; “F4 F4-02M Metadata is harvested by or submitted to domain/discipline-specific portal”

I concur with prior comments on harvest vs submit. The push vs pull is not the critical factor.

Is the goal here to ensure that metadata works with, and is available through, both general-purpose resource systems and domain/discipline-specific resource discovery systems?

* “F4 F4-03M Metadata is indexed in institutional repository”

Is there a reason for limiting this to an “institutional” repository (as opposed to a domain/discipline-specific or another type of repository)?

If improved findability (through managed persistence) of the metadata is the argument for recommending it be in a repository, shouldn’t there be an equivalent indicator recommending the data be stored in a repository?

makxdekkers commented 5 years ago

@hervelh

Should the indicators themselves seek to be as generic as possible rather than address human vs machine mediation at this stage?

I think it is most realistic to start at the generic level. There are references to discipline-specific standards where possible, but there is also the possibility to add indicators for discipline-specific applications -- the idea is that the model is extensible. A set of 'core' criteria -- which the group is chartered to develop -- is intended to allow comparisons across domains; additional criteria would add more specific requirements for a particular domain.

makxdekkers commented 5 years ago

@hervelh

One comment suggests translating “recommended” as “test if relevant”. Instead, could we add a category of “mandatory if applicable”? I assume applicability will depend on “discipline-domain”, “cross-domain” and “community” context?

We can of course consider adding a new category. However, we tried to apply the categories from RFC 2119, which defines:

SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

It seems to me that there is not much difference between "test if relevant" and "mandatory if applicable". The problem for evaluation is in both cases to determine whether the characteristic is 'relevant' or 'applicable' -- this might be subjective.

makxdekkers commented 5 years ago

@hervelh

“F1 F1-02D Data is identified by a universally unique identifier”, whereas the principle uses “globally”. Unless we’re seeking to revise the principles, should the language remain the same?

This formulation was proposed and adopted by the WG in workshop 3.

makxdekkers commented 5 years ago

@hervelh

Is the goal here to ensure that metadata works with, and is available through, both general-purpose resource systems and domain/discipline-specific resource discovery systems?

The idea is that the indicators would be either/or. Please note there is a proposal at https://github.com/RDA-FAIR/FAIR-data-maturity-model-WG/issues/18#issuecomment-526537395 to collapse the three indicators for F4 to a single one.

Metadata is offered/published/exposed in such a way that it can be harvested and indexed.

rwwh commented 5 years ago

It seems to me that there is not much difference between "test if relevant" and "mandatory if applicable". The problem for evaluation is in both cases to determine whether the characteristic is 'relevant' or 'applicable' -- this might be subjective. (@makxdekkers )

One potentially significant difference is that "test if relevant" gives no test result if it is irrelevant, and "mandatory if applicable" may give a non-fatal but failed test result. This can result in different calculations of composite scores.

makxdekkers commented 5 years ago

@rwwh @hervelh Another way to resolve this might be to define 'Recommended' as 'mandatory if applicable'. Would that work? Is there a need for a separate category 'test if relevant'?

andrasholl commented 5 years ago

A general comment: I think references to publications related to the given data set would be a good thing to recommend.

makxdekkers commented 5 years ago

@andrasholl Would links between datasets (and other digital objects) and publications be recorded in the metadata for the dataset, or would it rather be in the metadata for the publication? If in the metadata for the dataset, the metadata would need to be updated every time someone publishes a paper that references it; if in the metadata for the publication, there is no need to update the metadata later.

andrasholl commented 5 years ago

Hi!

I would say both ways. Publications should always refer to the data used - this way is already established. But if there is a publication about the project generating the data, about the instrument(s) used, or even better, a publication based on the dataset in question, that should be referred to in the metadata of the dataset. Publications are for human readers, but often they provide the deepest, most complex background information on the dataset.

One could imagine a scenario where a researcher does observations/simulations/experiments, then writes a paper with the results, and in the paper includes a reference (a DOI) to the datasets used, while the datasets include in their metadata a reference (a DOI) to the publication or the preprint (say, in arXiv).

Cheers, Andras


keithjeffery commented 5 years ago

@andrasholl There is relevant discussion on this under 'rich metadata'. Basically, @markwilkinson and I were agreeing vehemently that the metadata has to be a fully connected graph, so that it is possible to have meaningful relationships expressed between all assets (data, publication, service, software, computing resource, person, institution, lab equipment, sensors...).

@makxdekkers I think the links are not recorded in the 'metadata record' (in the catalog card sense) of a dataset but live as separate 'linking objects' created at a certain time for a certain purpose (i.e. a relationship between two UUPIDs with role and temporal duration). This avoids the update problem on the base entities (dataset, publication etc) and respects referential and functional integrity (the dataset(s) and publication(s) exist whether or not a relationship between them is recorded).

andrasholl commented 5 years ago

If there are relevant publications available at the time of the data deposit, references to them could be included in a static way in the metadata. On the other hand, data centres could mine data citations from CrossRef and provide the list of citing papers - just as is done for publications.

makxdekkers commented 5 years ago

@keithjeffery If the links are not in the metadata but "live as separate 'linking objects'", where do they live and who maintains them? Is what you argue similar to what @andrasholl calls "data citations" mined by data centres and kept in a list of citing papers? If this is what you mean, how would you want to formulate an indicator?

makxdekkers commented 5 years ago

@keithjeffery @andrasholl Could this also be part of the wider discussion on 'rich metadata' and 'plurality of attributes'?

keithjeffery commented 5 years ago

@makxdekkers I apologise for being unclear. I am suggesting that metadata may be separated into that concerning digital objects of interest (e.g. datasets, publications, persons, organisations etc.) and that concerning relationships (links) between them: e.g. dataset A is linked to publication S with role 'referenced' and temporal duration (datetime start yyyymmdd, datetime end yyyymmdd); dataset A is linked to organisation O with role 'owner' (datetime start yyyymmdd, datetime end yyyymmdd); publication S is linked to person P with role 'author' (datetime start yyyymmdd, datetime end yyyymmdd); and so on. This provides a rich, fully connected graph of metadata which has both referential and functional integrity and, as a by-product, provides curation and provenance information. The information for the links could indeed be mined from the various sources, but it depends on UUPIDs for the base objects (e.g. dataset, publication, person) and vocabulary control for the roles. Indeed this links to the discussion on rich metadata!
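A minimal sketch of such 'linking objects' as separate records (the UUPIDs, roles and dates below are hypothetical placeholders):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Link:
    """A relationship between two base objects, stored outside their records."""
    source_pid: str             # UUPID of the source object, e.g. dataset A
    target_pid: str             # UUPID of the target object, e.g. publication S
    role: str                   # from a controlled vocabulary
    start: date                 # temporal duration: start
    end: Optional[date] = None  # open-ended while the relationship holds

links = [
    Link("uupid:dataset-A", "uupid:publication-S", "referenced", date(2019, 1, 15)),
    Link("uupid:dataset-A", "uupid:organisation-O", "owner", date(2018, 6, 1)),
    Link("uupid:publication-S", "uupid:person-P", "author", date(2019, 1, 15)),
]

# Because links live outside the base metadata records, recording a new
# citation never requires updating the dataset's own record, and referential
# integrity holds: the dataset and publication exist whether or not a link
# between them is recorded.
```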

hervelh commented 5 years ago

@hervelh

Should the indicators themselves seek to be as generic as possible rather than address human vs machine mediation at this stage?

I think it is most realistic to start at the generic level. There are references to discipline-specific standards where possible, but there is also the possibility to add indicators for discipline-specific applications -- the idea is that the model is extensible. A set of 'core' criteria -- which the group is chartered to develop -- is intended to allow comparisons across domains; additional criteria would add more specific requirements for a particular domain.

@makxdekkers A set of globally applicable indicators and a supporting extensibility model which supports 'local' context sounds effective. Maintaining consistency across more explicit discipline/domain indicators is desirable.

hervelh commented 5 years ago

@hervelh

“F1 F1-02D Data is identified by a universally unique identifier”, whereas the principle uses “globally”. Unless we’re seeking to revise the principles, should the language remain the same?

This formulation was proposed and adopted by the WG in workshop 3.

@makxdekkers Is there a DOI for a canonical version of the FAIR Principles which could be versioned over time?

hervelh commented 5 years ago

@makxdekkers @rwwh I support the use of RFC 2119 and would certainly avoid using the terms in any way which conflicts with the RFC. The discussion might be whether (and how) to apply any additional validation/relevancy checks at the tier beyond the current globally applicable indicators, i.e. extensibility to the domain/discipline-specific indicators. I get the feeling that beyond the global level there will always be a degree of subjectivity, which I acknowledge will impact testing.

hervelh commented 5 years ago

@hervelh

Is the goal here to ensure that metadata works with, and is available through, both general-purpose resource systems and domain/discipline-specific resource discovery systems?

The idea is that the indicators would be either/or. Please note there is a proposal at #18 (comment) to collapse the three indicators for F4 to a single one.

Metadata is offered/published/exposed in such a way that it can be harvested and indexed.

@makxdekkers Perhaps, if the indicators are collapsed, the ‘type’ of resource discovery system would be addressed at a discipline/domain context level? I’d certainly suggest that best FAIR practice for research data would be to have the metadata findable in both general-purpose and discipline-specific resources?

hervelh commented 5 years ago

@keithjeffery @andrasholl Could this also be part of the wider discussion on 'rich metadata' and 'plurality of attributes'?

@keithjeffery @andrasholl @makxdekkers I suggest this is also I3 (Qualified References). @keithjeffery presents a tempting vision of the linked graph of complex digital (meta)data objects. This is one possible object model. But at this global indicator level I assume that we need to cover a range of object types: those which are part of a perfectly organised graph, complex collections of semi-structured files, or a nice neat single file of rectangular data (and the rest).

gepeng86 commented 5 years ago

Metadata is offered/published/exposed in such a way that it can be harvested and indexed.

One question about this one kept coming back to me:

Is this indicator intended to measure the way (offered/published/exposed) or the end state (harvested/indexed)?

Perhaps just additional explanation is needed when finalizing it.

makxdekkers commented 5 years ago

@hervelh

Is there a DOI for a canonical version of the FAIR Principles which could be versioned over time?

This is a question for the people who conceived the FAIR principles. Maybe @markwilkinson , @SusannaSansone, others?

makxdekkers commented 5 years ago

@gepeng86 The way the indicator is formulated is about the requirement on the data provider to make sure the metadata is available for harvesting and indexing. The data provider should try to get the metadata indexed but the actual harvesting and indexing is the responsibility of the search system (portal, repository etc.). Should the requirement be stronger for the data provider?

gepeng86 commented 5 years ago

@makxdekkers Thanks for your quick response. As a recommended indicator, the current requirement on the data provider/publisher should be fine.

makxdekkers commented 5 years ago

@gepeng86 It could maybe be argued that the indicator should be mandatory. I find it hard to imagine situations where a data provider would have a good reason not to offer, publish or expose the metadata.

helenp commented 5 years ago

Metadata is indexed in an institutional repository: why institutional? How many of these are discoverable or open? The standard in life sciences for public data is to use community repositories; I suggest rephrasing to 'an appropriate repository'.