Pacific portals - Githubissues

pieterprovoost commented 4 years ago

The goal of this issue is to review the technology, standards, and readiness level of the Pacific data portals. The following SPREP and SPC portals have been identified as potential data sources for the OceanInfoHub in the PSIDS region:

- Pacific Environmental Portal
  - technology: Drupal, CKAN (seems to harvest from national CKAN data portals)
  - data types: datasets (including maritime boundaries), publications
  - structured data: none detected using Structured Data Testing Tool, RDF export links often do not work
- Pacific Data Hub
  - technology: Drupal, DKAN, DCAT-RDF
  - content types: datasets, publications
  - structured data: none detected using Structured Data Testing Tool
- Pacific Ocean Portal
  - technology: Python, WMS
  - data types: spatial datasets
  - structured data: none
- Pacific Catastrophe Risk Assessment and Financing Initiative (PCRAFI)
  - data types: spatial datasets, documents
  - technology: GeoNode, CSW, WMS, WFS, WMTS, OAI-PMH, OpenSearch
  - structured data: none detected using Structured Data Testing Tool
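
For anyone wanting to repeat the structured data check without the testing tool, here is a minimal sketch using only the Python standard library. It looks only for embedded JSON-LD (`<script type="application/ld+json">`), so it is narrower than the Structured Data Testing Tool, which also reads Microdata and RDFa. The sample HTML strings and the names `JsonLdExtractor` and `find_structured_data` are made up for illustration; in practice you would feed in the HTML of a portal landing page fetched by whatever means you prefer.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None, so guard before lowercasing.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                # Malformed JSON-LD is skipped rather than treated as structured data.
                pass


def find_structured_data(html):
    """Return a list of JSON-LD documents embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical sample pages, not taken from any of the portals above.
SAMPLE_WITH = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example maritime boundaries"}
</script>
</head><body></body></html>"""
SAMPLE_WITHOUT = "<html><head><title>Portal</title></head><body></body></html>"

found = find_structured_data(SAMPLE_WITH)
missing = find_structured_data(SAMPLE_WITHOUT)
```

An empty list corresponds to the "none detected" result in the list above, at least as far as JSON-LD is concerned.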

skybristol commented 4 years ago

What I'm actually most interested in for thinking about the various hubs is something a little deeper into the specific content expressed from the various potential contributors. You are getting at that a little bit in your list with the "content types" piece, but I would be interested in just a slight bit more detail in terms of what types of data, publications, etc. the various hubs we're evaluating are providing.

Even having something like the topical indexing approach that the SPREP-PROE example uses is very useful, and evaluating hubs based on whether or not they do provide this type of value-added organization to their content would be useful.

skybristol commented 4 years ago

I would also further break up thinking about the content from repositories into two parts - primary and secondary. Primary contributions are going to be the things that the mission of whatever hub is most concerned with and responsible for. What is the core purpose and mission of the hub, and what does that determine in terms of what they are managing and serving up? Primary information may well be unique to the particular hub and the only or best place for getting that information in the whole network. Secondary information would be the stuff the hub has information about but that may be more appropriately linked to and sourced from either some other hub or "the commons." This categorization gives us an opportunity to look at a couple of dynamics we should be thinking about with regard to hub evaluation.

The fourth example, PCRAFI, appears to be more of a primary repository for certain types of data important to the mission of the project. They've adopted some reasonable technologies that make the system fairly open and accessible to both humans and algorithms, but there are some significant content problems that put its overall utility outside that particular context in some jeopardy. The tech chosen, and the syntactic standards that come along with it, mean that there is great potential for this system to provide rich metadata and important points of interoperability, such as exposure of underlying data models and even data summarization as a service. Unfortunately, it appears that most data served by this system have received minimal data stewardship treatment in terms of the details that would exploit those strengths of the platform.

Aggregators, like the first three in the list, are an interesting case in themselves. Aggregators are in a position to add interesting value to the network, but they may or may not be operating in such a way that the value is actually realized. These values also need to be a part of our thinking as we look to describe the ODIS-Architecture.

- They can serve as a performance buffer, providing sustainable access to the original sources they are aggregating that may not be as robust or reliable or could even go dark over time, leaving the aggregation point as the best available online access.
- They can perform some type of data harmonization and enhancement work that adds additional value on top of what their sources provide. For metadata aggregators, this might include things like harmonizing contact information for people and organizations, or analyzing for duplicate records and then either making de-duplication decisions or flagging the duplicates for users. For data aggregators, this might include steps to align data with standards, running data transformation steps, or adding additional properties derived from the source.
- They may operate quality assurance analysis and flagging in the data to help downstream users make decisions.
- They may provide different formats of the data or service/API access such as spatial services that support online visualization, query and subsetting, statistical summarization, or other value-added access points to the data.
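
As a rough illustration of the duplicate analysis a metadata aggregator might run, here is a minimal sketch that flags possible duplicates instead of deleting them. It assumes harvested records are dictionaries with `id` and `title` keys and that matching on a normalized title is good enough; both are assumptions for the example, not a description of how any of these portals work. A real aggregator would probably also compare identifiers, publishers, and spatial or temporal extents.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def flag_duplicates(records):
    """Return copies of the records with a 'possible_duplicate_of' list added."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_title(record["title"])].append(record["id"])

    flagged = []
    for record in records:
        same_title = groups[normalize_title(record["title"])]
        others = [rid for rid in same_title if rid != record["id"]]
        # Flag only; the de-duplication decision is left to downstream users.
        flagged.append({**record, "possible_duplicate_of": others})
    return flagged


# Hypothetical harvested records from two different source portals.
records = [
    {"id": "a1", "title": "Maritime Boundaries, Fiji"},
    {"id": "b7", "title": "maritime boundaries  fiji"},
    {"id": "c3", "title": "Coral reef survey"},
]
flagged = flag_duplicates(records)
```

Flagging keeps the original records intact, which fits the transparency point: the aggregator records what it noticed and leaves the judgment visible.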

For any of these things, it's important for the value-added services to "declare themselves" and provide transparency into what they are doing, decisions they've made along the way, uncertainties they may have introduced, and other dynamics so that downstream users can be aware. There are useful standards, such as W3C-PROV, to help encode and share this type of information in more usable and robust ways.
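
To make the "declare themselves" idea a bit more concrete, here is a minimal sketch of a provenance statement for a harmonized record, written as JSON-LD using terms from the W3C PROV-O vocabulary (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:wasDerivedFrom`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:endedAtTime`). The function name, the example.org URLs, the labels, and the timestamp are all hypothetical, and this is only one of several reasonable ways to shape such a statement.

```python
import json


def describe_harmonization(source_id, derived_id, activity_label, agent_label, ended_at):
    """Build a JSON-LD dict saying derived_id was produced from source_id."""
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
            "xsd": "http://www.w3.org/2001/XMLSchema#",
        },
        "@id": derived_id,
        "@type": "prov:Entity",
        # The original record at the source hub.
        "prov:wasDerivedFrom": {"@id": source_id},
        # What the aggregator did, when it finished, and who is responsible.
        "prov:wasGeneratedBy": {
            "@type": "prov:Activity",
            "rdfs:label": activity_label,
            "prov:endedAtTime": {"@value": ended_at, "@type": "xsd:dateTime"},
            "prov:wasAssociatedWith": {
                "@type": "prov:Agent",
                "rdfs:label": agent_label,
            },
        },
    }


# Hypothetical identifiers and labels, for illustration only.
statement = describe_harmonization(
    source_id="https://example.org/national-portal/dataset/123",
    derived_id="https://example.org/aggregator/dataset/abc",
    activity_label="Contact information harmonized; CRS aligned to EPSG:4326",
    agent_label="Example regional aggregator",
    ended_at="2021-01-15T00:00:00Z",
)
serialized = json.dumps(statement, indent=2)
```

Publishing a statement like this alongside each derived record lets downstream users see what was changed and trace it back to the source.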

From the ODIS-Arch perspective and implementation of OIH, we may want to work in a concept of "important, but low maturity level hubs." I know that could come across as somewhat arrogant, so we'll have to work on the semantics of our messaging. We may decide that the information content that a given hub serves is conceptually important enough that the hub should be considered an active part of the network. We would register it and promote it in visualizations and advertisements of the network, and we would exercise "Global Hub" software on it to test its functionality and the reach of its content. We might slurp up its metadata into a Global Hub index and make it available in search results. However, the low maturity level means that it would likely have limited value to any other hub with more focused needs. It would be a safe bet that records from low maturity hubs in our Global Hub index would not see much use and might even receive frowny faces, because users still need to travel elsewhere and learn/understand a new context to make use of things they find.

For either of these types of cases, an examination of the recently published TRUST principles that came out of RDA work would be instructive. Building on the idea of FAIR, the TRUST principles are aimed more at the practice of data stewardship and the operation of data repositories. Making good on those principles in whatever way works for the cultural and organizational context of our OIH hubs is going to result in a higher overall maturity level of the system and the information it is putting on the network. We might also look to the work of Ge Peng and others on data stewardship maturity, and to the processes for measuring and improving it that are in use at NOAA and elsewhere, for principles and methods we want to bring into the architecture.

iodepo / odis-arch

Pacific portals #3