ioos / ioos-code-sprint

Information about IOOS Code Sprint activities.
https://ioos.github.io/ioos-code-sprint/
MIT License

[Project Proposal]: AI Ready Data #32

Open hmoustahfid-NOAA opened 5 months ago

hmoustahfid-NOAA commented 5 months ago

Project Description

The ESIP community developed a checklist to examine the AI-readiness of open environmental datasets. Early evaluation shows that the checklist asks questions but does not score the answers, so it is up to the user to guess whether all "no" answers are equally bad. It is worth evaluating this checklist and developing a potential checker for IOOS RAs to make sure our datasets are AI ready. We have started circulating a questionnaire to NOS offices and programs and are getting some interesting answers. To gather more information from RAs, we will circulate the same questionnaire to them. Please take some time to consider the following questions for each of your program offices. The ESIP community-developed AI-readiness checklist may be helpful in answering these questions.

1- Would you consider your program offices' open environmental data AI ready?
2- Are there particular aspects/parameters for your data that prevent it from being AI ready?
3- Have you/anyone in your program office used an AI-readiness checklist before?
4- For current projects leveraging AI, were there additional measures taken to get the data AI ready? What were they?

Potential expected Outcomes

Prototype an AI-ready data checker or assessment tool that can score each checklist question. Again, we will be listening to y’all for other ideas.
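As a rough illustration of what "scoring each checklist question" could mean, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the answer scale (`yes` / `partial` / `no`), the per-question weights, and the example questions are hypothetical and are not taken from the ESIP checklist, which does not define scores.

```python
from dataclasses import dataclass

# Hypothetical answer scale: the ESIP checklist does not define one.
ANSWER_SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}


@dataclass
class ChecklistItem:
    question: str   # text of the checklist question
    weight: float   # hypothetical importance weight (larger = more important)
    answer: str     # one of the keys in ANSWER_SCORES


def score_checklist(items):
    """Return a weighted readiness score between 0.0 and 1.0."""
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("at least one item must have a positive weight")
    earned = sum(item.weight * ANSWER_SCORES[item.answer] for item in items)
    return earned / total_weight


# Illustrative questions only; they are not quoted from the checklist.
example_items = [
    ChecklistItem("Is there a data dictionary?", weight=3.0, answer="yes"),
    ChecklistItem("Is the data available in a cloud-friendly format?", weight=1.0, answer="no"),
]
print(f"Readiness score: {score_checklist(example_items):.2f}")
```

With weights like these, a "no" on a low-weight question costs less than a "no" on a high-weight one, which addresses the concern that all "no" answers otherwise look equally bad. Choosing the actual weights would be a community decision.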

Skills required

Data science; understanding of AI-ready data (ESIP: https://www.esipfed.org/merge/collaboration-updates/checklist-ai-ready-data); data management; coding

Expertise

Intermediate

Topic Lead(s)

TBD (it would be great to have someone from this team). Proposed by Hassan Moustahfid, Micah Wengren and Matt Biddle (IOOS team).

Relevant links

ESIP AI Ready Data Checklist

Here are some examples of existing compliance checkers and assessment tools to evaluate and learn from when developing an AI-ready data checker or assessment tool that responds to the needs of the AI community. See the project description above for the concerns and challenges people face when evaluating whether they comply with the ESIP AI Ready Data checklist.

| Examples | Organization | What's it for? |
|---|---|---|
| IOOS Compliance Checker | NOAA IOOS | The IOOS Compliance Checker is a Python-based tool for data providers to check for completeness and community standard compliance of local or remote netCDF files against CF and ACDD file standards. The Python module can be used as a command-line tool or as a library that can be integrated into other software. |
| FAIR Data Assessment Tool (F-UJI) | PANGAEA | F-UJI is a web service to programmatically assess FAIRness of research data objects at the dataset level based on the FAIRsFAIR Data Object Assessment Metrics. |
| pyQuARC | NASA | Open Source Library for Earth Observation Metadata Quality Assessment |
| FAIR-Checker | Looks like a French group | Assess FAIRness |

Tentative workplan

Tasks
1- Review the ESIP AI Ready Checklist (focus on Training Data level 3 Metadata?)
2- Evaluate existing checkers (differences, commonalities, etc.)
3- Select a checker to build on/prototype, etc.

MathewBiddle commented 4 months ago

@hmoustahfid-NOAA can you elaborate on what you are hoping to get out of a code sprint topic on AI ready data? Is this building some sort of checker that you can run a dataset through to tell you if it's AI ready? Or is this evaluating how well we are already doing this?

It would be good to include some examples if those are available too.

jonmjoyce commented 3 months ago

This is an area we've been thinking about recently, specifically for model data output. Of course, there may be different definitions of AI-ready for different datasets. For model data, I think it first needs to be Analysis Ready, Cloud Optimized (ARCO), which we can achieve through kerchunk. This makes it possible to read massive amounts of data from the cloud without having to copy/download the training set. Second, we need to be able to logically categorize and organize that data, and perhaps also label it. Being able to label specific features (e.g. this is what a hurricane looks like) on the grids themselves is probably also useful. Lastly, we need to be able to extract the data in a consistent manner, which may require regridding, unstructured node selection, etc. depending on the grid type, especially if multiple models are involved. We are already addressing these challenges through Xpublish, so leveraging that work may make sense here too.

hmoustahfid-NOAA commented 3 months ago

I guess it will be nice to build some sort of checker, but before we build it we may need to evaluate how well we are already doing this.

MathewBiddle commented 2 months ago

Thank you for taking the time to propose this topic! From the Code Sprint topic survey, this has garnered a lot of interest.

Following the contributing guidelines on selecting a code sprint topic I have assigned this topic to @hmoustahfid-NOAA . Unless indicated otherwise, the assignee will be responsible for identifying a plan for the code sprint topic, establishing a team, and taking the lead on executing said plan. The first action for the lead is to:

megancromwell commented 1 month ago

AI Ready Data begins with data that meets the minimum requirements for data that is FAIR and builds out from there. How do we document what the outward facets are, then build guidance and tools to empower people to make data AI Ready? Thinking about the existing ESIP checklist, the lack of scoring mentioned above, and other ambiguities in this emerging field led to thinking about data maturity indexes in general, which the environmental data community is still struggling with in various ways. Here are a few publications that may be useful background information.

A capability maturity model for scientific data management - https://doi.org/10.1002/meet.14504701359

A Unified Framework for Measuring Stewardship Practices Applied to Digital Environmental Datasets - https://datascience.codata.org/articles/10.2481/dsj.14-049

MathewBiddle commented 1 month ago

@hmoustahfid-NOAA just a reminder to create the sprint topic webpage soon. See https://github.com/ioos/ioos-code-sprint/issues/32#issuecomment-2093488236

MathewBiddle commented 1 month ago

@hmoustahfid-NOAA I went ahead and created a topic page for this. https://ioos.github.io/ioos-code-sprint/2024/topics/07-ai-ready-data-checker.html

Feel free to make adjustments using the "Improve this page" link at the bottom of the page.

jcermauwedu commented 1 month ago

A good first step is to sift through the checklist to (1) determine what is already being tested by the existing checkers and (2) determine which checklist items could potentially be added with an AI-type checker. Attached is the current list of checks and a brief description of each: all_checks.txt
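The two-step sift described above amounts to a coverage comparison, which could be sketched in Python as below. This is only a sketch under assumptions: the contents of all_checks.txt are not shown in this thread, so the check names and checklist items used here are hypothetical placeholders, and the mapping from checklist items to candidate checks would have to be built by hand.

```python
def split_by_coverage(item_to_checks, existing_checks):
    """Split checklist items into those covered by existing checks and those not.

    item_to_checks: dict mapping a checklist item to the names of checks that
        would satisfy it (a hand-built, hypothetical mapping).
    existing_checks: iterable of check names the current checkers implement.
    """
    existing = set(existing_checks)
    covered = {}     # item -> sorted list of existing checks that address it
    uncovered = []   # items no existing check addresses
    for item, candidate_checks in item_to_checks.items():
        matches = sorted(set(candidate_checks) & existing)
        if matches:
            covered[item] = matches
        else:
            uncovered.append(item)
    return covered, uncovered


# Placeholder names for illustration; not the real contents of all_checks.txt.
existing_checks = ["check_creator_name", "check_units", "check_time_coverage"]
item_to_checks = {
    "Creator is documented": ["check_creator_name"],
    "Units are documented": ["check_units", "check_standard_name"],
    "Known biases are documented": ["check_bias_statement"],
}

covered, uncovered = split_by_coverage(item_to_checks, existing_checks)
print("Already tested:", covered)
print("Candidates for an AI-type checker:", uncovered)
```

The `uncovered` list is the starting point for step (2): those are the checklist items a new AI-readiness checker would need to add.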

jonmjoyce commented 1 month ago

It would be helpful for me (who is not a scientist) to see a few use-cases / user stories for how IOOS data might be used for AI. What are some datasets that are likely to be important? Mostly obs data? Any model output? Reanalysis?

In addition, what are some of the domain-specific considerations when selecting IOOS data to train a model? What are the biases that need to be addressed? E.g. seasonality, extreme weather events, Atlantic vs Pacific... Obviously solving those problems will be up to the model implementer, but what sorts of guidance or data statistics are important to capture to more easily guide potential implementers toward selecting appropriate training/test/validation sets?

fgayanilo commented 1 month ago

The NOAA IOOS Compliance Checker uses, among others, the NOAA IOOS Metadata Profile (1.2; https://ioos.github.io/ioos-metadata/) as a reference. Mapping the requirements as drafted by the AI-Data Ready ESIP Cluster (Doug Rao, Lead) indicates that the NOAA IOOS Profile fulfills many of the requirements of the AI Data Ready Checklist. However, the profile may need some minor adjustments so that the Compliance Checker can use it to tag whether data is AI-ready. For example, the creator's name is tagged as 'Recommended' in the IOOS Metadata Profile, but it is required in the ESIP AI Ready Checklist. With the help of the ESIP AI Data Ready Cluster, the mapping of the requirements to the IOOS Metadata Profile will need to be completed before the IOOS Compliance Checker can be updated.

The team will continue the discussion until the mapping of the requirements is completed.
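To make the mapping exercise concrete, here is a small Python sketch that flags attributes whose requirement level in the checklist is stricter than in the profile. Only the `creator_name` case (Recommended in the profile, required in the checklist) comes from the discussion above; the level ranking, the other attribute entries, and the function name are hypothetical and shown only for illustration.

```python
# Hypothetical ordering of requirement levels, from weakest to strongest.
LEVEL_RANK = {"absent": 0, "recommended": 1, "required": 2}


def find_stricter_requirements(profile_levels, checklist_levels):
    """Return attributes the checklist treats more strictly than the profile.

    Both arguments map attribute name -> requirement level. Attributes missing
    from the profile are treated as "absent". The result maps each flagged
    attribute to a (profile_level, checklist_level) pair.
    """
    gaps = {}
    for attribute, wanted in checklist_levels.items():
        current = profile_levels.get(attribute, "absent")
        if LEVEL_RANK[current] < LEVEL_RANK[wanted]:
            gaps[attribute] = (current, wanted)
    return gaps


# creator_name reflects the example in the thread; other rows are illustrative.
profile_levels = {"creator_name": "recommended", "title": "required"}
checklist_levels = {
    "creator_name": "required",
    "title": "required",
    "bias_statement": "recommended",  # hypothetical attribute
}

for attribute, (current, wanted) in find_stricter_requirements(
    profile_levels, checklist_levels
).items():
    print(f"{attribute}: profile says {current}, checklist says {wanted}")
```

A table like this, once filled in from the real mapping, would list exactly the profile adjustments needed before the Compliance Checker could tag a dataset as AI-ready.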

hmoustahfid-NOAA commented 1 month ago

Following on from yesterday's discussion (see Felimon's recap), the compliance checker team (Filipe, Rob, Micah and Hassan) discussed a way forward to update the IOOS Metadata Profile.

Action:

1- Given that the NCEI NetCDF template served as the model for the IOOS Metadata Profile, working with NCEI would be the best course of action, because NCEI is currently updating its template to include needs from the ESIP AI-ready data checklist. IOOS could then update the IOOS Metadata Profile using the revised NCEI template.

2- Hassan contacted Doug Rao (NCAI) and Megan Cromwell (NOS) to discuss a way forward to work with NCEI on implementing AI-ready data requirements, etc.

I hope this will serve as a solid model for creating a reliable procedure to update the IOOS Metadata Profile should new requirements arise in the future.

TBC