natcap / invest

InVEST®: models that map and value the goods and services from nature that sustain and fulfill human life.
Apache License 2.0

Document origins of sample data #543

Open emlys opened 3 years ago

emlys commented 3 years ago

This came up today related to a forum post, but has come up before. The origins of the sample data sets for many models have been lost in the sands of time. Some are legitimate data, some were legitimate but have since been edited, and some are completely made up. Without knowing which is which, we can't recommend that data to users (and some users probably use the sample data without asking us, assuming it's valid).

This will involve talking to the scientists for each data set we're not sure of. The findings should be documented in the user's guide and/or in a file in the sample data repo. If that's too difficult, at the very least we should clearly mark the sample data as "for illustrative purposes only".

phargogh commented 3 years ago

It would also be great to see if there's a standard, machine-readable (metadata) way to document these things, ideally with established geospatial metadata standards. @cybersea @davemfish would you know of any standard ways that this is usually done?

cybersea commented 3 years ago

Definitely. I second the importance of documenting our sample data to help everyone know where it comes from and what its appropriate uses are.

FGDC now uses an ISO metadata standard for geospatial data: https://www.fgdc.gov/metadata/geospatial-metadata-standards. There are a few others that may have fewer required fields, but I would recommend sticking with the ISO standard if possible. Not exhaustive, but see, for example: https://wcodp.readthedocs.io/metadata/metadata.html#metadata-standards-and-formats. I'm happy to discuss further what might be most feasible and useful for this purpose.

The Stanford Geospatial Center has someone who could help with creating this. I think she has a workflow that includes filling out fields in a spreadsheet, and then she has some scripts or tools to convert it to the correct XML format. There are also a few online tools for creating metadata.

davemfish commented 3 years ago

Let's not forget that Stacie (@newtpatrol) created metadata READMEs for 6 of the models already!

Following a standard definitely seems like the right thing to do. I completely defer to Allison on that. I've never worked with "standard" metadata myself, and I think the main reason is that I don't actually know how to conveniently read that data. I guess GIS software is designed to read & format those XML files? Personally, I prefer a text file I can quickly read while I'm browsing the filesystem, rather than having to load every dataset into GIS to see its metadata. Also, we have CSVs that need metadata too.

Any tips for easy command-line metadata readers?
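
(For what it's worth, the kind of quick look I'm imagining is something like this rough sketch using GDAL's Python bindings, which InVEST already depends on; the path is hypothetical.)

from osgeo import gdal

gdal.UseExceptions()

# open a sample raster and dump whatever metadata is embedded in it,
# without loading it into a full GIS
dataset = gdal.Open('SDR/sample_dem.tif')   # hypothetical path
print(gdal.Info(dataset))                   # gdalinfo-style report
print(dataset.GetMetadata())                # just the key/value metadata tags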

cybersea commented 3 years ago

Yes, the human-readable versus machine-readable conundrum. I also find that standardized machine-readable metadata are not very human-friendly, so I also like text files for quick browsing. It really depends on what your objectives are and what information you need to track.

If you want your data to be FAIR (Findable, Accessible, Interoperable, Reusable), you need standardized metadata. But developing a good, sustainable workflow for creating and updating it takes a bit of work. You could see about leveraging the Stanford Geospatial Center's workflow for this.

The first step is definitely documentation of any type! It can always be translated into a different format. A lot of the geospatial metadata fields are easy to derive directly from the data itself, but the hard stuff is the science/processing/rationale behind the data and the attributes, so capturing that while it is fresh is most important.

USGS and FGDC have a lot of resources on geospatial metadata. There is a command line metadata parser (mp) that was developed by USGS, but it may only parse the older FGDC standard and not the current ISO standard. https://geology.usgs.gov/tools/metadata/. There are probably some Python and R packages that will also help with this process.

@jagoldstein probably has more experience with this as well, from his time at NCEAS

jagoldstein commented 3 years ago

@davemfish where do the metadata ReadMes that Stacie created live? I don't believe that they are included in the downloadable sample data.

@cybersea Yes, I understand the arguments for using a standardized format like ISO or EML, but I agree with Dave that it may be overly cumbersome for our objective. Why convert a metadata spreadsheet to XML if we will just want to convert that back to a more human-readable format? At NCEAS it took quite a lot of dev resources to develop and maintain websites (data portals) that rendered XML in human-readable forms, and they always had (have) problems and were a huge PITA.

I tried to attach an example of a .xml file here, but the format is not supported, so perhaps XML is not as FAIR as is claimed. But, refer to this example and then compare it to its XML file that is used to populate the "human-readable" landing page. I don't advise going down this road at this stage, and I question whether we are even willing to devote the necessary resources to doing so. I fully support standardized documentation, but using XML sounds like more of a barrier than a solution for us right now. That said, it is a worthy conversation and a longer-term goal that should not be discounted.

cybersea commented 3 years ago

I hear you @jagoldstein.

My comments about XML and geospatial metadata standards were in response to the original question about available standard, machine-readable metadata formats. I'm not advocating for a particular approach, but adding to the discussion about formats and potential resources to help inform the decision about what approach to take.

jagoldstein commented 3 years ago

Ha, sorry if I overreacted @cybersea! I think I have some lingering XML trauma. In response to @phargogh's Q re: machine-readable geospatial metadata formats, I concur with Allison's answers and would refer you to the same ISO standard used by FGDC.

davemfish commented 3 years ago

> @davemfish where do the metadata ReadMes that Stacie created live? I don't believe that they are included in the downloadable sample data.

@jagoldstein they should be there for these six models:

λ find . -type f -name "_READ*"
./Annual_Water_Yield/_README_InVEST_Annual_Water_Yield_model_data.txt
./DelineateIt/_README_InVEST_DelineateIt_data.txt
./NDR/_README_InVEST_NDR_model_data.txt
./RouteDEM/_README_InVEST_RouteDEM_data.txt
./SDR/_README_InVEST_SDR_model_data.txt
./Seasonal_Water_Yield/_README_InVEST_Seasonal_Water_Yield_model_data.txt

davemfish commented 3 years ago

Great insights, everyone. It does sound like a standard machine-readable format might be at odds with the original objective outlined by Emily: to better communicate to users where the data came from and whether it is "real" or completely fabricated.

We might have other problems that would be solved by creating machine-readable metadata, but we should start by identifying them before jumping in.

I find JSON to be a nice middle ground of human- and machine-readable. Not sure if any of the standards allow it, but JSON has certainly replaced XML in so many other contexts in recent years.
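
Just to illustrate what I'm picturing, here's a hypothetical sketch of a tiny JSON sidecar per dataset; the field names and paths are made up, not taken from any standard:

import json

# every value here is a placeholder; the point is the shape of the record
provenance = {
    'dataset': 'HabitatQuality/lulc.tif',   # hypothetical path
    'source': 'TODO: original source, or "fabricated"',
    'real_data': False,                     # True only if traceable to a real source
    'processing_notes': 'TODO: edits made to the original data',
    'intended_use': 'for illustrative purposes only',
}

with open('HabitatQuality/lulc.tif.json', 'w') as f:
    json.dump(provenance, f, indent=2)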

newtpatrol commented 3 years ago

For what it's worth, I chose to create a simple text README file to make it easiest for the largest number of people to see the metadata. Since these are relatively easy to create, I'd suggest doing something similar in the short term for describing the rest of the sample data.

I do concur that it is optimal to also have metadata embedded in the data (and I am thankful when someone whose data I'm using has done so) yet, in my entire time at NatCap, I have never created any. A while back, there was talk of including a few pieces of embedded metadata as part of creating our InVEST outputs, but that faded.

jagoldstein commented 3 years ago

@davemfish thanks for pointing me to the 6 models that have these ReadMe*.txt files in their sample data. I see them now and agree that they are quite helpful. I also see ./CropProduction/model_data/README.md which describes subfolder contents.

cybersea commented 3 years ago

Having a text-based README file and standardized/structured metadata are not mutually exclusive options. Top priority is to document the essential information before it is gone (from your brain, or before the person whose brain it's in is gone), so hats off to @newtpatrol for doing this.

The standards can help you to understand what types of information are most critical to capture for other people to be able to use your data. And, a text file is machine-readable if it is formatted consistently. You can convert existing metadata/docs to another format if/when it seems helpful or required. Personally, I don't think the standard or the encoding (XML, JSON) is as important as the content.

There are some existing JSON-based standards, such as DCAT, which Data.gov uses:
https://resources.data.gov/resources/data-gov-open-data-howto/ https://resources.data.gov/resources/dcat-us/

Here are a couple other potentially helpful resources: https://www.openaire.eu/how-to-make-your-data-fair https://geopython.github.io/pygeometa/
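
For instance, pygeometa's workflow is to fill out a YAML "metadata control file" (the human-editable part) and render it to ISO XML. A rough sketch of its Python API, paraphrased from its docs (filenames hypothetical; double-check against the current version):

from pygeometa.core import read_mcf
from pygeometa.schemas.iso19139 import ISO19139OutputSchema

# the YAML metadata control file holds the fields people actually fill in
mcf_dict = read_mcf('sample_data_metadata.yml')

# render it to ISO 19139 XML
iso_xml = ISO19139OutputSchema().write(mcf_dict)

with open('sample_data_metadata.xml', 'w') as f:
    f.write(iso_xml)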

emlys commented 3 years ago

Thanks everyone for all the good discussion and debate! I agree that while the original intent of this ticket was to produce human-readable documentation, the machine-readable metadata will have an important role too. I haven't gone through the metadata standards yet, but I'm guessing they're not the best fit for some of the information we might want to include, such as how each dataset relates to InVEST and whether it's appropriate for real analyses.

Some free-form paragraphs in a README type file would be a good format for this. On the other hand it sounds like the metadata is a good place for info that's more specific to the dataset itself, and less about its relationship with InVEST. So maybe they can coexist with a little overlap?

emlys commented 2 years ago

This is probably not worth doing retroactively for all models. We could add a README in the sample data repo to catalog what is known about them. But tracking down all the model creators and asking them would be a lot of work. Let's aim to have a higher standard for documenting future models' data.

newtpatrol commented 2 years ago

Agreed that it is not worth doing retroactively, at least for the oldest models (Carbon, HQ, etc), where it's probably impossible. If there are more recent models where we're still in touch with the science leads (e.g. Urban Cooling and Flood), those would be good to pursue. For those old models where we have no intel, it would be great to still include a README/metadata that says something like "no clue where this came from, do not use it for your own analysis".

emlys commented 2 years ago

Just noting that a user asked about a mystery table in the CBC sample data: https://community.naturalcapitalproject.org/t/blue-carbon-global-idb-table-saltmarshsoil-tab-calculation/2707