microbiomedata / issues

public repo for issues related to NMDC work

how to handle abbreviations in geo_loc_name in schema and data portal #398

Open aclum opened 10 months ago

aclum commented 10 months ago

Is your feature request related to a problem? Please describe. Something that came up when getting ready for ESA is that some geo_loc_names use a country or state name written out and some use an abbreviation. When this data is then searched in the data portal it doesn't treat these values as equivalent. If there are values for 'AK' and 'Alaska' those have to be searched and selected for separately.

Describe the solution you'd like A decision on business rules about how and where this will be handled (i.e., by convention abbreviate country but not state/province; ingest code translates this so records in mongo are consistent). The ideal solution is that the data would be represented consistently throughout NMDC infrastructure.

Describe alternatives you've considered Leave data as a mix of spelled out and abbreviations.

Acceptance Criteria Create a checklist or scenario-based acceptance criteria, from the user's perspective, that answers the following:

Who will use this feature/enhancement? Data portal users

When will they use it? When searching by a geographic name

How will they use it? This will create a better search experience when searching by a geographic name.

How will they test it to make sure it's working? A search for Alaska/AK includes samples from both 'Fairbanks' and 'Denali'

Is the request achievable? During one sprint? TBD

What is your definition of done for this request?

@naglepuff @jeffbaumes @turbomam @mslarae13 @shreddd please weigh in here, as this will determine if we discuss this further in the metadata meeting or the infrastructure meeting.

mslarae13 commented 10 months ago

My preference: since we have no way of validating via the submission portal, we should create a map and correct the metadata values.

So, when @pkalita-lbl's script that goes from subport to mongo (this needs a name) runs, if anything provided in this slot matches anything we've designated, it'll be corrected/replaced... so if a user puts AK, it'll correct that to Alaska.

For things like the MANY ways Puerto Rico is written, we should decide how this should be displayed and do the same. Think of as many of these as possible for the script correction.

Interim fix until we have some kind of tool that, when you add lat_lon, can infer/add geo_loc_name programmatically, and it's no longer an issue! (This will be great for elevation, climate, temperature, etc.)
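A minimal sketch of the map-and-correct idea above, in Python. The table contents, the function names, and the choice to correct only the country and state parts (leaving the locality alone) are assumptions for illustration; this is not the actual subport-to-mongo script.

```python
# Hypothetical lookup table: lower-cased variants -> agreed display value.
# A real table would need to cover every state, province, and country variant.
CANONICAL_NAMES = {
    "ak": "Alaska",
    "alaska": "Alaska",
    "pr": "Puerto Rico",
    "p.r.": "Puerto Rico",
    "puerto rico": "Puerto Rico",
    "us": "USA",
    "usa": "USA",
    "united states": "USA",
    "united states of america": "USA",
}


def canonical(value: str) -> str:
    """Collapse whitespace, then replace the value if it is a known variant."""
    cleaned = " ".join(value.split())
    return CANONICAL_NAMES.get(cleaned.lower(), cleaned)


def normalize_geo_loc_name(raw: str) -> str:
    """Correct the country and state parts of 'country: state, locality'."""
    country, colon, rest = raw.partition(":")
    country = canonical(country)
    if not colon or not rest.strip():
        return country
    region, _, locality = rest.partition(",")
    result = f"{country}: {canonical(region)}"
    # The locality is only tidied, not mapped, so a place that happens to be
    # named "PR" or "AK" is not rewritten by accident.
    locality = " ".join(locality.split())
    if locality:
        result += f", {locality}"
    return result
```

With a table like this, `normalize_geo_loc_name("USA: AK, Fairbanks")` and `normalize_geo_loc_name("United States:Alaska,Fairbanks")` both give `USA: Alaska, Fairbanks`, so a portal search would see a single value.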

turbomam commented 10 months ago

+1 for inferring as many of these fields as possible from one that we really trust.

mslarae13 commented 10 months ago

@turbomam IDK if we've checked for discrepancies between lat_lon and geo_loc_name. My point here is to stop asking people to fill out geo_loc_name... ask them to just provide lat_lon & we programmatically fill out additional information (geo_loc_name, elevation, temp, etc.)

'the inference will depend on Stan or other Oak Ridge people.' what? why?

turbomam commented 9 months ago

Thanks for keeping this discussion alive, @mslarae13. I really appreciate your efforts to decrease the burden on submitters and to obtain more standardized metadata.

An advantage of continuing to ask for lat_lon, geo_loc_name and elev is that the three can be checked against one another. If they aren't congruent, it's a flag for followup with the submitter. But I'm not saying that advantage should outweigh the other goals.

I checked for geospatial discrepancies in NCBI soil metagenome metadata and reported the results in my presentation in Bangkok.

I started a submission-schema branch that checks for similar discrepancies in either any MongoDB production biosample_set collection or in the submissions from the nmdc-server/SubmissionPortal API endpoint, by using a Google $$$ endpoint. If this is useful we can make it an issue for me or somebody else to continue it.

For example, the lat_lon_diff column in https://github.com/microbiomedata/submission-schema/blob/issue-151-geocoding/mongodb_lat_lon_diff.tsv shows the distance in meters between the lat_lon asserted by the submitter and the latitude and longitude inferred by geocoding the asserted geo_loc_name. Some of the distances are quite high. Maybe that's mostly because the geo_loc_name was vague and got misinterpreted.
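For readers who want to reproduce a distance like this, here is a rough sketch using the haversine formula. It is not necessarily how the branch computes lat_lon_diff; the function names and the 50 km follow-up threshold are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def needs_followup(asserted, geocoded, threshold_m: float = 50_000) -> bool:
    """True if the asserted and geocoded (lat, lon) pairs are far apart.

    threshold_m is an arbitrary illustrative cutoff, not an agreed value.
    """
    return haversine_m(*asserted, *geocoded) > threshold_m
```

A vague geo_loc_name (a whole state, say) will geocode to a centroid that can be hundreds of kilometers from the sampling site, so any threshold would probably need to depend on how specific the name is.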

turbomam commented 9 months ago

I mentioned Stan and Oak Ridge because they are responsible for developing the API with which we provide a latitude and longitude and can get in return a place name, the elevation, and potentially soil type, etc.

Partial list of related issues. Some of these are old, but I believe the work is still in progress.

turbomam commented 9 months ago

If we do decide to request lat_lon from submitters, and infer (but not request) geo_loc_name, elev, etc., then we will need to improve our documentation and validation for lat_lon.

It should be really clear that lat_lon is meant to gather the precise geospatial location from which each sample was collected. It shouldn't be determined after the fact by clicking on Google Maps, near the place where the researcher remembers having collected the samples.

Optional details about GPS accuracy and `lat_lon` decimal places

If I was going to really go overboard, I would say that we should request reporting of the accuracy or DOP of the latitude or longitude. All handheld GPS receivers and many cell phone apps can report that.

As an alternative to that obsessiveness, we could set some guidelines for the number of decimal places that should be reported for the latitude and the longitude. That won't solve the case in which the GPS hardware is in a low-precision state, but it could improve the case in which the submitter truncates the digits for some reason.

Here's an [article about the "accuracy" of a latitude and longitude](https://support.garmin.com/en-US/?faq=hRMBoCTy5a7HqVkxukhHd8), depending on the number of digits. When using an accurate GPS device, 5 digits corresponds to an "accuracy" of ~ 1 meter. About 25% of the `Biosamples` in MongoDB have 4 or fewer digits now.
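A decimal-places guideline like that could be checked with something as small as the sketch below. It assumes `lat_lon` arrives as a string of two space-separated decimal-degree numbers; the function names and the default of 5 places are illustrative, not existing NMDC validation.

```python
def decimal_places(coordinate: str) -> int:
    """Count digits after the decimal point, working on the text as submitted.

    Parsing to float first would lose trailing zeros, which do carry
    information about the reported precision.
    """
    _, _, fraction = coordinate.strip().partition(".")
    return len(fraction)


def lat_lon_precision(lat_lon: str) -> int:
    """Smaller of the two decimal-place counts in a 'lat lon' string."""
    lat, lon = lat_lon.split()
    return min(decimal_places(lat), decimal_places(lon))


def meets_guideline(lat_lon: str, min_places: int = 5) -> bool:
    """True if both coordinates report at least min_places decimal places."""
    return lat_lon_precision(lat_lon) >= min_places
```

For example, `meets_guideline("64.8378 -147.7164")` is false (4 places), while `meets_guideline("64.83780 -147.71640")` is true.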

turbomam commented 9 months ago

Reverse geocoding (converting a lat_lon to a geo_loc_name) may be tricky, at least with the reverse geocoders I'm familiar with, including Google's geocoding API. Many of these APIs return many geospatial features for the specified latitude and longitude. Sometimes the result includes a geographical feature, but often it is more like a postal address.

https://geocode.maps.co/ is a free alternative, but only processes two requests per second max. That might be OK for embedding live in the SubmissionPortal, but it wouldn't be desirable for bulk conversion. It returns fewer place types, but maybe that is preferable.
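To make the bulk-conversion concern concrete, here is a hedged sketch of a client-side throttle that keeps calls under a two-per-second limit. The `lookup` callable stands in for whichever reverse geocoder is chosen; no real API is called, and all names here are hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def reverse_geocode_all(coords, lookup, throttle: Throttle) -> list:
    """Call lookup(lat, lon) for each pair, pausing to respect the limit."""
    results = []
    for lat, lon in coords:
        throttle.wait()
        results.append(lookup(lat, lon))
    return results
```

At two requests per second, 10,000 biosamples would take well over an hour, which is the reason a rate-limited service looks better suited to live use in the SubmissionPortal than to bulk runs.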

aclum commented 9 months ago

How much guidance is there from the GSC on abbreviations, order, and specificity? I noticed when looking at some data in GOLD that the geo_loc_name is listed in the opposite order. That is, the example in MIxS 6 in the Excel spreadsheet and the LinkML representation is USA: Maryland, Bethesda, but GOLD typically has city before state, so it would be USA: Bethesda, Maryland. Are we allowing country: state only? Currently the pattern validation in nmdc-schema doesn't like this (i.e., USA: Alaska does not validate with the soil DH interface).

turbomam commented 9 months ago

There is no actionable guidance about anything from MIxS 6.6.1 and it shows in the NCBI Biosample records.

In NMDC, we have imposed the requirement for three strings, separated with a colon, then a comma. The intention is to collect a nation/ocean part, a first-order administrative division (aka state) part, and a local part. So country: state alone is not currently acceptable. And USA: Bethesda, Maryland clearly violates the intention.
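As an illustration of the three-part shape only, a regular expression along these lines would accept it. This is a sketch written for this comment, not the pattern that nmdc-schema actually uses; the group names are made up.

```python
import re

# Illustrative pattern: "<nation or ocean>: <state>, <locality>".
# Assumes exactly one space after the colon and after the first comma.
GEO_LOC_NAME = re.compile(
    r"^(?P<country>[^:,]+): (?P<region>[^:,]+), (?P<locality>.+)$"
)


def has_three_parts(value: str) -> bool:
    """True if the value has the colon-then-comma, three-part shape."""
    return GEO_LOC_NAME.match(value) is not None
```

`has_three_parts("USA: Maryland, Bethesda")` is true and `has_three_parts("USA: Alaska")` is false. Note that `USA: Bethesda, Maryland` also passes: a pattern can enforce the shape but cannot tell that the state and locality are swapped, which is one argument for checking against lat_lon or a gazetteer.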

Should the nation part always be spelled out or sometimes capitalized? NCBI provides this list and attributes it to INSDC.

Can the local part be the name of a municipality, a building or institution, or a natural feature? There's no guidance on that, just the MIxS example 'USA: Maryland, Bethesda'

I like @mslarae13's suggestion to require lat_lon and then to use a service to look up things like elev and geo_loc_name. The question is do we want to do that live by adding some callback function to DataHarmonizer, or should we do it in bulk after the fact? Many geocoders provide multiple mappings from a latitude and a longitude, and I bet we can't anticipate how the submitters would want their geo_loc_names to appear (like municipality name vs natural feature name in the third position). So from that perspective it would be better to get a few suggestions and display them in DataHarmonizer for the submitter to confirm.

I'm looking forward to @pkalita-lbl's input too.

turbomam commented 9 months ago

Note that NCBI has their own take on the necessity for the second and third components of the geo_loc_name

Value format    "<country_value>[:<region>][, <locality>]" where
                country_value is any value from the controlled vocabulary at
                http://www.insdc.org/country.html
Example         /country="Canada:Vancouver"
                /country="France:Cote d'Azur, Antibes"
                /country="Atlantic Ocean:Charlie Gibbs Fracture Zone"

They also don't allow a space between the first component, the colon, and the (optional from their perspective) second component.
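For comparison, a lenient parser for the value format quoted above might look like the sketch below. The `GeoLoc` type and function name are hypothetical, and how a comma without a colon should be read is a guess from the format string rather than anything NCBI documents here.

```python
from typing import NamedTuple, Optional


class GeoLoc(NamedTuple):
    country: str
    region: Optional[str]
    locality: Optional[str]


def parse_insdc_country(value: str) -> GeoLoc:
    """Split '<country_value>[:<region>][, <locality>]' into its parts."""
    country, _, rest = value.partition(":")
    if not rest.strip():
        # No region given; a trailing ', locality' may still follow the country.
        country, _, locality = country.partition(",")
        return GeoLoc(country.strip(), None, locality.strip() or None)
    region, _, locality = rest.partition(",")
    return GeoLoc(country.strip(), region.strip() or None,
                  locality.strip() or None)
```

Applied to the examples above, `"Canada:Vancouver"` yields a region but no locality, and `"France:Cote d'Azur, Antibes"` yields all three parts, which shows how the same slot can arrive with one, two, or three components.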

The requirements from EBI/ENA are probably even a little different from MIxS or NCBI, and probably more computable. See their soil checklist.

aclum commented 9 months ago

At least for the soil package EBI/ENA only requires the country or sea, leaving the region and locality as optional. NCBI also makes region and locality optional.