luomus / finbif2gbif

A software bridge between FinBIF and GBIF

Create an algorithm to infer coordinateUncertaintyInMeters for all units #8

Open wkmor1 opened 1 week ago

wkmor1 commented 1 week ago

Given any Unit in the datawarehouse get the correct value of dwc:coordinateUncertaintyInMeters (value should be NULL when appropriate).

Currently we are using the value of "gathering.interpretations.coordinateAccuracy" which is quite different and has drifted further over time. Before we send data to GBIF we remove values of "gathering.interpretations.coordinateAccuracy" that have been set to 1 and come from Kotka. This fixes some of the discrepancy but far from all.

To complete this subtask create an algorithm in (pseudo)code that takes all the available information about a unit and returns the best possible value of coordinateUncertaintyInMeters or returns NULL if no such value exists. Input data will include: geographic information, source of data, verbatim information fields, source of coordinates, data restriction (obfuscation etc) and other stuff? All these data must be available from the unit/list or /collection API endpoints.

wkmor1 commented 3 days ago

Property definition from https://dwc.tdwg.org/list/#dwc_coordinateUncertaintyInMeters

"The horizontal distance (in meters) from the given dwc:decimalLatitude and dwc:decimalLongitude describing the smallest circle containing the whole of the dcterms:Location. Leave the value empty if the uncertainty is unknown, cannot be estimated, or is not applicable (because there are no coordinates). Zero is not a valid value for this term."

AlpoTurunen commented 3 days ago

Here you are. This algorithm calculates the smallest bounding circle for each geometry in a dataset, where the radius can represent the coordinate uncertainty. The calculate_minimum_circle.py file is written in Python but structured in a way that resembles pseudocode.

calculate_minimum_circle.zip

wkmor1 commented 3 days ago

Ok so this applies to some of the units in the data-warehouse I suppose, but surely not all? But it seems a little over complicated even for the units that have geometries? We already have a centroid calculated for all units so the radius is just the maximal distance from centroid to the vertices isn't it? I want an algorithm that covers as many cases as possible including the ones where the answer is necessarily NULL. The algorithm should specifically reference the properties that we have in our data. And specifically address the data coming from different systems (inat, vihko, kotka etc) and collections.
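The centroid-based shortcut described above can be sketched as follows. This is a minimal illustration, assuming the centroid and vertices are already in a projected coordinate system measured in metres, so that plain Euclidean distance applies. The function name `radius_from_centroid` and the tuple-based inputs are hypothetical and not part of any FinBIF API.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (easting, northing) in metres, by assumption


def radius_from_centroid(centroid: Point, vertices: Sequence[Point]) -> Optional[float]:
    """Largest distance in metres from the centroid to any vertex, or None."""
    if not vertices:
        return None  # no geometry to measure
    radius = max(math.hypot(x - centroid[0], y - centroid[1]) for x, y in vertices)
    # Darwin Core states that zero is not a valid value, so report None instead.
    return radius if radius > 0 else None
```

A circle of this radius around the centroid always contains every vertex, but it is not necessarily the smallest circle that does.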

Also why is this in a zip file? Couldn't you just paste it in a comment and mark it up as code?

wkmor1 commented 3 days ago

If you are going to write an actual algorithm in python you could start with a random set of occurrences and calculate the correct value of dwc:coordinateUncertaintyInMeters manually. Then use that set as unit tests. And finally run it on a second set as validations.
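One possible shape for that workflow is sketched below. Everything here is a placeholder: the field names `hasCoordinates` and `accuracy`, the two example units, and `placeholder_infer` are invented for illustration and do not come from the data warehouse. The real cases would be occurrences whose expected values were worked out by hand.

```python
from typing import Callable, Dict, List, Optional, Tuple

Unit = Dict[str, object]
Case = Tuple[Unit, Optional[int]]  # (unit, hand-calculated expected value)


def check_cases(
    infer: Callable[[Unit], Optional[int]], cases: List[Case]
) -> List[Tuple[Unit, Optional[int], Optional[int]]]:
    """Return (unit, expected, actual) for every case where infer disagrees."""
    failures = []
    for unit, expected in cases:
        actual = infer(unit)
        if actual != expected:
            failures.append((unit, expected, actual))
    return failures


# Invented placeholder cases; real ones would be checked manually.
example_cases: List[Case] = [
    ({"id": "unit-1", "hasCoordinates": False}, None),
    ({"id": "unit-2", "hasCoordinates": True, "accuracy": 100}, 100),
]


def placeholder_infer(unit: Unit) -> Optional[int]:
    """Stand-in for the real algorithm, present only so the sketch runs."""
    if not unit.get("hasCoordinates"):
        return None
    return unit.get("accuracy")
```

The same `check_cases` helper could be run first on the hand-checked test set and then on a second, independent validation set.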

AlpoTurunen commented 2 days ago

It applies to all units with geometry, the only group of observations for which coordinate uncertainty can be estimated. This is one of the best algorithms for calculating minimum bounding circles (and radius), which was required in the dwc standard. The method you explained is easier to implement and understand, but it doesn't work well if vertices are unevenly distributed (as they usually are). The centroid is not the same and circles are generally too big.
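A small made-up example illustrates the point about unevenly distributed vertices. It assumes planar coordinates in metres and takes the "centroid" to be the mean of the vertices, which may differ from how the data warehouse computes its centroid. The second circle uses the farthest pair of vertices as its diameter; that is not a general minimum-bounding-circle algorithm, but for this particular point set it contains every vertex and is therefore the minimum.

```python
import math
from itertools import combinations


def max_distance(centre, points):
    """Largest Euclidean distance from centre to any point."""
    return max(math.hypot(x - centre[0], y - centre[1]) for x, y in points)


# Four vertices clustered near the origin and one far to the east.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 0.0)]

# Centre taken as the mean of the vertices (an assumption, see above).
mean_centre = (
    sum(x for x, _ in points) / len(points),
    sum(y for _, y in points) / len(points),
)
mean_radius = max_distance(mean_centre, points)

# Circle whose diameter is the farthest pair of vertices.
a, b = max(combinations(points, 2), key=lambda pair: math.dist(pair[0], pair[1]))
pair_centre = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
pair_radius = math.dist(a, b) / 2

# In general this circle can miss vertices; here it covers all of them.
covers_all = max_distance(pair_centre, points) <= pair_radius + 1e-9
```

Here the mean is pulled towards the cluster, so the centroid-based radius comes out at roughly 7.61 against roughly 5.02 for the smaller circle.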

What is the difference between different data sources in this case? You can compare the circle diameter to the coordinateAccuracy field and select the bigger one. Are there some other fields I should take into account? :)

AlpoTurunen commented 2 days ago

The difference between the methods is not always big: [image]

wkmor1 commented 2 days ago

You need to reread the above definition. You definitely do not need to have any geometry. The actual geometry can be completely unknown. All you need to know is the radius of a circle that would contain the geometry. By definition you don't even need to know the point location of the circle (apart from the caveat about non-applicability). It is the smallest circle not the smallest possible circle. For example the true geometry might be a 10x10m square but the point location given for whatever reason is 300m away but also not known; and for a set of particular reasons the coordinateUncertaintyInMeters is 500.

There are lots of differences between all our data sources. This task is to find out what those differences are and how they are relevant to calculating this property.

AlpoTurunen commented 1 day ago

What is the point of calculating coordinate uncertainties for units without geometry? The definition says "Leave the value empty if the uncertainty ... is not applicable (because there are no coordinates)".

I'm sorry I didn't fully understand your explanation... Do you mean something like this:

    lat_precision = number of decimals in latitude coordinate
    lon_precision = number of decimals in longitude coordinate
    precision = min(lat_precision, lon_precision)

    # Approximate uncertainty based on precision
    if WGS84:
        if precision == 1:
            return 11000 # Because 1 degree ~ 111 km -> 1 decimal is 11 km
        elif precision == 2:
            return 1100  
        elif precision == 3:
            return 110
        elif precision == 4:
            return 11 
        elif precision >= 5:
            return 1  
    elif YKJ:
        if precision == 1:
            return 10000 
        elif precision == 2:
            return 1000  
        elif precision == 3:
            return 100  
        elif precision == 4:
            return 10 
        elif precision >= 5:
            return 1 
    else:
        return NULL

... 

    if "generated automatically" in some verbatim fields:
        return something
....

    if coordinate source == municipality:
        return the diameter of municipality
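The WGS84 branch of the pseudocode above could be made runnable roughly as follows. This is only a sketch of the precision-based idea in the question, not an agreed method. It assumes coordinates arrive as strings (numeric types drop trailing zeros), uses a rough 111 km per degree of latitude, and the names `decimal_places` and `precision_uncertainty` are hypothetical. Rounding is to the nearest metre, so the results differ slightly from the rounded figures in the pseudocode.

```python
from typing import Optional

METRES_PER_DEGREE = 111_000  # rough length of one degree of latitude


def decimal_places(value: str) -> int:
    """Number of digits after the decimal point in a coordinate string."""
    _, _, fraction = value.strip().partition(".")
    return len(fraction)


def precision_uncertainty(lat: str, lon: str) -> Optional[int]:
    """Uncertainty in metres implied by the coarser of two WGS84 strings."""
    if not lat.strip() or not lon.strip():
        return None  # no coordinates, so the term is not applicable
    precision = min(decimal_places(lat), decimal_places(lon))
    # One unit in the last decimal place spans this many metres of latitude.
    metres = METRES_PER_DEGREE / 10 ** precision
    return max(1, round(metres))  # zero is not a valid value
```

A degree of longitude is shorter than a degree of latitude at Finnish latitudes, so using the latitude figure for both axes errs on the large side for the east-west component.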

wkmor1 commented 1 day ago

I've moved this to the icebox for now. Can be picked up and worked on anytime