mayrajeo / geo2ml

Python library and module for converting earth observation data to be suitable for machine learning models
https://mayrajeo.github.io/geo2ml/
Apache License 2.0

Converting GeoJSON to COCO annotations, purpose of target_column argument? #12

Open fangzp opened 1 week ago

fangzp commented 1 week ago

Hello Janne! Thanks for creating such a useful tool for those of us working on machine learning for geospatial data.

Below is a sample of what I consider a fairly standard GeoJSON annotation file for targets of interest (here, polygon annotations of agricultural fields), converted from a gpkg with GDAL's ogr2ogr. The 'id' is the ID of the AOI/image tile, which corresponds to a GeoTIFF named 0_vietnam.tif. I'm trying to understand the purpose of the target_column argument of the create_coco_dataset function, because I can't get it to work properly: setting target_column='country', ='id', or ='geometry' each results in a KeyError when I run the command over the entire dataset. The examples_coco.ipynb example notebook is a little unclear to me, since I don't know what 'layer' is in the shapefiles, and I'm not sure whether the files it loads (104_28_Hiidenportti_Chunk1_orto.geojson, etc.) have been uploaded to the repo yet.

What should I be doing instead in this case?

{
"type": "FeatureCollection",
"name": "reference",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::32648" } },
"features": [
{ "type": "Feature", "properties": { "id": 0, "country": "vietnam", "_predicate": "INTERSECTS" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 615463.371700000017881, 2302274.039799999445677 ], [ 615463.603799999691546, 2302274.169700000435114 ], [ 615502.078999999910593, 2302297.654400000348687 ], [ 615534.710699999704957, 2302241.427500000223517 ], [ 615457.650700000114739, 2302194.133099999278784 ], [ 615454.343399999663234, 2302200.747700000181794 ], [ 615449.713200000114739, 2302198.10190000012517 ], [ 615419.286000000312924, 2302249.365000000223517 ], [ 615463.371700000017881, 2302274.039799999445677 ] ] ] } },

Example function call (which results in a KeyError):

for img in train_images:
    print(img)
    shp = path_to_data/'reference_jsons'/f'{img.stem}_areas.json'
    print(shp)
    create_coco_dataset(raster_path=img, polygon_path=shp, target_column='country',
                        outpath=outpath/'train', output_format='gpkg', save_grid=False, allow_partial_data=True,
                        dataset_name=f'{img.stem}_train', gridsize_x=764, gridsize_y=750, 
                        ann_format='polygon', min_bbox_area=8)

mayrajeo commented 1 week ago

Hi,

The purpose of target_column is to indicate which column contains the values used to populate the categories field in the resulting coco.json. In your case, for instance, the categories would be the country names. In the example data, the input shapefiles contain many more columns with more or less useful information, such as whether the annotation is in a managed or a conserved forest, so the column layer is specified there.
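As a rough illustration of that mapping (not geo2ml's actual implementation), the sketch below builds a COCO-style categories list from the unique values of one property across GeoJSON features. The helper name categories_from_features is hypothetical, and the sorted ordering with ids starting at 1 is an assumption.

```python
def categories_from_features(features, target_column, supercategory="object"):
    """Build a COCO-style categories list from one GeoJSON property.

    Hypothetical helper for illustration only; geo2ml may order or
    number its categories differently.
    """
    # Collect the distinct values of the chosen property.
    # In this sketch, a feature without that property raises KeyError.
    names = sorted({str(f["properties"][target_column]) for f in features})
    # Assumption: category ids start at 1 and follow the sorted names.
    return [
        {"supercategory": supercategory, "id": i, "name": name}
        for i, name in enumerate(names, start=1)
    ]


# Two minimal features that share the same 'country' value.
features = [
    {"type": "Feature", "properties": {"id": 0, "country": "vietnam"}, "geometry": None},
    {"type": "Feature", "properties": {"id": 1, "country": "vietnam"}, "geometry": None},
]
categories = categories_from_features(features, "country")
print(categories)
```

With a column like 'country', every file that holds a single country yields exactly one category, named after that country.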

I quickly tested the following two examples with the CLI functions:

geo2ml_create_coco_dataset 0_vietnam.tif 0_vietnam_areas.gpkg country test vietnam_fields --gpkg_layer reference

and after converting the 0_vietnam_areas.gpkg to geojson:

geo2ml_create_coco_dataset 0_vietnam.tif 0_vietnam_areas.geojson country testgeoj vietnam_fields

and both seemed to work aside from some pyogrio-related warnings. Can you provide a bit more information about the errors so I can fix them and also possibly clarify the examples?

fangzp commented 1 week ago

Thanks for the prompt response! Ultimately, what I would like the categories to record is 'ag field': a single-category annotation file in which every polygon has a category_id of 1, with a category name like 'agfields_singleclass' and, if necessary, a 'supercategory' like 'AgriculturalFields'. What I'm trying to predict is not the name of the country but the presence of an agricultural field. I only chose 'country' as the target_column because I wasn't sure what else to use. It does work fine for a single file, but running the above over all of the GeoJSON files (other files have features whose 'country' property is exclusively 'cambodia', for example) results in the KeyError described in the original post.

To get the desired behavior, I tried adding an extra property to each of the features in the GeoJSON file (something like the below):

{"type": "FeatureCollection", "name": "reference", "crs": {"type": "name", "properties": {"name": "urn:ogc:def:crs:EPSG::32648"}}, "features": [{"type": "Feature", "properties": {"id": 0, "country": "vietnam", "_predicate": "INTERSECTS", "ag_field": 1}, "geometry": {"type": "Polygon", "coordinates": [[[615463.3717, 2302274.0397999994], [615463.6037999997, 2302274.1697000004], [615502.0789999999, 2302297.6544000003], [615534.7106999997, 2302241.4275], [615457.6507000001, 2302194.1330999993], [615454.3433999997, 2302200.7477], [615449.7132000001, 2302198.1019], [615419.2860000003, 2302249.365], [615463.3717, 2302274.0397999994]]]}}, ...

but when I run the snippet below I get AttributeError: 'Pandas' object has no attribute 'ag_field'.

for img in train_images:
    shp = path_to_data/'reference_jsons'/f'{img.stem}_areas.json'
    create_coco_dataset(raster_path=img, polygon_path=shp,
                        target_column='ag_field',
                        outpath=outpath/'train', output_format='geojson', save_grid=False, allow_partial_data=True,
                        dataset_name=f'{img.stem}_train', gridsize_x=764, gridsize_y=750, 
                        ann_format='polygon', min_bbox_area=8)

mayrajeo commented 1 week ago

I see, I thought that might be what you wanted to do.

Your result seems a bit odd, though: it looks like the files you are passing as polygon_path do not contain the modified properties. I tested this with the following data:

{
    "type": "FeatureCollection",
    "name": "0_vietnam_areas_agfield",
    "crs": {
        "type": "name",
        "properties": {
            "name": "urn:ogc:def:crs:EPSG::32648"
        }
    },
    "features": [
        {
            "type": "Feature",
            "properties": {
                "id": 0,
                "country": "vietnam",
                "_predicate": "INTERSECTS",
                "landcover": "ag_field"
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            615463.371700000017881,
                            2302274.039799999445677
                        ],
                        [
                            615463.603799999691546,
                            2302274.169700000435114
                        ],
                        [
                            615502.078999999910593,
                            2302297.654400000348687
                        ],
                        [
                            615534.710699999704957,
                            2302241.427500000223517
                        ],
                        [
                            615457.650700000114739,
                            2302194.133099999278784
                        ],
                        [
                            615454.343399999663234,
                            2302200.747700000181794
                        ],
                        [
                            615449.713200000114739,
                            2302198.10190000012517
                        ],
                        [
                            615419.286000000312924,
                            2302249.365000000223517
                        ],
                        [
                            615463.371700000017881,
                            2302274.039799999445677
                        ]
                    ]
                ]
            }
        },
...

from geo2ml.scripts.data import create_coco_dataset
from pathlib import Path

create_coco_dataset(
    raster_path='0_vietnam.tif', 
    polygon_path='0_vietnam_areas_agfield.geojson',
    target_column='landcover',
    outpath=Path('testp'),
    output_format='geojson',
    save_grid=False,
    allow_partial_data=True,
    dataset_name='example_train',
    gridsize_x=764,
    gridsize_y=750,
    ann_format='polygon',
    min_bbox_area=8
)

and aside from a few warnings from pyogrio ("RuntimeWarning: Several features with id = 0 have been found. Altering it to be unique. This warning will not be emitted anymore for this layer") it worked fine, producing the following categories in the COCO-style JSON file:

    "categories": [
        {
            "supercategory": "object",
            "id": 1,
            "name": "ag_field"
        }
    ],

However, I did find a possibly related bug: when the target column contains integers instead of strings, the conversion fails with TypeError: Object of type int32 is not JSON serializable. I'll fix this soon, but until then I'd suggest using strings as class names.
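One way to follow the string-class suggestion for a single-class dataset is to write a copy of each GeoJSON file with a constant string property on every feature. The sketch below uses only the standard library. The helper name add_class_property, the file names, and the landcover/ag_field defaults (borrowed from the example above) are illustrative assumptions, not part of geo2ml.

```python
import json
import tempfile
from pathlib import Path


def add_class_property(src, dst, column="landcover", value="ag_field"):
    """Write a copy of a GeoJSON file with a constant string property on every feature.

    Hypothetical helper for illustration; returns the number of features updated.
    """
    data = json.loads(Path(src).read_text())
    for feature in data["features"]:
        # GeoJSON allows "properties" to be null, so fall back to an empty dict.
        props = feature.get("properties") or {}
        # Store the class as a string, since integer class values currently fail.
        props[column] = str(value)
        feature["properties"] = props
    Path(dst).write_text(json.dumps(data))
    return len(data["features"])


# Demonstration on a throwaway file so the sketch runs without real data.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example_areas.json"
    dst = Path(tmp) / "example_areas_agfield.geojson"
    src.write_text(json.dumps({
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature",
             "properties": {"id": 0, "country": "vietnam"},
             "geometry": None},
        ],
    }))
    n_updated = add_class_property(src, dst)
    # Read the output back to confirm the property really is in the written file.
    written = json.loads(dst.read_text())
```

The written file could then be passed as polygon_path with target_column='landcover'. Reading the output back, as done here, is a cheap way to confirm that the property actually landed in the file that gets passed on.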

Did this help you?