mayrajeo / geo2ml

Python library and module for converting earth observation data to be suitable for machine learning models
https://mayrajeo.github.io/geo2ml/
Apache License 2.0

Converting GeoJSON to COCO annotations, purpose of target_column argument? #12

Closed fangzp closed 2 months ago

fangzp commented 2 months ago

Hello Janne! Thanks for creating such a useful tool for those of us working on machine learning for geospatial data.

Below is a sample of what I consider to be a pretty standard GeoJSON annotation for targets of interest (in this case, polygon annotations for agricultural fields), converted from a gpkg using ogr2ogr in GDAL. The 'id' corresponds to the ID of the AOI/image tile corresponding to a GeoTIFF named 0_vietnam.tif. I'm trying to understand what the purpose of the target_column argument in the create_coco_dataset function is, because I can't get it to work properly; setting target_column='country' or ='id' or ='geometry' all result in KeyErrors when I try to run the command over the entire dataset. The examples_coco.ipynb example notebook is a little unclear since I don't know what 'layer' is in the shapefiles, seeing as I'm not sure whether the files being loaded in (104_28_Hiidenportti_Chunk1_orto.geojson, etc.) have been uploaded to the repo yet.

What should I be doing instead in this case?

{
"type": "FeatureCollection",
"name": "reference",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::32648" } },
"features": [
{ "type": "Feature", "properties": { "id": 0, "country": "vietnam", "_predicate": "INTERSECTS" }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 615463.371700000017881, 2302274.039799999445677 ], [ 615463.603799999691546, 2302274.169700000435114 ], [ 615502.078999999910593, 2302297.654400000348687 ], [ 615534.710699999704957, 2302241.427500000223517 ], [ 615457.650700000114739, 2302194.133099999278784 ], [ 615454.343399999663234, 2302200.747700000181794 ], [ 615449.713200000114739, 2302198.10190000012517 ], [ 615419.286000000312924, 2302249.365000000223517 ], [ 615463.371700000017881, 2302274.039799999445677 ] ] ] } },

Example function call (which results in a KeyError):

for img in train_images:
    print(img)
    shp = path_to_data/'reference_jsons'/f'{img.stem}_areas.json'
    print(shp)
    create_coco_dataset(raster_path=img, polygon_path=shp, target_column='country',
                        outpath=outpath/'train', output_format='gpkg', save_grid=False, allow_partial_data=True,
                        dataset_name=f'{img.stem}_train', gridsize_x=764, gridsize_y=750, 
                        ann_format='polygon', min_bbox_area=8)

mayrajeo commented 2 months ago

Hi,

The purpose of target_column is to indicate which column contains the values used to populate the categories field in the resulting coco.json data. In your case, for instance, the categories would be the country names. In the example data, the input shapefiles contain many more columns with more or less useful information, such as whether the annotation is in a managed or a conserved forest, so the column layer is specified there.
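
As a rough illustration of that idea (a minimal sketch, not geo2ml's actual implementation), the snippet below collects the distinct values of a chosen column and turns them into COCO-style category entries. The helper name build_categories, the alphabetical ordering, and the trimmed-down feature dictionaries are assumptions made for the example; the "supercategory": "object" value and the ids starting at 1 follow the output shown later in this thread.

```python
# Sketch only: mimics the idea behind target_column, not geo2ml's real code.
def build_categories(features, target_column):
    """Map each distinct value of target_column to a COCO category entry."""
    # Collect the distinct values as strings; sorting is an assumption
    # made here so that the resulting ids are deterministic.
    names = sorted({str(f["properties"][target_column]) for f in features})
    return [
        {"supercategory": "object", "id": i, "name": name}
        for i, name in enumerate(names, start=1)
    ]


# Hypothetical, trimmed-down features (geometry omitted for brevity).
features = [
    {"properties": {"id": 0, "country": "vietnam", "landcover": "ag_field"}},
    {"properties": {"id": 1, "country": "vietnam", "landcover": "ag_field"}},
]

country_categories = build_categories(features, "country")
landcover_categories = build_categories(features, "landcover")
print(country_categories)
print(landcover_categories)
```

With 'country' as the column, the only category is the country name; with a column such as 'landcover', the category describes what the polygon actually is.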

I quickly tested the following two examples with the CLI functions:

geo2ml_create_coco_dataset 0_vietnam.tif 0_vietnam_areas.gpkg country test vietnam_fields --gpkg_layer reference

and after converting the 0_vietnam_areas.gpkg to geojson:

geo2ml_create_coco_dataset 0_vietnam.tif 0_vietnam_areas.geojson country testgeoj vietnam_fields

and both seemed to work aside from some pyogrio-related warnings. Can you provide a bit more information about the errors so I can fix them and also possibly clarify the examples?

fangzp commented 2 months ago

Thanks for the prompt response! Ultimately, what I would like the categories to record is 'ag field': a single-category annotation file where all polygons have a category_id of 1, with the category name being something like 'agfields_singleclass' and 'supercategory' being something like 'AgriculturalFields' if necessary, since what I'm trying to predict is not the name of the country but the presence of an agricultural field. I had only chosen 'country' as the target_column because I wasn't sure what else to do, and indeed it works fine for just one file. However, running the above over all of the GeoJSON files (other files have features whose 'country' property is populated exclusively by 'cambodia', for example) results in the KeyError described in the original post.

To try to produce the desired behavior I tried adding an extra property to each of the features in the GeoJSON file (e.g. something like the below):

{"type": "FeatureCollection", "name": "reference", "crs": {"type": "name", "properties": {"name": "urn:ogc:def:crs:EPSG::32648"}}, "features": [{"type": "Feature", "properties": {"id": 0, "country": "vietnam", "_predicate": "INTERSECTS", "ag_field": 1}, "geometry": {"type": "Polygon", "coordinates": [[[615463.3717, 2302274.0397999994], [615463.6037999997, 2302274.1697000004], [615502.0789999999, 2302297.6544000003], [615534.7106999997, 2302241.4275], [615457.6507000001, 2302194.1330999993], [615454.3433999997, 2302200.7477], [615449.7132000001, 2302198.1019], [615419.2860000003, 2302249.365], [615463.3717, 2302274.0397999994]]]}}, ...

but then when I run the below snippet I get an AttributeError: 'Pandas' object has no attribute 'ag_field'.

for img in train_images:
    shp = path_to_data/'reference_jsons'/f'{img.stem}_areas.json'
    create_coco_dataset(raster_path=img, polygon_path=shp,
                        target_column='ag_field',
                        outpath=outpath/'train', output_format='geojson', save_grid=False, allow_partial_data=True,
                        dataset_name=f'{img.stem}_train', gridsize_x=764, gridsize_y=750, 
                        ann_format='polygon', min_bbox_area=8)

mayrajeo commented 2 months ago

I see, I thought that would be what you want to do.

Your result seems a bit odd, though: it looks as if the files you are using for polygon_path do not contain the modified information. I tried to replicate this with the following data:

{
    "type": "FeatureCollection",
    "name": "0_vietnam_areas_agfield",
    "crs": {
        "type": "name",
        "properties": {
            "name": "urn:ogc:def:crs:EPSG::32648"
        }
    },
    "features": [
        {
            "type": "Feature",
            "properties": {
                "id": 0,
                "country": "vietnam",
                "_predicate": "INTERSECTS",
                "landcover": "ag_field"
            },
            "geometry": {
                "type": "Polygon",
                "coordinates": [
                    [
                        [
                            615463.371700000017881,
                            2302274.039799999445677
                        ],
                        [
                            615463.603799999691546,
                            2302274.169700000435114
                        ],
                        [
                            615502.078999999910593,
                            2302297.654400000348687
                        ],
                        [
                            615534.710699999704957,
                            2302241.427500000223517
                        ],
                        [
                            615457.650700000114739,
                            2302194.133099999278784
                        ],
                        [
                            615454.343399999663234,
                            2302200.747700000181794
                        ],
                        [
                            615449.713200000114739,
                            2302198.10190000012517
                        ],
                        [
                            615419.286000000312924,
                            2302249.365000000223517
                        ],
                        [
                            615463.371700000017881,
                            2302274.039799999445677
                        ]
                    ]
                ]
            }
        },
...

from geo2ml.scripts.data import create_coco_dataset
from pathlib import Path

create_coco_dataset(
    raster_path='0_vietnam.tif', 
    polygon_path='0_vietnam_areas_agfield.geojson',
    target_column='landcover',
    outpath=Path('testp'),
    output_format='geojson',
    save_grid=False,
    allow_partial_data=True,
    dataset_name='example_train',
    gridsize_x=764,
    gridsize_y=750,
    ann_format='polygon',
    min_bbox_area=8
)

and aside from a few warnings from pyogrio ("RuntimeWarning: Several features with id = 0 have been found. Altering it to be unique. This warning will not be emitted anymore for this layer"), it worked fine, resulting in the following categories in the COCO-style JSON file:

    "categories": [
        {
            "supercategory": "object",
            "id": 1,
            "name": "ag_field"
        }
    ],

I did find a possibly related bug, though: when the target column contains integers instead of strings, the conversion fails with TypeError: Object of type int32 is not JSON serializable. I'll fix this soon, but until then I'd suggest using strings as class names.
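
One way to add a string-valued class column to an existing annotation file is sketched below, using only Python's standard json module. The function name add_constant_property and its default column and value are hypothetical choices for this example; the sketch assumes the file is a GeoJSON FeatureCollection like the ones shown above.

```python
import json
from pathlib import Path


def add_constant_property(src, dst, column="landcover", value="ag_field"):
    """Write a copy of a GeoJSON file with column=value set on every feature.

    The value is stored as a string, so the class name never ends up as an
    integer in the target column. Returns the number of features written.
    """
    data = json.loads(Path(src).read_text())
    for feature in data["features"]:
        # "properties" may be null in GeoJSON, so fall back to an empty dict.
        props = feature.get("properties") or {}
        props[column] = str(value)
        feature["properties"] = props
    Path(dst).write_text(json.dumps(data))
    return len(data["features"])
```

For example, add_constant_property('0_vietnam_areas.geojson', '0_vietnam_areas_agfield.geojson') would produce a file whose features all carry "landcover": "ag_field", which can then be passed as polygon_path with target_column='landcover'.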

Did this help you?

fangzp commented 2 months ago

Thanks again for your help; the function seems to be working, at least for one image at a time. (I'm still having trouble running it over all images in a folder, each with its own associated annotation GeoJSON, which is where I'm still getting that 'Pandas' object has no attribute 'landcover' AttributeError.)
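
The cause of this error is not confirmed in the thread, but one way to narrow it down is to check, before looping over the whole folder, that every annotation file actually carries the target column on all of its features. The following is a diagnostic sketch using only the standard library; the helper name files_missing_column is made up for this example.

```python
import json
from pathlib import Path


def files_missing_column(paths, column):
    """Return the paths where at least one feature lacks `column`.

    Files without any features are not flagged by this check.
    """
    missing = []
    for path in paths:
        features = json.loads(Path(path).read_text()).get("features", [])
        # A null "properties" member is treated as an empty dict.
        if not all(column in (f.get("properties") or {}) for f in features):
            missing.append(Path(path))
    return missing
```

With the layout used earlier in this thread, a call such as files_missing_column(sorted((path_to_data/'reference_jsons').glob('*_areas.json')), 'landcover') would list any files that were not updated with the new property.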

I see that the function automatically creates sets of tiled images. For example, the above code snippet creates two folders under testp/, images/ and vectors/, and a file example_train.json, which contains the COCO annotations for that image. Each of the two subdirectories holds two files: images/ contains R0C0.tif and R1C0.tif, and vectors/ contains R0C0.geojson and R1C0.geojson. One of these cropped images contains most of the source image '0_vietnam.tif' while the other contains just a small sliver of it; I assume the generated filenames index the row and column number of the generated tiles.
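
Under that assumption (that a name like R1C0 encodes row 1, column 0, which is a guess from the observed filenames and not confirmed here), the indices could be recovered from a tile name with a small helper. The function parse_tile_name is hypothetical and not part of geo2ml.

```python
import re

# Assumed naming pattern: "R<row>C<col>", e.g. "R0C0" or "R1C0".
TILE_NAME = re.compile(r"^R(\d+)C(\d+)$")


def parse_tile_name(filename):
    """Return (row, col) parsed from a tile filename such as 'R1C0.tif'."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension, if any
    match = TILE_NAME.match(stem)
    if match is None:
        raise ValueError(f"not a tile name: {filename!r}")
    return int(match.group(1)), int(match.group(2))
```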

Suppose that I wanted to convert the whole training set from the above dataset into COCO-compatible labels in order to pass them into a detectron2 instance segmentation model. Is there a way to preserve information about which image the labels originally came from (the original filenames, '0_vietnam.tif' and so forth)? Is there a way to prevent the tiling from occurring (i.e. to create just one whole image and one COCO annotation file corresponding to the image/GeoJSON annotation file that was passed in)?

fangzp commented 2 months ago

Figured out the above on my own. Thanks so much again for the help and for creating such a useful tool!