mapbox / robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
MIT License

Utility for syncing training, validation, and evaluation data. #188

Open markmester opened 4 years ago

markmester commented 4 years ago

On most of the datasets I'm putting together, masks and tiles do not always match one-to-one. At the very least, the documentation should clarify that the trainer needs a directory in which all files are in sync. Even better would be to provide a simple pre-processing script for syncing the masks/tiles, or to add an option in rs_trainer to ignore or remove un-synced masks/tiles.

Currently I just use a simple Python script to sync the directories:

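import argparse
import os


def dir_dict(root: str) -> dict:
    """Map each file's trailing z/x/y.ext path to its full path."""
    dd = {}

    for subdir, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(subdir, name)
            # Key on the last three path components (z/x/y.ext), split on the
            # platform separator rather than a hard-coded "/".
            # Note: the key includes the extension, so tiles and masks only
            # match if they share one.
            key = '/'.join(os.path.normpath(path).split(os.sep)[-3:])
            dd[key] = path

    return dd


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('dir1', type=str)
    parser.add_argument('dir2', type=str)
    args = parser.parse_args()

    dir1_dict = dir_dict(args.dir1)
    dir2_dict = dir_dict(args.dir2)

    # Collect files present in one directory tree but missing from the other.
    removed = [v for k, v in dir1_dict.items() if k not in dir2_dict]
    removed += [v for k, v in dir2_dict.items() if k not in dir1_dict]

    # Destructive: un-synced files are deleted from disk.
    for path in removed:
        os.remove(path)

    return len(removed)


if __name__ == "__main__":
    print(f"removed {main()} un-synced files")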

daniel-j-h commented 4 years ago

See https://github.com/mapbox/robosat/issues/93 and https://github.com/mapbox/robosat/issues/93#issuecomment-408142081

We should keep the user responsible for preparing the dataset and making sure it's in sync. What we could do in the context of #91 is to go through our assertions and make them easier to understand (and show ways to solve the problem) for our users.

The pre-condition for rs train is a dataset directory containing matching pairs of images and labels.

I agree that we could make this clearer in the README, though.

Would you be so kind as to open a pull request explaining this? Thanks!
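
For anyone who wants to verify the pre-condition above without deleting anything, here is a minimal, non-destructive sketch. It is not part of robosat: the function names `tile_keys`, `unpaired`, and `check_in_sync` are hypothetical, and it assumes both directories use a z/x/y.ext layout and that an image and its label are paired by z/x/y regardless of file extension. The error message is only an illustration of the "easier to understand" assertions discussed in this thread.

```python
import os


def tile_keys(root):
    """Map (z, x, y) -> file path for every file stored as root/z/x/y.ext."""
    keys = {}
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(subdir, name)
            parts = os.path.relpath(path, root).split(os.sep)
            if len(parts) != 3:
                continue  # not in the assumed z/x/y.ext layout, skip it
            z, x = parts[0], parts[1]
            y = os.path.splitext(parts[2])[0]  # drop the extension
            keys[(z, x, y)] = path
    return keys


def unpaired(images_dir, labels_dir):
    """Return (images without a label, labels without an image) as sorted paths."""
    images = tile_keys(images_dir)
    labels = tile_keys(labels_dir)
    only_images = sorted(images[k] for k in images.keys() - labels.keys())
    only_labels = sorted(labels[k] for k in labels.keys() - images.keys())
    return only_images, only_labels


def check_in_sync(images_dir, labels_dir):
    """Raise ValueError with a readable message if the dataset is not paired."""
    only_images, only_labels = unpaired(images_dir, labels_dir)
    if only_images or only_labels:
        example = (only_images + only_labels)[0]
        raise ValueError(
            f"dataset not in sync: {len(only_images)} image(s) without a label, "
            f"{len(only_labels)} label(s) without an image (e.g. {example}); "
            "remove or regenerate the unpaired tiles before training"
        )
```

Calling `check_in_sync("dataset/training/images", "dataset/training/labels")` (example paths only) before training would fail early with a list-able cause, and `unpaired(...)` can be used on its own as a dry run before running a destructive sync script.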