Utility for syncing training, validation, and evaluation data.

In most of the datasets I'm putting together, masks and tiles do not always match one-to-one. At the very least, the documentation should clarify that the trainer needs directories in which every tile has a matching mask. Even better would be to provide a simple pre-processing script for syncing the masks and tiles, or to give rs_trainer an option to ignore or remove un-synced masks and tiles.

Currently I just use a simple Python script to sync the two directories:

import os
import argparse

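def dir_dict(root: str) -> dict:
    """Map each file's last three path components (z/x/y.ext for tiles) to its full path."""
    dd = {}

    for subdir, _dirs, files in os.walk(root):
        for file in files:
            path = os.path.join(subdir, file)
            # Split on the OS separator so the key is also correct on Windows.
            key = '/'.join(os.path.normpath(path).split(os.sep)[-3:])
            dd[key] = path

    return dd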

def main():
    parser = argparse.ArgumentParser(
        description="Delete files that exist in only one of two directories.")
    parser.add_argument('dir1', type=str)
    parser.add_argument('dir2', type=str)
    args = parser.parse_args()

    dir1_dict = dir_dict(args.dir1)
    dir2_dict = dir_dict(args.dir2)

    # Collect files whose key is present in only one of the two directories.
    removed = [v for k, v in dir1_dict.items() if k not in dir2_dict]
    removed += [v for k, v in dir2_dict.items() if k not in dir1_dict]

    # Destructive: un-synced files are deleted from disk.
    for file in removed:
        os.remove(file)

    return len(removed)

if __name__ == "__main__":
    print(f"removed {main()} un-synced files")
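
For the "ignore" option, a non-destructive variant might look like the rough sketch below. It only illustrates the idea and is not how the trainer currently works: `tile_keys` and `synced_pairs` are hypothetical names, the key is the path relative to each root rather than the last three components, and the file extension is dropped on the assumption that tiles and masks may be stored in different formats.

```python
import os
import tempfile


def tile_keys(root):
    """Map 'z/x/y' (relative path, extension dropped) to each file's full path."""
    keys = {}
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(subdir, name)
            stem, _ext = os.path.splitext(os.path.relpath(path, root))
            keys[stem.replace(os.sep, "/")] = path
    return keys


def synced_pairs(images_dir, masks_dir):
    """Return (image_path, mask_path) pairs found in both trees; nothing is deleted."""
    images = tile_keys(images_dir)
    masks = tile_keys(masks_dir)
    # Only keys present on both sides survive; un-synced files are skipped.
    return [(images[k], masks[k]) for k in sorted(images.keys() & masks.keys())]


if __name__ == "__main__":
    # Tiny demo on throwaway directories (hypothetical tile names).
    with tempfile.TemporaryDirectory() as tmp:
        for sub, names in (("images", ["18/1/2.webp", "18/1/3.webp"]),
                           ("masks", ["18/1/2.png"])):
            for n in names:
                p = os.path.join(tmp, sub, *n.split("/"))
                os.makedirs(os.path.dirname(p), exist_ok=True)
                open(p, "w").close()
        pairs = synced_pairs(os.path.join(tmp, "images"), os.path.join(tmp, "masks"))
        print(f"{len(pairs)} synced pair(s)")
```

A dataset loader could iterate over the returned pairs instead of the raw directory listings, which leaves the un-synced files on disk for later inspection.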

mapbox / robosat

Utility for syncing training, validation, and evaluation data. #188