mapbox / robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
MIT License

Could the rs subset tool add a function to automatically construct the dataset? #93

Open DragonEmperorG opened 6 years ago

DragonEmperorG commented 6 years ago

Given the config file, the tool could randomly produce a standard training dataset from the raw dataset. At present, if I want to train on my own dataset, I must create validation.tiles and training.tiles from the raw dataset.tiles manually; that is, I just copy and paste lines from the raw dataset.tiles. Is there a more efficient way?

daniel-j-h commented 6 years ago

Could you elaborate what you mean here?

What you can do is something along the lines of

sort -R dataset.tiles > randomized.tiles
split -n l/10 randomized.tiles split-

to split your dataset into 10 splits of randomly sampled tiles without overlap. From there on you can use one split for validation, one for evaluation, and cat the rest for training.
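The same shuffle-and-split idea can be sketched in Python for anyone who prefers a script over shell tools. This is only a minimal sketch and not part of robosat: the function names `split_tiles` and `write_splits`, the 10% validation and evaluation fractions, and the output file names are all assumptions made for illustration. Each line of the tiles file is treated as an opaque string, so the sketch does not depend on the exact tile format.

```python
import random
from pathlib import Path


def split_tiles(lines, validation=0.1, evaluation=0.1, seed=0):
    """Shuffle tile lines and partition them into three disjoint lists.

    Hypothetical helper, not a robosat API.
    """
    # Drop blank lines and duplicates, keeping the first occurrence of each tile.
    tiles = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(tiles)

    n_val = int(len(tiles) * validation)
    n_eval = int(len(tiles) * evaluation)

    return {
        "validation": tiles[:n_val],
        "evaluation": tiles[n_val:n_val + n_eval],
        "training": tiles[n_val + n_eval:],
    }


def write_splits(source, out_dir):
    """Read `source` and write one .tiles file per split into `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    splits = split_tiles(Path(source).read_text().splitlines())
    for name, tiles in splits.items():
        (out_dir / f"{name}.tiles").write_text("".join(t + "\n" for t in tiles))
    return splits
```

Calling `write_splits("dataset.tiles", "splits")` would then produce `splits/training.tiles`, `splits/validation.tiles`, and `splits/evaluation.tiles`, which can be passed to rs subset one at a time.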

DragonEmperorG commented 6 years ago

What I want is a tool that includes the sort and split steps and, on that basis, creates a dataset that meets the requirements of rs train (in other words, it takes the result of the sort and split steps and uses rs subset to complete the job). Is this necessary?

daniel-j-h commented 6 years ago

I don't think we should do this in rs subset. Mostly because right now rs subset can be used for multiple use-cases e.g. creating your dataset but also for cleaning up your dataset and for hard-negative mining.

DragonEmperorG commented 6 years ago

Got it. While on this issue, I'd also like to bring up #91 and #73. Following the method from #73, I removed the tiles that made the number of inputs and targets uneven, and it works. I wonder whether it would be worth adding a check script that reports dataset-related errors, or handling such datasets intelligently?

daniel-j-h commented 6 years ago

We should keep the user responsible for preparing the dataset and making sure it's in sync. What we could do in the context of https://github.com/mapbox/robosat/issues/91 is to go through our assertions and make them easier to understand (and show ways to solve the problem) for our users.

DragonEmperorG commented 6 years ago

I see. The user should be responsible for preparing the datasets. But I really think that making the usable datasets isn't an easy thing. And it would be really helpful to show which parts of the datasets are missing. Thank you for the comment.
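A check of this kind could look roughly like the sketch below. It is not part of robosat and makes assumptions: that images and labels are each stored as a slippy map directory of the form z/x/y.ext, and that a tile counts as matched when the same z/x/y exists on both sides regardless of file extension. The names `tiles_in` and `report_mismatch` are made up for illustration.

```python
from pathlib import Path


def tiles_in(root):
    """Collect (z, x, y) for every file stored as root/z/x/y.<ext>.

    Assumes a slippy map directory layout; hypothetical helper.
    """
    tiles = set()
    for path in Path(root).glob("*/*/*.*"):
        if path.is_file():
            # parts[-3] is the zoom directory, parts[-2] the x directory,
            # and the file stem is y with the extension removed.
            tiles.add((path.parts[-3], path.parts[-2], path.stem))
    return tiles


def report_mismatch(images_dir, labels_dir):
    """Return the tiles that exist on only one side of the dataset."""
    images = tiles_in(images_dir)
    labels = tiles_in(labels_dir)
    return {
        "missing_labels": sorted(images - labels),  # image present, label absent
        "missing_images": sorted(labels - images),  # label present, image absent
    }
```

Running `report_mismatch("dataset/training/images", "dataset/training/labels")` and printing both lists would show exactly which tiles are out of sync, so they can be removed or regenerated before training.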

wboykinm commented 5 years ago

Strong agreement with @DragonEmperorG:

But I really think that making the usable datasets isn't an easy thing.

This is the first stage in the pipeline where the inputs aren't an obvious output of a previous stage, with no guidance on how to create them; it also marks a point where terminology seems to shift from mask --> label (unless I'm misunderstanding that). Given a set of tiles specified in rs cover, it seems like it'd be pretty straightforward to add sensible default behavior here.

Addendum: I agree that rs subset may not be the best place for this, but rs rasterize via the dataset config might be able to handle it.