Open DragonEmperorG opened 6 years ago
Could you elaborate what you mean here?
What you can do is something along the lines of
sort -R dataset.tiles > randomized.tiles
split -n l/10 randomized.tiles split-
to split your dataset into 10 splits of randomly sampled tiles without overlap. From there on you can use one split for validation, one for evaluation, and cat the rest for training.
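Putting the pieces together, a sketch of that workflow (GNU coreutils assumed; the file names and the stand-in tile list are hypothetical):

```shell
# Stand-in for a real tile list: one tile id per line
printf 'tile-%03d\n' $(seq 1 100) > dataset.tiles

sort -R dataset.tiles > randomized.tiles   # shuffle the tile list
split -n l/10 randomized.tiles split-      # 10 line-based chunks: split-aa ... split-aj

mv split-aa validation.tiles               # one chunk for validation
mv split-ab evaluation.tiles               # one chunk for evaluation
cat split-a? > training.tiles              # cat the remaining 8 chunks for training
```

Note the `l/10` chunk spec: plain `-n 10` splits by bytes and can cut a tile id in half, while `l/10` keeps whole lines together.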
What I want to do includes the sort and split steps, and on top of that the tool would create a dataset ready for rs train (in other words, it would take the result of the sort and split steps and then use rs subset to complete the job). Is this necessary?
I don't think we should do this in rs subset. Mostly because right now rs subset can be used for multiple use-cases, e.g. creating your dataset, but also for cleaning up your dataset and for hard-negative mining.
Got it. While we're at it, I'd like to bring up #91 and #73. Following the approach from #73, I removed the uneven numbers of inputs and targets, and it works. I wonder whether it would be worth adding a check that reports dataset-related errors, or having the tool handle such datasets automatically?
We should keep the user responsible for preparing the dataset and making sure it's in sync. What we could do in the context of https://github.com/mapbox/robosat/issues/91 is to go through our assertions and make them easier to understand (and show ways to solve the problem) for our users.
I see. The user should be responsible for preparing the datasets. But I really think that preparing a usable dataset isn't easy, and it would be really helpful to show the missing parts of the dataset. Thank you for the comment.
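As an illustration only (not robosat's actual checks), a shell sketch can already surface such missing tiles, assuming the usual slippy-map layout of an images/ and a labels/ directory (bash process substitution assumed; paths hypothetical):

```shell
# Toy slippy-map layout: images/z/x/y.png and labels/z/x/y.png
mkdir -p images/18/1 labels/18/1
touch images/18/1/1.png images/18/1/2.png images/18/1/3.png
touch labels/18/1/1.png labels/18/1/2.png            # tile 3 has no label

# Lines unique to the first listing are image tiles without a matching label
comm -23 <(cd images && find . -name '*.png' | sort) \
         <(cd labels && find . -name '*.png' | sort)
# prints ./18/1/3.png
```

Swapping `-23` for `-13` would list the opposite case, labels without images.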
Strong agreement with @DragonEmperorG:
But I really think that making the usable datasets isn't an easy thing.
This is the first stage in the pipeline where the inputs aren't an obvious output of a previous stage, with no guidance on how to create them; it also marks a point where terminology seems to shift from mask --> label (unless I'm misunderstanding that). Given a set of tiles specified in rs cover, it seems like it'd be pretty straightforward to add sensible default behavior here.
Addendum: I agree that rs subset may not be the best place for this, but rs rasterize via the dataset config might be able to handle it.
Given the config file, we could randomly produce a standard training dataset from the raw dataset. By the way, at present, if I want to train on my dataset, I have to create validation.tiles and training.tiles from the raw dataset.tiles manually; I just copy and paste from the raw dataset.tiles. Could there be a more efficient way?
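For just a two-way validation/training split, one sketch using shuf, head, and tail (GNU coreutils assumed; the file names and the 10% ratio are only examples):

```shell
# Stand-in for a real tile list: one tile id per line
printf 'tile-%03d\n' $(seq 1 100) > dataset.tiles

shuf dataset.tiles > shuffled.tiles                         # random order
n=$(wc -l < shuffled.tiles)
head -n $((n / 10)) shuffled.tiles > validation.tiles       # first 10% for validation
tail -n +$((n / 10 + 1)) shuffled.tiles > training.tiles    # remaining 90% for training
```

Because both outputs come from the same shuffled file, the two sets are disjoint and together cover the whole raw tile list.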