Closed hausen closed 2 years ago
I can understand the need for the --clean
option on the extract command (although I would probably just run the clean once on the input of the extract command instead), but I don't understand the need for the --object-type
here. Can you elaborate on your use case?
Thanks for the review!
In some cases, I have no need for the relations in an extract. While it is true that they can simply be ignored after extract
is run, those objects will still be in the output and will be unnecessarily parsed before being discarded. Dropping them as early as possible improves performance.
It can be argued that some combinations of other options with --object-type
may not make sense, for instance --object-type=node --strategy=complete_ways
(without --object-type=way
), but in this case the user will get an empty extract, without any ill side effects, so I assume that the user knows what they're doing.
so I assume that the user knows what they're doing
Unfortunately that isn't always the case. And I forsee a lot of problems if users use, say, --object-type=way
because they think they only need ways but extract
can't do its work sensibly without the nodes, etc. At the very least we'd have to make sure we'd warn about combinations that will not work to give the users some sensible guidance. But it really doesn't make much sense to me to add this option to the extract
command. It doesn't "fit" in here. It is confusing, the average user will not understand how it interacts with the rest of the command.
I have much less of a problem with the --clean
option, that's basically just a way of tweaking the output format, not something that interacts in any way with the rest of the command.
Would warning about combinations that don't make sense, or even warning and aborting the execution with an error code, be an acceptable solution?
Lets look at the option combinations that will actually make sense. Adding --object-type
basically gives us an additional 6 different ways of using the input: n, w, r, n+w, n+r, and w+r. The 7th case is the default n+w+r. Now in the context of the extract
command, any combinations without nodes don't make sense and relations without ways don't make any sense either, this leaves us with n and n+w as sensible use cases with --object-type
. So only 2 out of possible 6 options make even sense. The only reason we are even considering using the --object-type
option is because of its use in other osmium
commands, but I don't think this option is the right approach here.
Looking at the remaining cases of n and n+w: The first is really a special case of the simple
strategy, the other strategies don't make sense with it. So I could image a new strategy, lets call it "nodes_only", which only extracts nodes and nothing else. Compared to the "simple" strategy this would have the added benefit of saving quite some memory, because we don't have to remember which nodes we extracted for the ways. Mind you, I am not saying that I want to introduce this strategy if nobody actually needs its, and I don't really see the use case. But conceptually it make sense to me to see it this way.
Now for the last case n+w: The "smart" strategy is about relations, so it doesn't apply here. So this can only be used usefully with the "simple" and "complete_ways" strategy. Which one are you using @hausen? The "simple" strategy is not used often, so I assume it is "complete_ways". Maybe a strategy option -S ignore-relations
would suffice?
Maybe a strategy option
-S ignore-relations
would suffice?
That would fit my use case, thanks.
However, may I suggest an alternative to future-proof the command, in case someone eventually needs to drop relations while using the smart
or simple
strategies? I can add a --drop relations
option to osmium extract
that will accomplish the same thing and can be combined with any strategy.
~For completion's sake, I can also add --drop nodes
, --drop ways
and --drop relations
to osmium cat
. Dropping nodes
and ways
will only be allowed with cat
.~ Actually, for now, I'll just implement osmium extract --drop relations
. If it proves relevant, it can be extended to osm cat
.
The --drop
would just be a different way of expressing the same thing as --object-types
, that doesn't give us anything.
Having this as an option to the strategy makes it clear that it belongs to that logically. Which it does. Because the strategy can take into account that it doesn't have to handle relations and optimize things internally, in this case it doesn't have to store which ways it has to extract, so it can save memory. There is no problem with implementing this just for the complete_ways
strategy, we can always extend this later if somebody needs it, but for the time being I don't see the need.
To move this forward: Can you split this into two PRs? One for the --clean
option, one for the extract without relations. Please remove the version change and other changes from the CMake config. Just add the one file which is new to the list. And merge the commits please.
I think I wasn't too clear with the --drop relations
option.
Anyway, I'll keep only --clean
in this PR and we'll discuss the rest in another one.
@joto ready for the final review
The cleaning isn't part of the strategy, it is just something we should do when writing out the data.
It feels more like something that should be done when reading the data, at least that's what I can gather from command_cat.cpp
.
I think the best option would be to add this to the Extract class where we are writing out the data. (...) We can't do this directly in the current code
Honestly, I don't think replicating code is a cleaner solution than storing flags in the only possible place where it enables us to keep the code DRY.
Apart from storing the flags in the strategy, I see two alternatives that are demonstrably cleaner:
osmium::io::Reader
, as it's already done with the osm_entity_bits
osmium::io::Writer
Both would require a small refactor in the osmium library. Unfortunately, I don't think I'll be able to devote much time to that in the near future.
I devised a way to make the strategy completely unaware of the clean options. They are passed to the input file and reader instead (see src/extract/io.hpp
).
IMHO, I don't see another clean way of accomplishing this without resorting to either replicating functionality from the osmium lib or refactoring it.
Sorry, but this is much worse. Just adding another level of indirection is not making the code clearer.
It feels more like something that should be done when reading the data,
It doesn't really matter much whether its done when reading or writing, but doing it when writing has the benefit of not doing it at all for the data that's never written, which presumably is the larger part of the data when doing an extract. This would only be different if you create many overlapping extracts, in which case the same data is written our more than once, so has to be cleaned more than once. But that is not the most usual use case.
Honestly, I don't think replicating code is a cleaner solution than storing flags in the only possible place where it enables us to keep the code DRY.
Well, I disagree. Three or four lines of code duplication is well worth it compared to the introduction of the code in extract/io.hpp.
Instead of duplicating code, may I propose this approach?
m_clean_attrs
, which is currently in OptionClean
, will be moved to osmium::io::Writer
in libosmium
(similar to how m_read_which_entities
are kept in osmium::io::Reader
)OptionClean::apply_to
will also be moved to osmium::io::Writer
and will be invoked in osmium::io::Writer::do_flush
right before osmium::memory::Buffer::write_buffer
is calledsrc/extract/extract.cpp:31
What do you think?
Moving anything to libosmium doesn't help us, because we can't use it right away. I don't want to make osmium tool dependent on the very newest version of libosmium, because it makes upgrading more difficult to users. Not to mention that we have to make sure the Reader/Writer are compatible.
Also this is waaay to much effort for such a minor feature, really. I have spent too much time on this already.
Yeah, me too. Let's close it.
I took your code and added the small changes in the Extract class to make it work. This is now in master in 9ab2720. Thanks for your contributions, I know it is sometimes annoying when you have a solution and the maintainer is picky, sorry.
I have also implemented the other thing you wanted @hausen in 6747eb2: There is now an option "-S relations=false" for the complete_ways
strategy in the extract command.
Wow! That's awesome! Thank you very, very much for taking the time to implement this and, most importantly, for osmium as a whole.
Even though the same results can be accomplished by piping the output of
osmium extract
toosmium cat
, running all operations with a single command is more performant since the input only needs to be parsed once.