add `--clean` option to `extract` command

hausen commented 2 years ago

Even though the same results can be accomplished by piping the output of osmium extract to osmium cat, running all operations with a single command is more performant since the input only needs to be parsed once.

joto commented 2 years ago

I can understand the need for the --clean option on the extract command (although I would probably just run the clean once on the input of the extract command instead), but I don't understand the need for the --object-type here. Can you elaborate on your use case?

hausen commented 2 years ago

Thanks for the review!

In some cases, I have no need for the relations in an extract. While it is true that they can simply be ignored after extract is run, those objects will still be in the output and will be unnecessarily parsed before being discarded. Dropping them as early as possible improves performance.

It can be argued that some combinations of other options with --object-type may not make sense, for instance --object-type=node --strategy=complete_ways (without --object-type=way), but in this case the user will get an empty extract, without any ill side effects, so I assume that the user knows what they're doing.

joto commented 2 years ago

so I assume that the user knows what they're doing

Unfortunately that isn't always the case. And I forsee a lot of problems if users use, say, --object-type=way because they think they only need ways but extract can't do its work sensibly without the nodes, etc. At the very least we'd have to make sure we'd warn about combinations that will not work to give the users some sensible guidance. But it really doesn't make much sense to me to add this option to the extract command. It doesn't "fit" in here. It is confusing, the average user will not understand how it interacts with the rest of the command.

I have much less of a problem with the --clean option, that's basically just a way of tweaking the output format, not something that interacts in any way with the rest of the command.

hausen commented 2 years ago

Would warning about combinations that don't make sense, or even warning and aborting the execution with an error code, be an acceptable solution?

joto commented 2 years ago

Lets look at the option combinations that will actually make sense. Adding --object-type basically gives us an additional 6 different ways of using the input: n, w, r, n+w, n+r, and w+r. The 7th case is the default n+w+r. Now in the context of the extract command, any combinations without nodes don't make sense and relations without ways don't make any sense either, this leaves us with n and n+w as sensible use cases with --object-type. So only 2 out of possible 6 options make even sense. The only reason we are even considering using the --object-type option is because of its use in other osmium commands, but I don't think this option is the right approach here.

Looking at the remaining cases of n and n+w: The first is really a special case of the simple strategy, the other strategies don't make sense with it. So I could image a new strategy, lets call it "nodes_only", which only extracts nodes and nothing else. Compared to the "simple" strategy this would have the added benefit of saving quite some memory, because we don't have to remember which nodes we extracted for the ways. Mind you, I am not saying that I want to introduce this strategy if nobody actually needs its, and I don't really see the use case. But conceptually it make sense to me to see it this way.

Now for the last case n+w: The "smart" strategy is about relations, so it doesn't apply here. So this can only be used usefully with the "simple" and "complete_ways" strategy. Which one are you using @hausen? The "simple" strategy is not used often, so I assume it is "complete_ways". Maybe a strategy option -S ignore-relations would suffice?

hausen commented 2 years ago

Maybe a strategy option -S ignore-relations would suffice?

That would fit my use case, thanks.

However, may I suggest an alternative to future-proof the command, in case someone eventually needs to drop relations while using the smart or simple strategies? I can add a --drop relations option to osmium extract that will accomplish the same thing and can be combined with any strategy.

~For completion's sake, I can also add --drop nodes, --drop ways and --drop relations to osmium cat. Dropping nodes and ways will only be allowed with cat.~ Actually, for now, I'll just implement osmium extract --drop relations. If it proves relevant, it can be extended to osm cat.

joto commented 2 years ago

The --drop would just be a different way of expressing the same thing as --object-types, that doesn't give us anything.

Having this as an option to the strategy makes it clear that it belongs to that logically. Which it does. Because the strategy can take into account that it doesn't have to handle relations and optimize things internally, in this case it doesn't have to store which ways it has to extract, so it can save memory. There is no problem with implementing this just for the complete_ways strategy, we can always extend this later if somebody needs it, but for the time being I don't see the need.

To move this forward: Can you split this into two PRs? One for the --clean option, one for the extract without relations. Please remove the version change and other changes from the CMake config. Just add the one file which is new to the list. And merge the commits please.

hausen commented 2 years ago

I think I wasn't too clear with the --drop relations option.

Anyway, I'll keep only --clean in this PR and we'll discuss the rest in another one.

hausen commented 2 years ago

@joto ready for the final review

hausen commented 2 years ago

The cleaning isn't part of the strategy, it is just something we should do when writing out the data.

It feels more like something that should be done when reading the data, at least that's what I can gather from command_cat.cpp.

I think the best option would be to add this to the Extract class where we are writing out the data. (...) We can't do this directly in the current code

Honestly, I don't think replicating code is a cleaner solution than storing flags in the only possible place where it enables us to keep the code DRY.

Apart from storing the flags in the strategy, I see two alternatives that are demonstrably cleaner:

storing them in osmium::io::Reader, as it's already done with the osm_entity_bits
storing them in osmium::io::Writer

Both would require a small refactor in the osmium library. Unfortunately, I don't think I'll be able to devote much time to that in the near future.

hausen commented 2 years ago

I devised a way to make the strategy completely unaware of the clean options. They are passed to the input file and reader instead (see src/extract/io.hpp).

IMHO, I don't see another clean way of accomplishing this without resorting to either replicating functionality from the osmium lib or refactoring it.

joto commented 2 years ago

Sorry, but this is much worse. Just adding another level of indirection is not making the code clearer.

It feels more like something that should be done when reading the data,

It doesn't really matter much whether its done when reading or writing, but doing it when writing has the benefit of not doing it at all for the data that's never written, which presumably is the larger part of the data when doing an extract. This would only be different if you create many overlapping extracts, in which case the same data is written our more than once, so has to be cleaned more than once. But that is not the most usual use case.

Honestly, I don't think replicating code is a cleaner solution than storing flags in the only possible place where it enables us to keep the code DRY.

Well, I disagree. Three or four lines of code duplication is well worth it compared to the introduction of the code in extract/io.hpp.

hausen commented 2 years ago

Instead of duplicating code, may I propose this approach?

m_clean_attrs, which is currently in OptionClean, will be moved to osmium::io::Writer in libosmium (similar to how m_read_which_entities are kept in osmium::io::Reader)
the code that is currently in OptionClean::apply_to will also be moved to osmium::io::Writer and will be invoked in osmium::io::Writer::do_flush right before osmium::memory::Buffer::write_buffer is called
the attributes that must be cleaned out are passed to the writer in src/extract/extract.cpp:31

What do you think?

joto commented 2 years ago

Moving anything to libosmium doesn't help us, because we can't use it right away. I don't want to make osmium tool dependent on the very newest version of libosmium, because it makes upgrading more difficult to users. Not to mention that we have to make sure the Reader/Writer are compatible.

Also this is waaay to much effort for such a minor feature, really. I have spent too much time on this already.

hausen commented 2 years ago

Yeah, me too. Let's close it.

joto commented 2 years ago

I took your code and added the small changes in the Extract class to make it work. This is now in master in 9ab2720. Thanks for your contributions, I know it is sometimes annoying when you have a solution and the maintainer is picky, sorry.

joto commented 2 years ago

I have also implemented the other thing you wanted @hausen in 6747eb2: There is now an option "-S relations=false" for the complete_ways strategy in the extract command.

hausen commented 2 years ago

Wow! That's awesome! Thank you very, very much for taking the time to implement this and, most importantly, for osmium as a whole.

osmcode / osmium-tool

add `--clean` option to `extract` command #237