ropensci / osmextract

Download and import OpenStreetMap data from Geofabrik and other providers
https://docs.ropensci.org/osmextract
GNU General Public License v3.0
170 stars 12 forks source link

Don't store gpkg file when reading an osm.pbf file #235

Open luukvdmeer opened 2 years ago

luukvdmeer commented 2 years ago

When I use oe_read() to read an osm.pbf file I have stored on disk, it internally calls oe_vectortranslate() to translate the osm.pbf file to a gpkg file, before reading it. However, it stores this .gpkg file on disk in the same folder as my osm.pbf file and does not remove it afterwards. I would expect oe_read() to just read my osm.pbf file into R, and the translation to gpkg being only an internal procedure that as a user I don't even have to know of. Therefore I would argue that by default the gpkg file should be automatically removed after reading, or stored in a temporary folder rather than in the same folder as my osm.pbf file.

(this is of course different when I explicitly call oe_vectortranslate(), in that case I expect the gpkg file to be stored, because that is the goal of the function).

Robinlovelace commented 2 years ago

I think this is a reasonable suggestion but I'm not sure the benefits would outweigh the costs. The reason for storing the file in the local directory is that it can save time: if you repeatedly need to read in a dataset during different sessions it takes much longer to convert to .gpkg each time than to read in from the .gpkg file. The conversion process is one of the slowest parts. My guess is that more person hours and compute resources will be saved by keeping .gpkg files in the data directory than by changing the default settings so that the are written to the temporary. Could there be an option that could allow the .gpkg file to be saved elsewhere for people who want to?

agila5 commented 2 years ago

Hi Luuk, and thanks for creating this issue.

I would expect oe_read() to just read my osm.pbf file into R, and the translation to gpkg being only an internal procedure that as a user I don't even have to know of. Therefore I would argue that by default the gpkg file should be automatically removed after reading, or stored in a temporary folder rather than in the same folder as my osm.pbf file.

I agree that oe_read may remove the gpkg file after reading it (and hide the whole procedure to the users), but I don't understand which are the benefits of removing that file (a more intuitive/clean process? less disk on space? it might be preferable to leave no artefacts after reading in the data?). I think it might be pretty easy to implement that, but, as Robin said, I'm still not sure if the benefits outweigh the costs. Can you add slightly more details (no rush)?

(Another slightly OT question: do you think that, at the moment, the functions in this package are too verbose? Because I love verbose output with a lot of details, just to be a little bit more secure that nothing happens when converting the data and so on, but I never understand what is a "reasonable" level of detail)

Could there be an option that could allow the .gpkg file to be saved elsewhere for people who want to?

IMO that adds unnecessary complexity since, in that case, we should also check this other directory when running oe_* and decide what to do in case of conflicts between the contents in the standard download directory and this new directory. If you want to save the data you just read in a "non-standard" location, you can run st_save(xxx, "path.gpkg"). Do you agree?

luukvdmeer commented 2 years ago

Can you add slightly more details (no rush)?

Currently I have an osm.pbf file in the inst directory of an R package (in development..). I read this file with oe_read() in a vignette to show examples, but in that way it "clutters" my inst directory. (I guess the best solution here is probably to add the .gpkg file to both .gitignore and .Rbuildignore. You are right that it might even be good to keep the file to save resources when rebuilding vignettes)

More important for me is when I am not working locally. For example when reading pre-processed osm.pbf files (e.g. clipped or pre-filtered) from company servers I would not like to also write to the company servers and store the .gpkg file there permanently. Of course I can manually remove it afterwards. Therefore I also don't think this is a critical issue but it is just not intuitive to me that when I call a read function it automatically also writes something. I also don't think that would be regarded as "best practice" (but not sure about that).

Another slightly OT question: do you think that, at the moment, the functions in this package are too verbose?

My personal opinion would be that as long as I can set quiet = TRUE and this effectively supresses all messages, it does not matter for me how verbose the output is when quiet = FALSE (as you say more info might only be beneficial).

agila5 commented 2 years ago

Thanks, it sounds like reasonable use cases. I will try to implement this later today! Maybe with a new argument in oe_get and oe_read named remove_gpkg ? I think that the default value should be FALSE for the same reasons that Robin explained.

agila5 commented 2 years ago

See #237