ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata
317 stars 45 forks source link

update 'post_process_tags' in 'test-features' to reduce mock file sizes #323

Closed mpadge closed 1 year ago

mpadge commented 1 year ago

@jmaspons Could you please help out here by reviewing this PR? It updates a test function i added last year to reduce the sizes of mocked files for the available_tags() function.

The problem

The function scapes this OSM wiki page, and extracts all data from the tables. The problem is there are a large number of large tables, and the resultant mock file is too big. The old version of the function worked, but the web page has changed structure in the meantime (see #322 ), and so this size-reduction function also needs updating.

What is needed

A way to use rvest or xml2 to reduce a HTML document down to a sub-set of all nodes. I don't know how to do that?

What I've done

A really hacky modification of the document-as-text-file, by identifying table rows and removing most of them.

What you can do

If you know anything about document manipulation with rvest/xml2, please just have a look at the code to see if you can think of a better approach. If not, that's fine, and we can just merge anyway, because it does actually work in current form. It's just an embarassing approach to systematic manipulation of HTML docs. Thanks!

jmaspons commented 1 year ago

New code for me, but as I understand, there are two kinds of tables for feature tags: 1) with <div class="taglist", and 2) with <table class="wikitable".

The tests get a sample of each kind of data to generate the mock call response, dropping the header and all surrounding html code. This MR selects a sample of the rows for the type 2 tables to reduce even more the size of the mock files. Then it tests for buildind, a type 2) table. I would add a test for a feature in the type 1) tables (eg. aerialway).

For the row's reduction code, perhaps selecting all nodes but the 10 rows of building data + headers + a sample of type 1) tables, and then using xml2::xml_remove could work, but I don't get how that work and should be tried. I'll try later

mpadge commented 1 year ago

Thanks! And yes, indeed ...

For the row's reduction code, perhaps selecting all nodes but the 10 rows of building data + headers + a sample of type 1) tables, and then using xml2::xml_remove could work, but I don't get how that work and should be tried. I'll try later

That is what should work, but i can only get xml_remove() to remove everything or nothing, not a sub-set.

jmaspons commented 1 year ago

Check #324 to remove the hacky part of this PR. Once merged, I think we can merge this one