lgnbhl / BFS

🇨🇭Search and Download Data from the Swiss Federal Statistical Office
https://lgnbhl.github.io/BFS
GNU General Public License v3.0

bfs_get_catalog_data() returns empty data.frame #18

Closed lbertela closed 1 month ago

lbertela commented 1 month ago

Hello,

The functions bfs_get_catalog(), bfs_get_catalog_data() and bfs_get_catalog_tables() no longer work: they return an empty data.frame. The CRAN checks also report issues: https://cran.r-project.org/web/checks/check_results_BFS.html

Thank you in advance :)

lgnbhl commented 1 month ago

Hello,

Thanks for your message. I just pushed a fix to this GitHub repository. You can install the fix with the following:

remotes::install_github("lgnbhl/BFS")

Now the functions should work again :).

These functions broke because the official BFS website changed its RSS feed structure (which I was scraping). With the new fix, I get the catalogs from the BFS API instead. This should be more stable and gives access to more catalog metadata. In particular, the functions now directly return the BFS number, which simplifies the general workflow (see the updated README). I am thinking about adding more catalog metadata.

The new version will be pushed to CRAN soon.

Best, Félix

lbertela commented 1 month ago

Hello, Thank you so much for the quick fix! It's great to hear that the API integration will offer more stability. I’m really excited about the expanded access to metadata, the usability of this package keeps getting better! Best, Ludovic

lbertela commented 1 month ago

Hello again,

When calling bfs_get_catalog_data() and bfs_get_catalog_tables() with language "de", for example, both return a data.frame limited to 350 lines. Additionally, in both data.frames the "number_asset" is now unique for each line.

Would it be possible to extend the search to all available tables or data, rather than only 350? Having access to the correct number_asset in bfs_get_catalog_tables() would help greatly when downloading the data with bfs_download_asset().

Thanks again!

Ludovic

lgnbhl commented 1 month ago

Hello Ludovic,

Thanks a lot for catching this bug regarding the "number_asset" variable! I just pushed a quick fix, now in BFS version 0.5.10. As usual you can access it with:

remotes::install_github("lgnbhl/BFS")

Regarding the limit of 350 lines, I guess it is an API limit. I will see if it is possible to bypass it in a new patch of BFS.
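Until then, a quick way to tell whether a result was probably truncated is to compare its row count with the cap. This is only a minimal sketch: the cap of 350 is the value observed in this thread, not a documented API limit, and the helper name `hit_row_cap` is made up for illustration (it is not part of BFS).

```r
# Hypothetical helper, not part of the BFS package.
# Returns TRUE when a catalog result has at least `cap` rows, which suggests
# that the result may have been truncated. The default of 350 is the limit
# observed in this thread, not a documented value.
hit_row_cap <- function(catalog, cap = 350L) {
  stopifnot(is.data.frame(catalog))
  nrow(catalog) >= cap
}

# Offline demonstration with dummy data frames
small <- data.frame(number_asset = as.character(1:10))
full  <- data.frame(number_asset = as.character(1:350))
hit_row_cap(small)  # FALSE
hit_row_cap(full)   # TRUE

# With the package (needs network access):
# hit_row_cap(BFS::bfs_get_catalog_data(language = "de"))
```

A TRUE here does not prove truncation (a catalog could contain exactly 350 entries), so it is best read as a hint to narrow the query.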

I will push this hotfix to CRAN soon.

Please let me know if you are interested in any other new features for the BFS R package (feel free to create a new GitHub issue for them if they are not related to this bug).

Best regards, Félix

lgnbhl commented 1 month ago

An idea to get the full data catalog could be to loop over a given argument, for example prodima (possibly dates could work too) using purrr::pmap_dfr():

# themes_names <- c("Statistical basis and overviews 00", "Population 01", "Territory and environment 02", "Work and income 03", "National economy 04", "Prices 05", "Industry and services 06", "Agriculture and forestry 07", "Energy 08", "Construction and housing 09", "Tourism 10", "Mobility and transport 11", "Money, banks and insurance 12", "Social security 13", "Health 14", "Education and science 15", "Culture, media, information society, sports 16", "Politics 17", "General Government and finance 18", "Crime and criminal justice 19", "Economic and social situation of the population 20", "Sustainable development, regional and international disparities 21")
themes_prodima <- c(900001, 900010, 900035, 900051, 900075, 900084, 900092, 900104, 900127, 900140, 900160, 900169, 900191, 900198, 900210, 900212, 900214, 900226, 900239, 900257, 900269, 900276)

library(BFS)
library(purrr)

catalog_all <- purrr::pmap_dfr(
  .l = list(language = "de", prodima = themes_prodima),
  .f = bfs_get_catalog_data
)
# A tibble: 760 × 9
   title                   language publication_date number_asset order_nr url_px language_available
   <chr>                   <chr>    <date>           <chr>        <chr>    <chr>  <list>            
 1 Heiraten und Heiratshä… de       2024-09-26       32506838     px-x-01… https… <chr [4]>         
 2 Lebendgeburten nach Mo… de       2024-09-26       32506840     px-x-01… https… <chr [4]>         
 3 Scheidungen und Scheid… de       2024-09-26       32506841     px-x-01… https… <chr [4]>         
 4 Todesfälle nach Monat … de       2024-09-26       32506839     px-x-01… https… <chr [4]>         
 5 Männliche Vornamen der… de       2024-08-23       32187356     px-x-01… https… <chr [4]>         
 6 Weibliche Vornamen der… de       2024-08-23       32187357     px-x-01… https… <chr [4]>         
 7 Auswanderung der ständ… de       2024-08-22       32208056     px-x-01… https… <chr [4]>         
 8 Auswanderung der ständ… de       2024-08-22       32208055     px-x-01… https… <chr [4]>         
 9 Auswanderung der ständ… de       2024-08-22       32208061     px-x-01… https… <chr [4]>         
10 Auswanderung der ständ… de       2024-08-22       32208057     px-x-01… https… <chr [4]>         
# ℹ 750 more rows
# ℹ 2 more variables: url_structure_json <chr>, damId <int>
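Since each per-theme request could itself run into the row limit, it may be worth checking the row count of every theme before binding the results. The sketch below is one possible way to do that in base R. The wrapper name `get_catalog_by_theme`, the `fetch` argument and the `cap` default are assumptions made for illustration; only bfs_get_catalog_data() with its `language` and `prodima` arguments comes from the code above. Whether a table can appear under more than one theme is also an assumption, so the de-duplication step is purely defensive.

```r
# Hypothetical wrapper, not part of the BFS package.
# `fetch` is injectable so the logic can be tried without network access;
# by default it is BFS::bfs_get_catalog_data(), called with the `language`
# and `prodima` arguments used above.
get_catalog_by_theme <- function(prodima_ids, language = "de",
                                 fetch = BFS::bfs_get_catalog_data,
                                 cap = 350L) {
  parts <- lapply(prodima_ids, function(id) {
    fetch(language = language, prodima = id)
  })
  # Flag themes whose result reached the assumed row cap
  n_rows <- vapply(parts, nrow, integer(1))
  capped <- prodima_ids[n_rows >= cap]
  if (length(capped) > 0) {
    warning("Possibly truncated themes: ", paste(capped, collapse = ", "))
  }
  out <- do.call(rbind, parts)
  # Defensive: drop rows whose number_asset was already returned by another theme
  out[!duplicated(out$number_asset), , drop = FALSE]
}

# Offline demonstration with a stub instead of the real API call
fake_fetch <- function(language, prodima) {
  data.frame(
    title = paste("table", prodima, 1:2),
    language = language,
    number_asset = as.character(c(prodima, 1)),  # asset "1" repeats across themes
    stringsAsFactors = FALSE
  )
}
demo <- get_catalog_by_theme(c(900001, 900010), fetch = fake_fetch)
nrow(demo)  # 3

# With the package (needs network access):
# catalog_all <- get_catalog_by_theme(themes_prodima)
```

Note that purrr has marked pmap_dfr() as superseded since version 1.0.0 in favour of pmap() followed by list_rbind(); the base R version here avoids the dependency. For tibbles with list columns, dplyr::bind_rows() may be a more robust choice than rbind().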

lgnbhl commented 1 month ago

Hello Ludovic,

I have added a new argument named return_raw that gives access to all the metadata in a raw / unstructured form when calling bfs_get_catalog_data() and bfs_get_catalog_tables(). I have updated the README to explain how to use it with an example.

This new feature is now in BFS version 0.5.11 on GitHub (soon on CRAN). As usual you can access it with: remotes::install_github("lgnbhl/BFS")

You can also access all the metadata of the full data catalog like this:

themes_prodima <- c(900001, 900010, 900035, 900051, 900075, 900084, 900092, 900104, 900127, 900140, 900160, 900169, 900191, 900198, 900210, 900212, 900214, 900226, 900239, 900257, 900269, 900276)

library(BFS)
library(purrr)

purrr::pmap_dfr(
  .l = list(language = "de", prodima = themes_prodima, return_raw = TRUE), # added "return_raw" here
  .f = bfs_get_catalog_data
)
# A tibble: 760 × 5
   ids$uuid        $contentId $gnp  $damId bfs$embargo description$titles$m…¹ shop$orderNr links
   <chr>                <int> <chr>  <int> <chr>       <chr>                  <chr>        <lis>
 1 ef70eb19-9384-…     325772 2024… 3.25e7 2024-09-26… Heiraten und Heiratsh… px-x-010202… <df> 
 2 32069ba3-1cb4-…     189095 2024… 3.25e7 2024-09-26… Lebendgeburten nach M… px-x-010202… <df> 
 3 5a8b2ea1-e23b-…     325776 2024… 3.25e7 2024-09-26… Scheidungen und Schei… px-x-010202… <df> 
 4 66f3d4f6-edfc-…     189065 2024… 3.25e7 2024-09-26… Todesfälle nach Monat… px-x-010202… <df> 
 5 51dfa1cf-2199-…   13807205 2024… 3.22e7 2024-08-23… Männliche Vornamen de… px-x-010405… <df> 
 6 b65c9036-b000-…   13807212 2024… 3.22e7 2024-08-23… Weibliche Vornamen de… px-x-010405… <df> 
 7 38a86458-22d5-…     189124 2024… 3.22e7 2024-08-22… Auswanderung der stän… px-x-010302… <df> 
 8 6426823f-cb31-…     189120 2024… 3.22e7 2024-08-22… Auswanderung der stän… px-x-010302… <df> 
 9 7f9d861c-81aa-…     189087 2024… 3.22e7 2024-08-22… Auswanderung der stän… px-x-010302… <df> 
10 20fec7fa-cbe5-…     325764 2024… 3.22e7 2024-08-22… Auswanderung der stän… px-x-010302… <df> 
# ℹ 750 more rows
# ℹ abbreviated name: ¹​description$titles$main
# ℹ 14 more variables: ids$languageCopyId <int>, bfs$lifecycle <df[,4]>, $lifecycleGroup <chr>,
#   $provisional <lgl>, $articleModel <df[,4]>, $articleModelGroup <df[,4]>,
#   $lastUpdatedVersion <chr>, description$titles$sub <chr>,
#   description$categorization <df[,13]>, $bibliography <df[,2]>, $shortSummary <df[,2]>,
#   $language <chr>, $abstractShort <chr>, shop$stock <lgl>
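The raw result nests data frames inside its columns (ids, bfs, description, shop), as the `$` prefixes in the printout show. Below is a minimal sketch of pulling a few of those fields back into a flat table. The column paths are read off the printout above and may change if the API changes; the helper name `flatten_raw_catalog` and the output column names are hypothetical.

```r
# Hypothetical helper, not part of the BFS package.
# Extracts a few nested fields from a `return_raw = TRUE` result.
# The paths (description$titles$main, shop$orderNr, ids$damId, bfs$embargo)
# are taken from the printed output in this thread.
flatten_raw_catalog <- function(raw) {
  data.frame(
    title    = raw$description$titles$main,
    order_nr = raw$shop$orderNr,
    dam_id   = raw$ids$damId,
    embargo  = raw$bfs$embargo,
    stringsAsFactors = FALSE
  )
}

# Offline demonstration: a small mock with the same nesting
raw_demo <- data.frame(row = 1:2)
raw_demo$ids  <- data.frame(uuid = c("a", "b"), damId = c(101L, 102L))
raw_demo$bfs  <- data.frame(embargo = c("2024-09-26", "2024-08-23"))
raw_demo$shop <- data.frame(orderNr = c("px-x-1", "px-x-2"))
desc <- data.frame(row = 1:2)
desc$titles <- data.frame(main = c("Title one", "Title two"))
raw_demo$description <- desc

flat <- flatten_raw_catalog(raw_demo)
flat$title  # "Title one" "Title two"
```

Keeping the extraction in one small function means only one place needs updating if the raw structure changes.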

If this feature fixes this GitHub issue, feel free to close it.