lucasmation / microdadosBrasil

Reads the most common Brazilian public microdata (CENSO, PNAD, etc.) easily and quickly

bug in POF: file types and data_path not working #173

Open raphael-gouvea opened 6 years ago

raphael-gouvea commented 6 years ago

After the update to include POF 1987/88 and 1995/96, the package no longer works for POF 2002. I think POF_files_metadata_harmonization.csv needs review, and read_POF seems to have a bug when building data_path. See the reprex below.

library(microdadosBrasil)

get_available_datasets()
#>  [1] "CAGED"                 "CensoEducacaoSuperior"
#>  [3] "CensoEscolar"          "CENSO"                
#>  [5] "ENEM"                  "PME"                  
#>  [7] "PnadContinua"          "PNAD"                 
#>  [9] "PNS"                   "POF"                  
#> [11] "RAIS"
get_available_periods("POF")
#> [1] 2008 2002 1995 1987

# Show that there are more file types than there should be
file_types <- get_available_filetypes("POF",2002)
# POF 2002 should have only 14 file_types
file_types
#>  [1] "aluguel_estimado"               "caderneta_despesa"             
#>  [3] "condicoes_de_vida"              "consumo"                       
#>  [5] "despesa_12meses"                "despesa_90dias"                
#>  [7] "despesa_individual"             "despesa_veiculo"               
#>  [9] "domicilio"                      "inventario"                    
#> [11] "morador_imput"                  "morador"                       
#> [13] "outras_despesas"                "outros_rendimentos"            
#> [15] "rendimentos"                    "servico_domestico"             
#> [17] "despesa_esp"                    "despesa_6meses"                
#> [19] "despesas_bens_duraveis_credito"

# Show that there is no dictionary for non-existent ft
get_import_dictionary("POF",2002,ft="aluguel_estimado")
#> Error in get_import_dictionary("POF", 2002, ft = "aluguel_estimado"): There is no available dictionary for this year. You can help to expand the package creating the dictionary, see more information at https://github.com/lucasmation/microdadosBrasil

aluguel<-read_POF(2002,ft="aluguel_estimado",root_path = "~/Desktop/teste_microBrasil/")
#> You have specified the 'root_path' argument, in this case we will assume that data is in the directory specified and it is exactly as it have been downloaded from the source.
#> Error in read_data(dataset = "POF", ft = ft, i = i, root_path = root_path, : Data not found. Check if you have unziped the data

# Show that even with existent ft there is a bug in the data_path
get_import_dictionary("POF",2002,ft="domicilio")
#>    int_pos   var_name     x label length decimal_places fin_pos col_type
#> 1        3         uf    2.    NA      2              0       4        i
#> 2        5        seq    3.    NA      3              0       7        i
#> 3        8         dv    1.    NA      1              0       8        i
#> 4        3   controle    6.    NA      6              0       8        i
#> 5        9      domcl    2.    NA      2              0      10        i
#> 6       11    estrato    2.    NA      2              0      12        i
#> 7       13  fator_set  11.5    NA     11              5      23        d
#> 8       24      fator  11.5    NA     11              5      34        d
#> 9       35         pt    2.    NA      2              0      36        i
#> 10      37    pt_real    2.    NA      2              0      38        i
#> 11      39  n_morador    2.    NA      2              0      40        i
#> 12      41       tipo    1.    NA      1              0      41        i
#> 13      42  n_comodos    2.    NA      2              0      43        i
#> 14      44     n_dorm    2.    NA      2              0      45        i
#> 15      46     n_banh    2.    NA      2              0      47        i
#> 16      48     a_agua    1.    NA      1              0      48        i
#> 17      49     esgoto    1.    NA      1              0      49        i
#> 18      50  cond_ocup    1.    NA      1              0      50        i
#> 19      51 e_eletrica    1.    NA      1              0      51        i
#> 20      52       piso    1.    NA      1              0      52        i
#> 21      53     pavrua    1.    NA      1              0      53        i
#> 22      54   temp_mor    1.    NA      1              0      54        i
#> 23      55   quant_uc    1.    NA      1              0      55        i
#> 24      56   contrato    1.    NA      1              0      56        i
#> 25      57      renda 12.4;    NA     12              4      68        d
#>     CHAR
#> 1  FALSE
#> 2  FALSE
#> 3  FALSE
#> 4  FALSE
#> 5  FALSE
#> 6  FALSE
#> 7  FALSE
#> 8  FALSE
#> 9  FALSE
#> 10 FALSE
#> 11 FALSE
#> 12 FALSE
#> 13 FALSE
#> 14 FALSE
#> 15 FALSE
#> 16 FALSE
#> 17 FALSE
#> 18 FALSE
#> 19 FALSE
#> 20 FALSE
#> 21 FALSE
#> 22 FALSE
#> 23 FALSE
#> 24 FALSE
#> 25 FALSE

domicilio<-read_POF(2002,ft="domicilio",root_path = "~/Desktop/teste_microBrasil")
#> You have specified the 'root_path' argument, in this case we will assume that data is in the directory specified and it is exactly as it have been downloaded from the source.
#> [1] 1 2
#> Time difference of 0.3933437 secs
#> 0 Gb
#> Error in paste0(data_path, names(out), ".txt"): object 'data_path' not found

Created on 2018-07-25 by the reprex package (v0.2.0).

raphael-gouvea commented 6 years ago

I still think the POF_files_metadata_harmonization.csv needs review because of the get_available_filetypes.
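One way to narrow down which entries need review is to check each listed file type for a dictionary. This is a minimal sketch: `has_dictionary` is a hypothetical helper, not part of microdadosBrasil, and it only assumes that the lookup function raises an error when no dictionary exists, as `get_import_dictionary` does in the reprex above.

```r
# Hypothetical helper (not part of microdadosBrasil): returns a named logical
# vector telling, for each file type, whether the lookup succeeded.
has_dictionary <- function(file_types, get_dict) {
  vapply(file_types, function(ft) {
    # try() captures the error raised when no dictionary is available
    !inherits(try(get_dict(ft), silent = TRUE), "try-error")
  }, logical(1))
}

# With the package installed, this could be called as:
# has_dictionary(
#   get_available_filetypes("POF", 2002),
#   function(ft) get_import_dictionary("POF", 2002, ft = ft)
# )
```

File types that come back FALSE are candidates for entries that were added to the harmonization file by mistake.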

The problem with read_POF, however, is simple. Line 206 of import_wrapper_functions.R is invisible(file.remove(paste0(data_path,names(out),".txt"))). The object out is defined inside a preceding if statement that runs only for the years 1987, 1995 and 1997 (I don't understand why 1997 is included). As a result, this line breaks read_POF for 2002 and 2008. I'm not sending a PR because I don't understand why this line is needed. I only commented it out to test, and that does solve the problem.
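The scoping pattern behind this error can be reproduced in isolation. This is a sketch, not the package's code: `read_sketch_buggy` and `read_sketch_fixed` are hypothetical functions, and it assumes that data_path, like out, is created only inside the year-specific branch (which the "object 'data_path' not found" message suggests).

```r
# Illustrative only: these functions are hypothetical and merely mimic the
# structure described above; they are not taken from microdadosBrasil.

read_sketch_buggy <- function(i) {
  if (i %in% c(1987, 1995)) {
    data_path <- paste0(tempdir(), "/")
    out <- list(a = 1, b = 2)
    file.create(paste0(data_path, names(out), ".txt"))
  }
  # Runs for every year, but data_path and out exist only for 1987/1995
  invisible(file.remove(paste0(data_path, names(out), ".txt")))
}

read_sketch_fixed <- function(i) {
  if (i %in% c(1987, 1995)) {
    data_path <- paste0(tempdir(), "/")
    out <- list(a = 1, b = 2)
    tmp <- paste0(data_path, names(out), ".txt")
    file.create(tmp)
    # Cleanup moved inside the branch that defines data_path and out
    invisible(file.remove(tmp))
  }
  invisible(TRUE)
}

# The buggy version fails for 2002; the fixed one does not
buggy_fails <- inherits(try(read_sketch_buggy(2002), silent = TRUE), "try-error")
fixed_ok <- read_sketch_fixed(2002)
```

Moving the cleanup into the branch keeps the temporary-file removal for the older surveys while leaving 2002 and 2008 untouched.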

rafapereirabr commented 5 years ago

It seems there is a similar issue with the POF 2008-2009 data. I get this error when I try to read the data.

# Set working directory
  setwd("R:/Dropbox/bases_de_dados/POF/POF_2008-2009")

# download POF data
  download_sourceData("POF", 2008, unzip = T)

# read POF data layout [This part works fine]
  pof_dic_moradores <- get_import_dictionary(dataset = "POF",i = 2008, ft = "morador")

# read data
  df_moradores <- read_POF(ft = "morador", i = 2008)

> Error in read_data(dataset = "POF", ft = ft, i = i, root_path = root_path,  : 
>  Data not found. Check if you have unziped the data

raphael-gouvea commented 5 years ago

@rafapereirabr, have you checked whether the files were unzipped, as the error message suggests? In your case, the POF 2008 files are compressed as .7z and you need to extract them manually. As I said in my previous comment, if you remove line 206 or move it inside the if statement in import_wrapper_functions.R and rebuild the package, you should be able to use the read_POF function for 2002 and 2008.

rafapereirabr commented 5 years ago

Thank you for the heads up!

luanmugarte commented 2 years ago

I also had a small issue using the read_POF function with the POF 2008-2009 data. I believe it came from the download_sourceData function, which unzipped the microdata files into a folder named "Dadosyyyymmdd" instead of just "Dados", which (I think) read_POF requires to process the data properly. Manually renaming the folder fixed the issue.
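The manual rename could also be scripted. This is a rough sketch: `rename_dados_folder` is a hypothetical helper, not part of microdadosBrasil, and it relies on my guess above that read_POF expects a folder named exactly "Dados" and that the date suffix is eight digits.

```r
# Hypothetical helper (not part of microdadosBrasil): renames a single
# "Dadosyyyymmdd" folder directly under `root` to "Dados".
rename_dados_folder <- function(root) {
  dirs <- list.dirs(root, recursive = FALSE, full.names = TRUE)
  dated <- dirs[grepl("^Dados[0-9]{8}$", basename(dirs))]
  target <- file.path(root, "Dados")
  # Do nothing unless there is exactly one candidate and no existing "Dados"
  if (length(dated) != 1 || dir.exists(target)) {
    return(invisible(FALSE))
  }
  invisible(file.rename(dated, target))
}
```

Calling it with the directory that contains the unzipped folder returns TRUE when a rename took place and FALSE when nothing was changed.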