ucl-ihi / CodeClub

Repository for all things IHI CodeClub
https://ucl-ihi.github.io/CodeClub/
3 stars 5 forks source link

Fixing accents errors in excel #14

Open PaulaSpinola opened 4 years ago

PaulaSpinola commented 4 years ago

I need to download some text data in Portuguese which comes with accents (e.g. ã, í, é, ...). The website gives me the option to download the file either in excel or pdf. Although I want the data in excel, it reads the accents with error whereas pdf doesn't (the files are attached). Would anyone know how to fix such errors in excel? Many thanks Paula

certificados_cnrm_AM.xls.xlsx certificados_cnrm_AM.pdf

alhenry commented 4 years ago

Hi Paula,

This may not the best way to do it, but I think I manage to open it with:

  1. Download the .pdf version
  2. Open in Adobe Acrobat (I use Adobe Acrobat Pro DC ver 19.12.20036 in MacOS X)
  3. Select File > Export to > spread sheet > Format: Excel Workbook
  4. Select Settings
  5. Set Language to "Brazilian Portuguese"
Screenshot 2019-08-22 15 39 49

Results here: certificados_cnrm_AM.xlsx

Is this what you want?

PaulaSpinola commented 4 years ago

Many thanks Albert. I already tried that but then another problem comes up. Some individual level data which should be shown in a unique row are broken into two rows (e.g. rows 8 & 9, rows 17 & 18, rows 35 & 36). I am actually trying to write a VBA code to correct that, but am struggling on that (I might write another post on it if that is ok). I also try the other way around of what you suggested. Download the excel file from the system after making sure the Portuguese language is active in excel spreadsheets. It didn't work either.

image

alhenry commented 4 years ago

Hmmm, I don't know VBA, but if the problem now is just the "broken rows", this can be fixed using other programming languages as I think the main reason for this is due to incorrect page break in column (2 and 6). Here is an example using R:

library(readxl)
library(dplyr)
library(stringr)

file <- "https://github.com/ucl-ihi/CodeClub/files/3530647/certificados_cnrm_AM.xlsx"
tmp <- tempfile(fileext = ".xlsx")
httr::GET(url = file, httr::write_disk(tmp))

df <- read_excel(tmp, skip = 1)

df_neat <- df %>%
  # remove column 3 & 13 (importing issue)
  select(-3, -13) %>% 
  # concatenate value column 2 & 6 with the next row's value
  # if the next row's value in column 1 (ID) is missing (NA)
  mutate_at(vars(2,6),
            ~if_else(is.na(lead(df[[1]])),
                     paste(., lead(.)), .)) %>% 
  # remove duplicated whitespace and \n character
  mutate_if(is.character, str_squish) %>% 
  # remove rows with missing column 1 (ID) value 
  filter(!is.na(.[[1]]))

# Fix column name
old_cols <- colnames(df) %>% str_squish
new_cols <- c(old_cols[1],
              str_split(old_cols[3], " ")[[1]],
              old_cols[11], old_cols[13])
colnames(df_neat) <- new_cols

writexl::write_xlsx(df_neat, "OUTFILE.xlsx")

Results: OUTFILE.xlsx

You may need to reformat some of the columns (e.g. change Date column format to "Short Date) in excel to get the output that you want.

danlewer commented 4 years ago

It might be easier to work with the PDF - there are various ways to extract tables from a PDF - e.g. https://pdftables.com - or lots of advice if you Google!