Open PaulaSpinola opened 4 years ago
Hi Paula,
This may not the best way to do it, but I think I manage to open it with:
Results here: certificados_cnrm_AM.xlsx
Is this what you want?
Many thanks Albert. I already tried that but then another problem comes up. Some individual level data which should be shown in a unique row are broken into two rows (e.g. rows 8 & 9, rows 17 & 18, rows 35 & 36). I am actually trying to write a VBA code to correct that, but am struggling on that (I might write another post on it if that is ok). I also try the other way around of what you suggested. Download the excel file from the system after making sure the Portuguese language is active in excel spreadsheets. It didn't work either.
Hmmm, I don't know VBA, but if the problem now is just the "broken rows", this can be fixed using other programming languages as I think the main reason for this is due to incorrect page break in column (2 and 6). Here is an example using R:
library(readxl)
library(dplyr)
library(stringr)
file <- "https://github.com/ucl-ihi/CodeClub/files/3530647/certificados_cnrm_AM.xlsx"
tmp <- tempfile(fileext = ".xlsx")
httr::GET(url = file, httr::write_disk(tmp))
df <- read_excel(tmp, skip = 1)
df_neat <- df %>%
# remove column 3 & 13 (importing issue)
select(-3, -13) %>%
# concatenate value column 2 & 6 with the next row's value
# if the next row's value in column 1 (ID) is missing (NA)
mutate_at(vars(2,6),
~if_else(is.na(lead(df[[1]])),
paste(., lead(.)), .)) %>%
# remove duplicated whitespace and \n character
mutate_if(is.character, str_squish) %>%
# remove rows with missing column 1 (ID) value
filter(!is.na(.[[1]]))
# Fix column name
old_cols <- colnames(df) %>% str_squish
new_cols <- c(old_cols[1],
str_split(old_cols[3], " ")[[1]],
old_cols[11], old_cols[13])
colnames(df_neat) <- new_cols
writexl::write_xlsx(df_neat, "OUTFILE.xlsx")
Results: OUTFILE.xlsx
You may need to reformat some of the columns (e.g. change Date column format to "Short Date) in excel to get the output that you want.
It might be easier to work with the PDF - there are various ways to extract tables from a PDF - e.g. https://pdftables.com - or lots of advice if you Google!
I need to download some text data in Portuguese which comes with accents (e.g. ã, í, é, ...). The website gives me the option to download the file either in excel or pdf. Although I want the data in excel, it reads the accents with error whereas pdf doesn't (the files are attached). Would anyone know how to fix such errors in excel? Many thanks Paula
certificados_cnrm_AM.xls.xlsx certificados_cnrm_AM.pdf