What about using data.table::fread? This seemed to handle the TSVs, see https://github.com/glottobank/Grambank/pull/1277#issuecomment-811835750
I'm trying it out now, with limited success. I'm running into trouble with duplicate names and am trying to chase down what that might mean.
Here's what it looks like in case anyone has any ideas. I'll continue debugging.
library(tidyverse)
library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns[1:100] %>%
  map_df(function(x)
    data.table::fread(x, fill = TRUE, sep = "\t") %>%
      mutate(across(everything(), as.character)))
Error: Can't transform a data frame with duplicate names.
Run `rlang::last_error()` to see where the error occurred.
@chrzyki have you used data.table::fread() like this before?
Found duplicate col names in AR_kung1259. Deleting the superfluous one.
Same with CB-PE-AS_duun1241. I'll be going through and checking the entire dataset bit by bit for duplicate colnames.
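(A minimal sketch of automating that check: read only the header row of each sheet via fread's nrows = 0 and flag duplicated names, where trimws() also catches names that differ only by stray whitespace; path and pattern copied from the pipe above.)

library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = TRUE)

# keep only the sheets whose header contains a duplicated column name;
# trimws() also catches names that differ only by trailing whitespace,
# e.g. "Contributed_Datapoints " vs. "Contributed_Datapoints"
dup_sheets <- Filter(function(fn) {
  header <- names(data.table::fread(fn, sep = "\t", nrows = 0, fill = TRUE))
  any(duplicated(trimws(header)))
}, fns)

basename(dup_sheets)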
Pushing tidying of colnames to this branch. Must have happened when we added the col "Contributed_Datapoints" to all sheets via join, in cases where the existing col was already there. Hm.
Now my R session is crashing when I try to read in sheets 1:250.
This sheet is making my R session crash in this pipe: CB-PE-AS_muru1266
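(For anyone hitting the same thing: a crash takes the whole session down before any error surfaces, so a minimal, hypothetical way to find the culprit is to print each filename before reading it - the last name printed identifies the offending sheet.)

library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = TRUE)

# print each filename before attempting to read it; if the session dies,
# the last name printed is the sheet that triggered the crash
for (x in fns[1:250]) {
  cat(x, "\n")
  invisible(data.table::fread(x, sep = "\t", fill = TRUE))
}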
Hm. csvstat does just fine with this sheet.
It's to do with the text delimiter being ' instead of ", I think. Very similar to the issue I had with other read* functions in R with these sheets before.
I can see differences in col alignment when I open this sheet with the text delimiter set to ' versus ".
It seems fread isn't doing anything that different from the other functions I've tried before. I am a bit at a loss as to what to do now.
I'm also unsure of how the CLDF dumping dealt with multiple cols called "Contributed_Datapoints".
Did you try:
data.table::fread(file = x, sep = '\t', encoding = 'UTF-8', quote = "\"", header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
I was just about to try quote = "".
I'll try both.
Did you try:
data.table::fread(file = x, sep = '\t', encoding = 'UTF-8', quote = "\"", header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
This also crashes my R session.
quote = "\""
is default
effectively disabling quotes with quote = ""
works, in the sense that the pipe finishes without an error, but with 26 warnings
This currently works:

library(tidyverse)
library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns[1:239] %>%
  map_df(function(x)
    data.table::fread(x,
                      encoding = 'UTF-8',
                      quote = "", header = TRUE,
                      fill = TRUE, blank.lines.skip = TRUE,
                      sep = "\t") %>%
      mutate(filename = basename(x)) %>%
      mutate(across(everything(), as.character))) %>%
  dplyr::select(filename, Feature_ID, Value, Source, Comment, Contributed_Datapoints, Contributed_datapoints)

I'll be on the lookout for misplaced cells, though.
Now it breaks because of something with basename() instead.
I don't really know if I should be spending more time on this, or if I should just learn how to use csvstat instead.
It does worry me that there are duplicate colnames and variations in text delim. I'd like that to not be the case, but I realise that might be a lot to ask.
There are no variations in text delimiters (if you mean quotes). grambank reads all sheets with exactly the same settings and can round-trip the files.
It worries me that CSV reading in R seems so fragile :)
Something in "JLA_yaka1277.tsv" breaks basename somehow. trying to debug.
There are no variations in text delimiters (if you mean quotes). grambank reads all sheets with exactly the same settings and can round-trip the files.
Okay. It looks to me like CB-PE-AS_muru1266 doesn't have the same text delims as most other sheets.
It worries me that CSV reading in R seems so fragile :)
I've never had trouble like this before, but I also don't usually read in large swaths of files edited by multiple different people. I like that I'm getting these error messages; they are informative to me.
Oh, I wouldn't complain about error messages. These are definitely good. It's somewhat surprising that there are errors, though.
Sure.
I'm getting stuck once again here with basename on JLA_yaka1277 and I haven't yet found out why.
While I'm debugging that, when you have time, can you check that CB-PE-AS_muru1266 is being read in correctly for the last 10 features in the grambank CLDF dump, and also tell me what happens when there are multiple cols called "Contributed_Datapoints"?
>>> import pathlib
>>> from csvw.dsv import reader
>>> for i, row in enumerate(reader('original_sheets/CB-PE-AS_muru1266.tsv', delimiter='\t')):
...     if i == 0:
...         n = len(row)
...     else:
...         assert n == len(row)
...
$ grambank describe original_sheets/CB-PE-AS_muru1266.tsv
Path:
original_sheets/CB-PE-AS_muru1266.tsv
Check:
Dimensions:
195 rows
7 columns
Values:
0 102
1 52
? 38
3 3
195
Sources:
Oates 1988: 195
Coders:
HunterGatherer Hunter-Gatherer Language Database : 195
JG Jemima Goodall: 150
JV Judith Voss: 34
That looks normal, thanks.
Can you see what I mean about the last few rows if you try opening this file in libreoffice with text delim set to " and then set to ' ?
$ csvstat -t -c 5,8 original_sheets/CB-PE-AS_duun1241.tsv
5. "Contributed_Datapoints "
Type of data: Text
Contains null values: True (excluded from calculations)
Unique values: 2
Longest value: 2 characters
Most common values: None (194x)
JV (1x)
8. "Contributed_Datapoints"
Type of data: Boolean
Contains null values: True (excluded from calculations)
Unique values: 1
Most common values: None (195x)
Row count: 195
grambank describe says:
Coders:
HunterGatherer Hunter-Gatherer Language Database : 75
so I guess the last column with the name (the empty one) takes precedence.
Okay. That might be a problem then.
I think I got all the duplicate colnames now. See #1509. Pew pew pew!
Okay, with the duplicate colnames sorted, this now works:
library(tidyverse)
library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns %>%
  map_df(function(x)
    data.table::fread(x,
                      encoding = 'UTF-8',
                      quote = "", header = TRUE,
                      fill = TRUE, blank.lines.skip = TRUE,
                      sep = "\t") %>%
      mutate(across(everything(), as.character)) %>%
      mutate(filename = basename(x)))
libreoffice seems to get the minimal quoting wrong, too. The row for GB403 has a comment that starts with ". But this isn't quoting the field.
libreoffice seems to get the minimal quoting wrong, too. The row for GB403 has a comment that starts with ". But this isn't quoting the field.
Huh, oh really? That's funny.
' as quote character only appears better, because no such conflict arises with it.
The comment starts with "palka-" is. A parser could assume the cell is quoted and start parsing. But once it encounters the second " - which is neither escaped nor followed by the cell delimiter - it would have to decide that the cell is not quoted.
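(A small R illustration of that ambiguity; the cell text is a made-up stand-in for the real comment, and the point is just to compare how the same line parses with quoting enabled versus disabled.)

library(data.table)

# a tab-separated line whose last cell starts with an unmatched-looking quote,
# mimicking the malformed minimal quoting described above
line <- 'GB403\t1\t"palka-" is how it is marked'

# with the default quote character, the parser has to guess whether
# the cell is quoted, and may mangle it or warn
data.table::fread(text = line, sep = "\t", header = FALSE, quote = "\"")

# with quoting disabled, the raw text survives intact
data.table::fread(text = line, sep = "\t", header = FALSE, quote = "")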
Just round-tripped CB-PE-AS_muru1266.tsv. So while the python csv parser reads the file mostly correctly, it writes it differently - basically correcting the malformed quoting.
We could ask Jemima how she created the sheet.
library('data.table')

fns <- list.files("original_sheets/", pattern = ".*tsv", full.names = T)

vals <- 0
sheets <- 0

# count rows and sheets across all files that have a Feature_ID column
lapply(fns, function(x) {
  sheet <- data.table::fread(file = x, sep = '\t', encoding = 'UTF-8', quote = "\"", header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
  if ("Feature_ID" %in% colnames(sheet)) {
    vals <<- vals + nrow(sheet)
    sheets <<- sheets + 1
  }
})

print(sheets)
print(vals)
Somehow when I run:

library(tidyverse)
library(data.table)
library(reshape2)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns %>%
  map_df(function(x)
    data.table::fread(x,
                      encoding = 'UTF-8',
                      quote = "", header = TRUE,
                      fill = TRUE, blank.lines.skip = TRUE,
                      sep = "\t") %>%
      mutate(across(everything(), as.character)) %>%
      mutate(filename = basename(x)))

one of the cells in the col filename becomes "Unclear if verb and if productive or not." :D. Which sure looks like a Comment field.
Debugging R reading continues for me, I'm afraid.
That mutate(across(everything() ...)) sounds scary.
Also, you are back to quote = "" rather than quote = "\"". Why?
You're right, I can get back to default quoting and that solves it.
I need to convert all the cols to character, otherwise they won't join because they're of different classes.
It seems a bit backwards if you have to undo excess smartness applied by fread, i.e. the default for reading stuff from CSV should be to return strings, I'd think.
It seems a bit backwards if you have to undo excess smartness applied by fread, i.e. the default for reading stuff from CSV should be to return strings, I'd think.
The alternative seems to be specifying a col class for each col when it's being read in, which differs for each sheet, which makes things break. With tidyverse functions, I know how to give it a default col class for any col it reads in, but I haven't figured out what the argument for that looks like for data.table::fread(). I don't usually use data.table, but I'll give it some time and try to figure it out. For now, the problem was solved by using the default quote settings again.
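(For the record: fread does take a colClasses argument like read.csv does, and assuming it recycles a single class over all columns the same way - which is how I read the docs - forcing everything to character at read time would look like this; the filename is just an example.)

library(data.table)

# colClasses = "character" asks fread to type every column as character,
# so sheets can later be joined without class conflicts
sheet <- data.table::fread("../Grambank/original_sheets/AR_kung1259.tsv",
                           sep = "\t", encoding = "UTF-8",
                           header = TRUE, fill = TRUE,
                           colClasses = "character")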
FYI, I just ran into a similar issue with readr when reading in parameters.csv over at grambank-cldf. I changed to fread in the relevant place, and things seem to be working smoothly now. I'm guessing that if other users try to read in that file with readr they'd get a warning about parsing, because that file seems to only have quotes for one of the columns and not the others. I think that's what's confusing readr in that instance.
Sometimes I want to bind all TSV sheets in original_sheets together for various checks, but because of some issues with readr::read_tsv() and the various small differences between our sheets I can't confidently do that. pygrambank conflict contains part of the code to do that. Could I have a simple version of pygrambank conflict that only makes one big TSV of all TSVs? Something like pygrambank tsv_rbind?
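(Until something like pygrambank tsv_rbind exists, here is a hedged sketch of doing the same bind in R with data.table alone - rbindlist(fill = TRUE) tolerates the small column differences between sheets; paths and the output filename are placeholders.)

library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = TRUE)

# read every sheet as character so column types can't clash, tag each row
# with its source file, then stack them even when columns differ
sheets <- lapply(fns, function(x) {
  dt <- data.table::fread(x, sep = "\t", encoding = "UTF-8",
                          header = TRUE, fill = TRUE,
                          colClasses = "character")
  dt[, filename := basename(x)]
  dt
})

All_raw <- data.table::rbindlist(sheets, use.names = TRUE, fill = TRUE)
data.table::fwrite(All_raw, "all_sheets.tsv", sep = "\t")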