grambank / pygrambank

Apache License 2.0

function for big tsv-binding #43

Closed HedvigS closed 3 years ago

HedvigS commented 3 years ago

Sometimes I want to bind all the tsv sheets in original_sheets together for various checks, but because of some issues with reader:read_tsv() and the various small differences between our sheets I can't confidently do that. pygrambank conflict already contains part of the code to do this. Could I have a simple version of pygrambank conflict that only makes one big tsv of all the tsvs? Something like pygrambank tsv_rbind?
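A sketch of what such a command could do, in plain Python with only the standard library (the name tsv_rbind and the column-union behaviour are my assumptions here, not existing pygrambank API):

```python
import csv

def rbind_tsv(files):
    """Concatenate rows from several TSV files into one table.

    ``files`` is a sequence of (name, file-like) pairs. The columns are
    the union of all sheet headers plus a 'filename' column recording
    which sheet each row came from; cells missing from a sheet stay empty.
    """
    rows, columns = [], []
    for name, f in files:
        reader = csv.DictReader(f, delimiter='\t')
        for col in reader.fieldnames:
            if col not in columns:
                columns.append(col)
        for row in reader:
            row['filename'] = name
            rows.append(row)
    columns.append('filename')
    return columns, [[r.get(c, '') for c in columns] for r in rows]
```

A real subcommand would walk original_sheets/ and write the result out as one big tsv; this only shows the binding step.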

xrotwang commented 3 years ago

What about using data.table::fread? This seemed to handle the tsvs, see https://github.com/glottobank/Grambank/pull/1277#issuecomment-811835750

HedvigS commented 3 years ago

I'm trying it out now, with limited success. I run into trouble with duplicate names. Trying to chase down what that might mean.

HedvigS commented 3 years ago

Here's what it looks like in case anyone has any ideas. I'll continue debugging.

library(tidyverse)
library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns[1:100] %>% 
  map_df(
    function(x) data.table::fread(x, fill = TRUE, sep = "\t") %>% 
      mutate(across(everything(), as.character)))

 Error: Can't transform a data frame with duplicate names.
Run `rlang::last_error()` to see where the error occurred. 
HedvigS commented 3 years ago

@chrzyki have you used data.table::fread() like this before?

HedvigS commented 3 years ago

Found duplicate col names in AR_kung1259. Deleting the superfluous one.

HedvigS commented 3 years ago

Same with CB-PE-AS_duun1241. I'll be going through and checking the entire dataset bit by bit for duplicate colnames.
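Hunting these down can be scripted; a minimal Python sketch (assuming the first line of each sheet is the header, and treating names that differ only by surrounding whitespace as duplicates):

```python
import csv

def duplicate_columns(header):
    """Return column names occurring more than once in a header row.

    Names are stripped first, so 'Contributed_Datapoints ' (with a
    trailing space) counts as a duplicate of 'Contributed_Datapoints'.
    """
    seen, dupes = set(), []
    for col in (c.strip() for c in header):
        if col in seen and col not in dupes:
            dupes.append(col)
        seen.add(col)
    return dupes
```

Looping this over the first row of csv.reader(open(f), delimiter='\t') for all of original_sheets/*.tsv would list every affected sheet.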

HedvigS commented 3 years ago

Pushing tidying of colnames to this branch. Must have happened when we added the col "Contributed_Datapoints" to all sheets via a join, in cases where the col was already there. Hm.

HedvigS commented 3 years ago

Now my R session is crashing when I'm trying to read in sheets 1:250.

HedvigS commented 3 years ago

This sheet is making my R-session crash in this pipe: CB-PE-AS_muru1266

xrotwang commented 3 years ago

Hm. csvstat does just fine with this sheet.

HedvigS commented 3 years ago

It's to do with the text delimiter being ' instead of ", I think. Very similar to issues I've had with other read* functions in R with these sheets before.

HedvigS commented 3 years ago

I can see differences in col alignment when I open this sheet with the text delimiter set to ' versus ".

HedvigS commented 3 years ago

It seems fread isn't doing anything that different from the other functions I've tried before. I'm a bit at a loss as to what to do now.

HedvigS commented 3 years ago

I'm also unsure of how the CLDF-dumping dealt with multiple cols called "Contributed_Datapoints".

xrotwang commented 3 years ago

Did you try:

data.table::fread(file = x, sep = '\t', encoding = 'UTF-8', quote = "\"", header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
HedvigS commented 3 years ago

I was just about to try quote = "".

I'll try both.

HedvigS commented 3 years ago

Did you try:

data.table::fread(file = x, sep = '\t', encoding = 'UTF-8', quote = "\"", header = TRUE, fill = TRUE, blank.lines.skip = TRUE)

This also crashes my R session.

HedvigS commented 3 years ago

quote = "\"" is default

HedvigS commented 3 years ago

Effectively disabling quotes with quote = "" works, in the sense that the pipe finishes without an error, but with 26 warnings.
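For comparison, this is what disabling quote handling means in Python's csv module: with quoting=csv.QUOTE_NONE the quote character becomes an ordinary character, so a stray " at the start of a cell can no longer shift the column alignment. A sketch of the idea (the row content is made up):

```python
import csv
import io

# A made-up row whose comment cell merely starts with a quote character.
raw = 'GB403\t"palka- is a verbaliser\tSource X\n'

# Default quote handling: the leading " opens a "quoted" field, so the
# tab after it is swallowed into the cell and the columns shift.
defaulted = next(csv.reader(io.StringIO(raw), delimiter='\t'))

# QUOTE_NONE treats " as ordinary text, like quote = "" in fread.
unquoted = next(csv.reader(io.StringIO(raw), delimiter='\t',
                           quoting=csv.QUOTE_NONE))
```

The flip side, as discussed below, is that legitimately quoted fields (e.g. comments containing tabs or newlines) then break instead.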

HedvigS commented 3 years ago

This:

library(tidyverse)
library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns[1:239] %>% 
  map_df(
    function(x) data.table::fread(x,
                                  encoding = 'UTF-8', 
                                  quote = "", header = TRUE, 
                                  fill = TRUE, blank.lines.skip = TRUE,
                                  sep = "\t") %>% 
      mutate(filename = basename(x)) %>% 
      mutate(across(everything(), as.character))) %>% 
  dplyr::select(filename, Feature_ID, Value, Source, Comment, Contributed_Datapoints, Contributed_datapoints)

currently works. I'll be on the lookout though for misplaced cells.

HedvigS commented 3 years ago

Now it breaks because of something with basename() instead.

I don't really know if I should be spending more time on this, or if I should just learn how to use csvstat instead.

It does worry me that there are duplicate colnames and variations in text delimiters. I'd like that not to be the case, but I realise that might be a lot to ask.

xrotwang commented 3 years ago

There are no variations in text delimiters (if you mean quotes). grambank reads all sheets with exactly the same settings and can round-trip the files.

xrotwang commented 3 years ago

It worries me that CSV reading in R seems so fragile :)

HedvigS commented 3 years ago

Something in "JLA_yaka1277.tsv" breaks basename() somehow. Trying to debug.

HedvigS commented 3 years ago

There are no variations in text delimiters (if you mean quotes). grambank reads all sheets with exactly the same settings and can round-trip the files.

Okay. It looks to me like CB-PE-AS_muru1266 doesn't have the same text delims as most other sheets.

HedvigS commented 3 years ago

It worries me that CSV reading in R seems so fragile :)

I've never had trouble like this before, but then I also don't usually read in large swaths of files edited by multiple different people. I like that I'm getting these error messages; they're informative to me.

xrotwang commented 3 years ago

Oh, I wouldn't complain about error messages. These are definitely good. It's somewhat surprising that there are errors, though.

HedvigS commented 3 years ago

Sure.

I'm getting stuck once again here with basename on JLA_yaka1277 and I haven't yet found out why.

While I'm debugging that, when you have time, can you check that CB-PE-AS_muru1266 is being read in correctly for the last 10 features in the grambank cldf dump, and also tell me what happens when there are multiple cols called "Contributed_Datapoints"?

xrotwang commented 3 years ago
>>> from csvw.dsv import reader
>>> for i, row in enumerate(reader('original_sheets/CB-PE-AS_muru1266.tsv', delimiter='\t')):
...     if i == 0:
...         n = len(row)
...     else:
...         assert n == len(row)
... 
xrotwang commented 3 years ago
$ grambank describe original_sheets/CB-PE-AS_muru1266.tsv 

Path:
original_sheets/CB-PE-AS_muru1266.tsv

Check:

Dimensions:
195 rows
7 columns

Values:
0   102
1    52
?    38
3     3
    195

Sources:
Oates 1988: 195

Coders:
HunterGatherer  Hunter-Gatherer Language Database : 195
JG  Jemima Goodall: 150
JV  Judith Voss:     34
HedvigS commented 3 years ago

$ grambank describe original_sheets/CB-PE-AS_muru1266.tsv

Path:
original_sheets/CB-PE-AS_muru1266.tsv

Check:

Dimensions:
195 rows
7 columns

Values:
0   102
1    52
?    38
3     3
    195

Sources:
Oates 1988: 195

Coders:
HunterGatherer  Hunter-Gatherer Language Database : 195
JG  Jemima Goodall: 150
JV  Judith Voss:     34

That looks normal, thanks.

Can you see what I mean about the last few rows if you try opening this file in libreoffice with text delim set to " and then set to ' ?

xrotwang commented 3 years ago
$ csvstat -t -c 5,8 original_sheets/CB-PE-AS_duun1241.tsv
  5. "Contributed_Datapoints "

    Type of data:          Text
    Contains null values:  True (excluded from calculations)
    Unique values:         2
    Longest value:         2 characters
    Most common values:    None (194x)
                           JV (1x)

  8. "Contributed_Datapoints"

    Type of data:          Boolean
    Contains null values:  True (excluded from calculations)
    Unique values:         1
    Most common values:    None (195x)

Row count: 195

grambank describe says:

Coders:
HunterGatherer  Hunter-Gatherer Language Database :  75

so I guess the last column with the name (the empty one) takes precedence.
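For what it's worth, that matches how Python's csv.DictReader resolves duplicate fieldnames: each row becomes a dict keyed by column name, so the value from the last column with a given name overwrites the earlier ones. A small sketch (the cell values are made up, and whether grambank's loader uses exactly this mechanism I can't say):

```python
import csv
import io

# Two columns share the name Contributed_Datapoints; the second is empty.
sheet = io.StringIO(
    "Feature_ID\tContributed_Datapoints\tContributed_Datapoints\n"
    "GB020\tJV\t\n"
)
rows = list(csv.DictReader(sheet, delimiter='\t'))
# rows[0]['Contributed_Datapoints'] holds the empty value from the
# last duplicate column; the 'JV' from the earlier column is lost.
```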

HedvigS commented 3 years ago
$ csvstat -t -c 5,8 original_sheets/CB-PE-AS_duun1241.tsv
  5. "Contributed_Datapoints "

  Type of data:          Text
  Contains null values:  True (excluded from calculations)
  Unique values:         2
  Longest value:         2 characters
  Most common values:    None (194x)
                         JV (1x)

  8. "Contributed_Datapoints"

  Type of data:          Boolean
  Contains null values:  True (excluded from calculations)
  Unique values:         1
  Most common values:    None (195x)

Row count: 195

grambank describe says:

Coders:
HunterGatherer    Hunter-Gatherer Language Database :  75

so I guess the last column with the name (the empty one) takes precedence.

Okay. That might be a problem then.

HedvigS commented 3 years ago

I think I got all the duplicate colnames now. See #1509 Pew pew pew!

HedvigS commented 3 years ago

Okay, with the duplicate colnames sorted, this now works:


library(tidyverse)
library(data.table)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns %>% 
  map_df(
    function(x) data.table::fread(x,
                                  encoding = 'UTF-8', 
                                  quote = "", header = TRUE, 
                                  fill = TRUE, blank.lines.skip = TRUE,
                                  sep = "\t") %>% 
      mutate(across(everything(), as.character) %>% 
      mutate(filename = basename(x)))
)
xrotwang commented 3 years ago

libreoffice seems to get the minimal quoting wrong, too. The row for GB403 has a comment that starts with ". But this isn't quoting the field.

HedvigS commented 3 years ago

libreoffice seems to get the minimal quoting wrong, too. The row for GB403 has a comment that starts with ". But this isn't quoting the field.

Huh, oh really? That's funny.

xrotwang commented 3 years ago

' as quote character only appears better because no such conflict arises with it.

xrotwang commented 3 years ago

The comment starts with "palka-". A parser could assume the cell is quoted and start parsing it as such. But once it encounters the second " - which is neither escaped nor followed by the cell delimiter - it would have to decide that the cell is not quoted after all.

xrotwang commented 3 years ago

Just round-tripped CB-PE-AS_muru1266.tsv. So while the python csv parser reads the file mostly correctly, it writes it differently - basically correcting the malformed quoting.
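The correction on writing can be reproduced with the csv module alone: a field that contains a quote character gets quoted and escaped on output, so the rewritten line is unambiguous (the cell text here is made up):

```python
import csv
import io

# A cell whose text starts with a literal quote character.
row = ['GB403', '"palka-" is unclear']
out = io.StringIO()
csv.writer(out, delimiter='\t').writerow(row)
written = out.getvalue()

# The field is now wrapped in quotes with the inner quotes doubled,
# and reading the corrected line back yields the original cell values.
back = next(csv.reader(io.StringIO(written), delimiter='\t'))
```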

xrotwang commented 3 years ago

We could ask Jemima how she created the sheet.

xrotwang commented 3 years ago
library('data.table')

fns <- list.files("original_sheets/", pattern = ".*tsv", full.names = T)
vals <- 0
sheets <- 0
lapply(fns, function(x){
    sheet <- data.table::fread(file = x, sep = '\t', encoding = 'UTF-8', quote = "\"", header = TRUE, fill = TRUE, blank.lines.skip = TRUE)
    if("Feature_ID" %in% colnames(sheet))
    {
        vals <<- vals + nrow(sheet)
        sheets <<- sheets + 1
    }

})

print(sheets)
print(vals)
HedvigS commented 3 years ago

Somehow when I run:


library(tidyverse)
library(data.table)
library(reshape2)

fns <- list.files("../Grambank/original_sheets/", pattern = ".*.tsv", full.names = T)

All_raw <- fns %>% 
  map_df(
    function(x) data.table::fread(x,
                                  encoding = 'UTF-8', 
                                  quote = "", header = TRUE, 
                                  fill = TRUE, blank.lines.skip = TRUE,
                                  sep = "\t") %>% 
      mutate(across(everything(), as.character) %>% 
      mutate(filename = basename(x))
      ) 
) 

One of the cells in the col filename becomes "Unclear if verb and if productive or not." :D. Which sure looks like a Comment field.

Debugging R reading continues for me, I'm afraid.

xrotwang commented 3 years ago

That mutate(across(everything() ...)) sounds scary.

xrotwang commented 3 years ago

Also you are back to quote = "" rather than quote = "\"". Why?

HedvigS commented 3 years ago

You're right, I can get back to default quoting and that solves it.

I need to convert all the cols to character otherwise they won't join because they're of different classes.

xrotwang commented 3 years ago

It seems a bit backwards if you have to undo excess smartness applied by fread, i.e. the default for reading stuff from CSV should be to return strings, I'd think.
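Python's csv module behaves that way by default: every cell comes back as a str and any type conversion is opt-in, which sidesteps the class-mismatch problem when binding sheets. A minimal illustration:

```python
import csv
import io

sheet = io.StringIO("Feature_ID\tValue\nGB020\t1\n")
rows = list(csv.reader(sheet, delimiter='\t'))
# '1' stays the string '1'; nothing guesses a numeric type per column.
```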

HedvigS commented 3 years ago

It seems a bit backwards if you have to undo excess smartness applied by fread, i.e. the default for reading stuff from CSV should be to return strings, I'd think.

The alternative seems to be specifying a col class for each col as it's being read in, which differs per sheet and so makes things break. With tidyverse functions I know how to give a default col class to every col read in, but I haven't yet figured out what the argument for that looks like for data.table::fread(). I don't usually use data.table, but I'll give it some time and try to figure it out. For now the problem was solved by using the default quote settings again.

HedvigS commented 3 years ago

FYI, I just ran into a similar issue with readr when reading in parameters.csv over at grambank-cldf. I changed to fread in the relevant place, and things seem to be working smoothly now. I'm guessing that if other users try to read in that file with readr they'd get a parsing warning, because that file seems to quote only one of the columns and not the others. I think that's what's confusing readr in that instance.