ropensci / c14bazAAR

R Package - Download and Prepare C14 Dates from Different Source Databases
https://docs.ropensci.org/c14bazAAR
GNU General Public License v2.0

wrong database encodings #146

Open MartinHinz opened 3 years ago

MartinHinz commented 3 years ago
>>> file -i PA20110001_S01.txt                                     
PA20110001_S01.txt: text/plain; charset=iso-8859-1

This leads to "wrong" site names.
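
For illustration, a minimal R sketch of why this happens and how converting from the reported encoding helps. The temporary file and the site name "Höhle" are invented for the example, and this is not necessarily how the package handles it.

```r
# Invented example: "Höhle" stored as ISO-8859-1, where "ö" is the single byte 0xf6.
latin1_bytes <- as.raw(c(0x48, 0xf6, 0x68, 0x6c, 0x65, 0x0a))
tmp <- tempfile(fileext = ".txt")
writeBin(latin1_bytes, tmp)

# readLines() does not re-encode its input, so the lone byte 0xf6 ends up in
# the string as-is. That byte sequence is not valid UTF-8.
raw_name <- readLines(tmp)
validUTF8(raw_name)

# Converting explicitly from Latin-1 ("latin1" is R's portable name for
# ISO-8859-1) yields a proper UTF-8 string.
fixed_name <- iconv(raw_name, from = "latin1", to = "UTF-8")
validUTF8(fixed_name)
```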

nevrome commented 3 years ago

Should be fixed in v2.4.2

MartinHinz commented 3 years ago

Thanks! Same is true for palmisano. Will report other errors as they appear.

nevrome commented 3 years ago

Palmisano will probably be superseded by aida (#144) soon. But keep them coming anyway. And feel free to make PRs right away.

nevrome commented 3 years ago

Hm - ok - palmisano is not superseded by aida.

So what is the correct encoding of the radiocarbon.csv file in the palmisano db zip archive?

file -i radiocarbon.csv 
radiocarbon.csv: text/plain; charset=unknown-8bit

That's not helpful.

Not many site names are obviously affected. I only see two, so we could overwrite them manually.

Grotta dell�۪Orso
Osteria dell�۪Osa Necropolis
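
A manual overwrite of these two names could look roughly like the R sketch below. It assumes the intended spellings are "Grotta dell'Orso" and "Osteria dell'Osa Necropolis" with a plain apostrophe; the `site` vector and the stand-in junk bytes are made up for the example and are not taken from the actual file.

```r
# Stand-in for the broken names: the apostrophe is replaced by a few
# arbitrary non-ASCII bytes (an assumption, not the real bytes in the file).
junk <- rawToChar(as.raw(c(0xd5, 0xdb, 0xaa)))
site <- c(
  paste0("Grotta dell", junk, "Orso"),
  paste0("Osteria dell", junk, "Osa Necropolis"),
  "Some Other Site"
)

# Hypothetical lookup: pattern that identifies a broken name -> intended name.
fixes <- c(
  "^Grotta dell.+Orso$"            = "Grotta dell'Orso",
  "^Osteria dell.+Osa Necropolis$" = "Osteria dell'Osa Necropolis"
)

# Match byte-wise so the invalid bytes do not trip up the regex engine,
# then overwrite every matching entry with the intended spelling.
for (pattern in names(fixes)) {
  hit <- grepl(pattern, site, perl = TRUE, useBytes = TRUE)
  site[hit] <- fixes[[pattern]]
}
```

Matching on the intact ASCII parts of each name avoids having to type the broken bytes into the source code.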

A lot of the citation strings are heavily broken, though.

MartinHinz commented 3 years ago

I have tried 228 encodings; none of them worked for the source in the first line (Skeates). I assume it is simply gibberish?
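
A brute-force check of this kind can be sketched in R roughly as follows. This is only an illustration of the idea, not the code that was actually run; `try_encodings` is a hypothetical helper and the example bytes are invented.

```r
# Hypothetical helper: decode the same bytes with every candidate encoding
# and keep the conversions that iconv() does not reject.
try_encodings <- function(bytes, encodings = iconvlist()) {
  x <- rawToChar(bytes)
  decoded <- vapply(encodings, function(enc) {
    out <- tryCatch(
      suppressWarnings(iconv(x, from = enc, to = "UTF-8")),
      error = function(e) NA_character_  # encoding not supported on this system
    )
    if (length(out) != 1L) NA_character_ else out
  }, character(1))
  decoded[!is.na(decoded)]
}

# Invented example: "Höhle" as Latin-1 bytes, tested against a short list.
bytes <- as.raw(c(0x48, 0xf6, 0x68, 0x6c, 0x65))
candidates <- try_encodings(bytes, c("latin1", "UTF-8"))
names(candidates)
```

Single-byte encodings such as Latin-1 assign a character to every byte value, so a conversion that succeeds is no proof that the encoding is right. The candidates still have to be checked by eye.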

The following did not work:

  [1] "437"                 "850"                 "852"                 "855"                
  [5] "857"                 "860"                 "861"                 "862"                
  [9] "863"                 "865"                 "866"                 "869"                
 [13] "ARMSCII-8"           "ATARI"               "ATARIST"             "CP-GR"              
 [17] "CP-IS"               "CP1046"              "CP1124"              "CP1125"             
 [21] "CP1129"              "CP1133"              "CP1163"              "CP1250"             
 [25] "CP1251"              "CP1252"              "CP1254"              "CP1256"             
 [29] "CP1257"              "CP1258"              "CP154"               "CP437"              
 [33] "CP737"               "CP775"               "CP819"               "CP850"              
 [37] "CP852"               "CP853"               "CP855"               "CP857"              
 [41] "CP858"               "CP860"               "CP861"               "CP862"              
 [45] "CP863"               "CP864"               "CP865"               "CP866"              
 [49] "CP869"               "CP922"               "CP932"               "CP943"              
 [53] "CSHPROMAN8"          "CSIBM1163"           "CSIBM855"            "CSIBM857"           
 [57] "CSIBM860"            "CSIBM861"            "CSIBM863"            "CSIBM864"           
 [61] "CSIBM865"            "CSIBM866"            "CSIBM869"            "CSISOLATIN1"        
 [65] "CSISOLATIN2"         "CSISOLATIN3"         "CSISOLATIN4"         "CSISOLATIN5"        
 [69] "CSISOLATIN6"         "CSISOLATINCYRILLIC"  "CSKOI8R"             "CSMACINTOSH"        
 [73] "CSPC775BALTIC"       "CSPC850MULTILINGUAL" "CSPC862LATINHEBREW"  "CSPC8CODEPAGE437"   
 [77] "CSPCP852"            "CSPTCP154"           "CSSHIFTJIS"          "CSVISCII"           
 [81] "CYRILLIC"            "CYRILLIC-ASIAN"      "GEORGIAN-ACADEMY"    "GEORGIAN-PS"        
 [85] "HP-ROMAN8"           "HZ"                  "HZ-GB-2312"          "IBM-1163"           
 [89] "IBM-CP1133"          "IBM1163"             "IBM437"              "IBM775"             
 [93] "IBM819"              "IBM850"              "IBM852"              "IBM855"             
 [97] "IBM857"              "IBM860"              "IBM861"              "IBM862"             
[101] "IBM863"              "IBM864"              "IBM865"              "IBM866"             
[105] "IBM869"              "ISO_8859-1"          "ISO_8859-1:1987"     "ISO_8859-10"        
[109] "ISO_8859-10:1992"    "ISO_8859-13"         "ISO_8859-14"         "ISO_8859-14:1998"   
[113] "ISO_8859-15"         "ISO_8859-15:1998"    "ISO_8859-16"         "ISO_8859-16:2001"   
[117] "ISO_8859-2"          "ISO_8859-2:1987"     "ISO_8859-3"          "ISO_8859-3:1988"    
[121] "ISO_8859-4"          "ISO_8859-4:1988"     "ISO_8859-5"          "ISO_8859-5:1988"    
[125] "ISO_8859-9"          "ISO_8859-9:1989"     "ISO-8859-1"          "ISO-8859-10"        
[129] "ISO-8859-13"         "ISO-8859-14"         "ISO-8859-15"         "ISO-8859-16"        
[133] "ISO-8859-2"          "ISO-8859-3"          "ISO-8859-4"          "ISO-8859-5"         
[137] "ISO-8859-9"          "ISO-CELTIC"          "ISO-IR-100"          "ISO-IR-101"         
[141] "ISO-IR-109"          "ISO-IR-110"          "ISO-IR-144"          "ISO-IR-148"         
[145] "ISO-IR-157"          "ISO-IR-179"          "ISO-IR-199"          "ISO-IR-203"         
[149] "ISO-IR-226"          "ISO8859-1"           "ISO8859-10"          "ISO8859-13"         
[153] "ISO8859-14"          "ISO8859-15"          "ISO8859-16"          "ISO8859-2"          
[157] "ISO8859-3"           "ISO8859-4"           "ISO8859-5"           "ISO8859-9"          
[161] "JAVA"                "KOI8-R"              "KOI8-RU"             "KOI8-T"             
[165] "KOI8-U"              "L1"                  "L10"                 "L2"                 
[169] "L3"                  "L4"                  "L5"                  "L6"                 
[173] "L7"                  "L8"                  "LATIN-9"             "LATIN1"             
[177] "LATIN10"             "LATIN2"              "LATIN3"              "LATIN4"             
[181] "LATIN5"              "LATIN6"              "LATIN7"              "LATIN8"             
[185] "MAC"                 "MACCENTRALEUROPE"    "MACCROATIAN"         "MACCYRILLIC"        
[189] "MACGREEK"            "MACHEBREW"           "MACICELAND"          "MACINTOSH"          
[193] "MACROMAN"            "MACROMANIA"          "MACTHAI"             "MACTURKISH"         
[197] "MACUKRAINE"          "MS_KANJI"            "MS-ANSI"             "MS-ARAB"            
[201] "MS-CYRL"             "MS-EE"               "MS-TURK"             "MULELAO-1"          
[205] "NEXTSTEP"            "PT154"               "PTCP154"             "R8"                 
[209] "RISCOS-LATIN1"       "ROMAN8"              "SHIFT_JIS"           "SHIFT_JISX0213"     
[213] "SHIFT-JIS"           "SJIS"                "TCVN"                "TCVN-5712"          
[217] "TCVN5712-1"          "TCVN5712-1:1993"     "VISCII"              "VISCII1.1-1"        
[221] "WINBALTRIM"          "WINDOWS-1250"        "WINDOWS-1251"        "WINDOWS-1252"       
[225] "WINDOWS-1254"        "WINDOWS-1256"        "WINDOWS-1257"        "WINDOWS-1258"

nevrome commented 3 years ago

You're my man, Martin! Impressive dedication! Let's ask the creator of this database then.

Hey, @apalmisano82, sorry for summoning you once again to this repository. We have some trouble with your dataset "Regional Demographic Trends and Settlement Patterns in Central Italy: Archaeological Sites and Radiocarbon Dates". So far we assumed this data to be UTF-8 encoded, but this does not seem to be right. We're getting a lot of broken symbols, especially in the literature column. Martin has now tried a ton of other possible encodings, but none of them matches.

  1. Do you remember which encoding you used or do you have another explanation for this issue?
  2. I checked if all of this data is already in AIDA, so that we could fall back on that. But this also does not seem to be the case. This old dataset seems to have some dates not in AIDA yet (or at least not with the same lab numbers... :thinking:). Is this on purpose?

As always: Thanks for your help!