Open yamyyao opened 2 years ago
In your input_dir there should be a CSV version of your Google Sheet ending in .processed
. If you open it in a text editor, what is in that cell?
It doesn't appear to have the row at all. Edit: I realize that I should probably elaborate and say that it stops at the row just before that cell
In Google Sheets UI, if you download as a CSV, what do you see?
It all seems correct and that specific cell seems to display properly when opening with Visual Studio Code.
When opening it with Excel, that specific cell shows the value தொகà¯à®¤à®¿ 1 - தொடகà¯à®•à®¤à¯à®¤à®¿à®²à¯
, but I feel this might be more of an issue with Excel and CSV files rather than anything else.
To narrow down where the failure is happening, can you use that CSV as Workbench's input to see if it works?
It seems to give the same error message as when using the Google sheet:
OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
OK, CSV file ../repo_islandora_demo_objects\islandora_sheet_download.csv found.
Error: Row 27 (ID 19) of the CSV file "../repo_islandora_demo_objects\islandora_sheet_download.csv" has more columns (18) than there are headers (18).
OK, then it's probably the Python CSV parser or writer. Let me investigate.
The CSV writer (which writes out the .processed version of the CSV/spreadsheet) is using the default encoding, which on Windows is likely not UTF-8. I'll need to specify UTF-8. If I add a fix this evening, can I ask you to test?
Sure, just let me know what you need me to do.
@yamyyao can you check out the issue-366 branch and test please?
Running it on the same sheet containing the problematic characters gave an output of:
OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
Extracting CSV data from https://docs.google.com/spreadsheets/d/1DV0Ka0ZafZq3RgCn0x_AetFtlmnMuNFltzck2QL0ApY/edit?usp=sharing (worksheet gid 994664908) to ../repo_islandora_demo_objects\google_sheet.csv.
OK, CSV file ../repo_islandora_demo_objects\google_sheet.csv found.
Traceback (most recent call last):
File "workbench", line 1096, in <module>
check_input(config, args)
File "C:\Amy\Workbench\islandora_workbench\workbench_utils.py", line 993, in check_input
csv_column_headers = csv_data.fieldnames
File "C:\Users\AmyY\AppData\Local\Programs\Python\Python38-32\lib\csv.py", line 97, in fieldnames
self._fieldnames = next(self.reader)
File "C:\Users\AmyY\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5223: character maps to <undefined>
Position 5223 happens before the original problematic characters, and... looks to be a space? I'm not sure why it's not giving an error before that position though, considering there are spaces before that point in the file.
If you open the . processed file in a text editor/IDE and reveal non-printable characters, can you tell if there's a character in that position that might be causing the error? Also, check to see that the encoding of the file is in fact utf-8.
I just now noticed that perhaps I misread earlier when it came to the .processed file? In my input_dir I see a two CSVs, a google_sheet.csv
and a google_sheet.csv.prepocessed
. When you talk about the .processed file, which one do you mean?
The encoding of both files seem to be utf-8, and within both of them it looks like the character in that position is a space. That position sits in the middle of the field_description for an object, so it's just some English text? It doesn't look like it should be a problematic character.
google_sheet.csv.prepocessed
is the one that Workbench actually uses. If you make a copy of that one and configure it to be the input_csv
file, what happens?
Can you change line 2573 in workbench_utils.py (branch 366) from encoding="utf-8-sig"
to encoding="utf-8"
to see if that makes a difference?
Using the .prepocessed
as the input_csv
does not make a difference and gives the same output. Changing the line to encoding="utf-8"
also did not make a difference.
@yamyyao I'm now testing on Windows. In the CSV file exported by Google Sheets, I got a little further by replacing the smart quotes in record 13's "field_description" column with regular quotes (around "Trade of Canada Commodity Classification"). Can you change those quotes in the source Google Sheet and try again?
Changing the quotations on the sheet into regular quotes ("
) or even removing the quotes entirely didn't change anything about the output. However, when I switched to using a local CSV and changing the quotes into regular quotes, it returned this:
OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
OK, CSV file ../repo_islandora_demo_objects\google_sheet.csv found.
Error: Row 12 (ID 199) of the CSV file "../repo_islandora_demo_objects\google_sheet.csv" has more columns (19) than there are headers (18).
I notice that the entire field_description column's information is surrounded by straight quotes in the preprocessed CSV, so maybe this new error is because of that?
When using a Google sheet as a input CSV file on Windows, certain characters cause an error during
--check
.Here's the Google sheet, where the cell G28 is causing issues with the value of "தொகுதி 1 - தொடக்கத்தில்".
The output is
This error doesn't appear if the cell is replaced with English characters. This error also doesn't appear when using a local CSV with the same values.
config.yml