Windows reading unicode off of Google sheet causes error during `--check`

yamyyao commented 2 years ago

When using a Google sheet as a input CSV file on Windows, certain characters cause an error during --check.

Here's the Google sheet, where the cell G28 is causing issues with the value of "தொகுதி 1 - தொடக்கத்தில்".

The output is

OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
Extracting CSV data from https://docs.google.com/spreadsheets/d/1DV0Ka0ZafZq3RgCn0x_AetFtlmnMuNFltzck2QL0ApY/edit?usp=sharing (worksheet gid 994664908) to ../repo_islandora_demo_objects\google_sheet.csv.
OK, CSV file ../repo_islandora_demo_objects\google_sheet.csv found.
Error: Row 27 (ID 19) of the CSV file "../repo_islandora_demo_objects\google_sheet.csv" has more columns (18) than there are headers (18).

This error doesn't appear if the cell is replaced with English characters. This error also doesn't appear when using a local CSV with the same values.

config.yml

task: create
host: "https://islandora.traefik.me"
input_csv: 'https://docs.google.com/spreadsheets/d/1DV0Ka0ZafZq3RgCn0x_AetFtlmnMuNFltzck2QL0ApY/edit?usp=sharing'
google_sheets_gid: 994664908
input_dir: ../repo_islandora_demo_objects
allow_missing_files: true
additional_files:
- Transcript: 7
- Supplemental_PDF: 2
allow_adding_terms: true

mjordan commented 2 years ago

In your input_dir there should be a CSV version of your Google Sheet ending in .processed. If you open it in a text editor, what is in that cell?

yamyyao commented 2 years ago

It doesn't appear to have the row at all. Edit: I realize that I should probably elaborate and say that it stops at the row just before that cell

mjordan commented 2 years ago

In Google Sheets UI, if you download as a CSV, what do you see?

yamyyao commented 2 years ago

It all seems correct and that specific cell seems to display properly when opening with Visual Studio Code.

When opening it with Excel, that specific cell shows the value à®¤à¯Šà®•à¯à®¤à®¿ 1 - à®¤à¯Šà®Ÿà®•à¯à®•à®¤à¯à®¤à®¿à®²à¯, but I feel this might be more of an issue with Excel and CSV files rather than anything else.

mjordan commented 2 years ago

To narrow down where the failure is happening, can you use that CSV as Workbench's input to see if it works?

yamyyao commented 2 years ago

It seems to give the same error message as when using the Google sheet:

OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
OK, CSV file ../repo_islandora_demo_objects\islandora_sheet_download.csv found.
Error: Row 27 (ID 19) of the CSV file "../repo_islandora_demo_objects\islandora_sheet_download.csv" has more columns (18) than there are headers (18).

mjordan commented 2 years ago

OK, then it's probably the Python CSV parser or writer. Let me investigate.

mjordan commented 2 years ago

The CSV writer (which writes out the .processed version of the CSV/spreadsheet) is using the default encoding, which on Windows is likely not UTF-8. I'll need to specify UTF-8. If I add a fix this evening, can I ask you to test?

yamyyao commented 2 years ago

Sure, just let me know what you need me to do.

mjordan commented 2 years ago

@yamyyao can you check out the issue-366 branch and test please?

yamyyao commented 2 years ago

Running it on the same sheet containing the problematic characters gave an output of:

OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
Extracting CSV data from https://docs.google.com/spreadsheets/d/1DV0Ka0ZafZq3RgCn0x_AetFtlmnMuNFltzck2QL0ApY/edit?usp=sharing (worksheet gid 994664908) to ../repo_islandora_demo_objects\google_sheet.csv.
OK, CSV file ../repo_islandora_demo_objects\google_sheet.csv found.
Traceback (most recent call last):
  File "workbench", line 1096, in <module>
    check_input(config, args)
  File "C:\Amy\Workbench\islandora_workbench\workbench_utils.py", line 993, in check_input
    csv_column_headers = csv_data.fieldnames
  File "C:\Users\AmyY\AppData\Local\Programs\Python\Python38-32\lib\csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "C:\Users\AmyY\AppData\Local\Programs\Python\Python38-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5223: character maps to <undefined>

Position 5223 happens before the original problematic characters, and... looks to be a space? I'm not sure why it's not giving an error before that position though, considering there are spaces before that point in the file.

mjordan commented 2 years ago

If you open the . processed file in a text editor/IDE and reveal non-printable characters, can you tell if there's a character in that position that might be causing the error? Also, check to see that the encoding of the file is in fact utf-8.

yamyyao commented 2 years ago

I just now noticed that perhaps I misread earlier when it came to the .processed file? In my input_dir I see a two CSVs, a google_sheet.csv and a google_sheet.csv.prepocessed. When you talk about the .processed file, which one do you mean?

The encoding of both files seem to be utf-8, and within both of them it looks like the character in that position is a space. That position sits in the middle of the field_description for an object, so it's just some English text? It doesn't look like it should be a problematic character.

mjordan commented 2 years ago

google_sheet.csv.prepocessed is the one that Workbench actually uses. If you make a copy of that one and configure it to be the input_csv file, what happens?

Can you change line 2573 in workbench_utils.py (branch 366) from encoding="utf-8-sig" to encoding="utf-8" to see if that makes a difference?

yamyyao commented 2 years ago

Using the .prepocessed as the input_csv does not make a difference and gives the same output. Changing the line to encoding="utf-8" also did not make a difference.

mjordan commented 2 years ago

@yamyyao I'm now testing on Windows. In the CSV file exported by Google Sheets, I got a little further by replacing the smart quotes in record 13's "field_description" column with regular quotes (around "Trade of Canada Commodity Classification"). Can you change those quotes in the source Google Sheet and try again?

yamyyao commented 2 years ago

Changing the quotations on the sheet into regular quotes (") or even removing the quotes entirely didn't change anything about the output. However, when I switched to using a local CSV and changing the quotes into regular quotes, it returned this:

OK, connection to Drupal at https://islandora.traefik.me verified.
OK, configuration file has all required values (did not check for optional values).
OK, CSV file ../repo_islandora_demo_objects\google_sheet.csv found.
Error: Row 12 (ID 199) of the CSV file "../repo_islandora_demo_objects\google_sheet.csv" has more columns (19) than there are headers (18).

I notice that the entire field_description column's information is surrounded by straight quotes in the preprocessed CSV, so maybe this new error is because of that?

mjordan / islandora_workbench

Windows reading unicode off of Google sheet causes error during `--check` #366

config.yml