Error when aggregating then splitting: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position

gdaudin commented 4 years ago

When I run the following two Python scripts back to back:

aggregate_sources_in_bdd_centrale.py and split_bdd_centrale_in_sources.py

I get this error:

guillaumedaudin@Oronte scripts % python3 /Users/guillaumedaudin/Documents/Recherche/Commerce\ International\ Français\ XVIIIe.xls/Balance\ du\ commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py
Traceback (most recent call last):
  File "/Users/guillaumedaudin/Documents/Recherche/Commerce International Français XVIIIe.xls/Balance du commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py", line 59, in <module>
    existingfiles[filepath] = sum((1 for _ in f)) - 1
  File "/Users/guillaumedaudin/Documents/Recherche/Commerce International Français XVIIIe.xls/Balance du commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py", line 59, in <genexpr>
    existingfiles[filepath] = sum((1 for _ in f)) - 1
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

gdaudin commented 4 years ago

Yet <grep -axv '.*' bdd_centrale.csv> does not detect any non-UTF-8 characters.
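A complementary check can be done from Python itself, reading the file as raw bytes and reporting where UTF-8 decoding fails. This is only a minimal sketch: the function name find_invalid_utf8 is hypothetical and not part of the repository's scripts. Note also that the traceback shows the failing read happening on a file referred to as filepath, which may not be bdd_centrale.csv, so it may be worth running the check on the source files too.

```python
def find_invalid_utf8(path):
    """Return (offset, byte value) for each byte that breaks UTF-8 decoding.

    Hypothetical helper for illustration; offsets are absolute positions
    in the file, counted from 0.
    """
    with open(path, "rb") as f:
        data = f.read()

    problems = []
    start = 0
    while start < len(data):
        try:
            # Try to decode everything that has not been checked yet.
            data[start:].decode("utf-8")
            break  # the rest of the file is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so shift it back.
            offset = start + err.start
            problems.append((offset, data[offset]))
            # Skip the offending byte and keep scanning.
            start = offset + 1
    return problems
```

Calling find_invalid_utf8("bdd_centrale.csv") returns an empty list when the whole file decodes cleanly; otherwise each tuple gives the position and value of an offending byte.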

gdaudin commented 4 years ago

The error also occurs with the August 12th version of the production branch on my computer. Might it be a machine-specific bug?

paulgirard commented 4 years ago

OK, so I don't get this on my Linux machine. I removed an unused dependency and added some explicit arguments to the line that breaks. I will try to reuse what you did before rolling back tomorrow.
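The thread does not say which arguments were added. One plausible reading is an explicit mode and encoding on open(), so that the result no longer depends on platform defaults. The sketch below assumes that reading and also names the file that fails, since the original traceback does not; count_data_lines is a hypothetical name, not the script's actual code.

```python
def count_data_lines(filepath, encoding="utf-8"):
    """Count the lines after the header, naming the file if decoding fails.

    Hypothetical helper for illustration only.
    """
    try:
        # Mode and encoding are spelled out instead of relying on defaults.
        with open(filepath, "r", encoding=encoding) as f:
            return sum(1 for _ in f) - 1
    except UnicodeDecodeError as err:
        # Re-raise with the path so the offending file can be identified.
        raise ValueError(f"cannot decode {filepath} as {encoding}: {err}") from err
```

With this wrapper, a failure reports the path of the file being read, which helps when the script loops over many source files.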

gdaudin commented 4 years ago

I pulled and still have the issue:

guillaumedaudin@Oronte base % python3 /Users/guillaumedaudin/Documents/Recherche/Commerce\ International\ Français\ XVIIIe.xls/Balance\ du\ commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py
Traceback (most recent call last):
  File "/Users/guillaumedaudin/Documents/Recherche/Commerce International Français XVIIIe.xls/Balance du commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py", line 58, in <module>
    existingfiles[filepath] = sum((1 for _ in f)) - 1
  File "/Users/guillaumedaudin/Documents/Recherche/Commerce International Français XVIIIe.xls/Balance du commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py", line 58, in <genexpr>
    existingfiles[filepath] = sum((1 for _ in f)) - 1
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
guillaumedaudin@Oronte base %

gdaudin commented 4 years ago

I am not sure I understood what you meant by "I will try to reuse what you did before rolling back tomorrow."

Could you simply add the two columns/variables to bdd_centrale.csv and do the split? I will take care of the schema and of filling in the values. Please put them between value_minus_unit_val_x_qty and trade_deficit.
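For illustration, inserting empty columns right after value_minus_unit_val_x_qty could be sketched as follows. The thread does not name the two new columns, so the function insert_columns and whatever names are passed in new_columns are hypothetical placeholders.

```python
import csv


def insert_columns(src_path, dst_path, new_columns,
                   after="value_minus_unit_val_x_qty"):
    """Copy a CSV, adding empty columns right after the column named `after`.

    Hypothetical helper: `new_columns` stands in for the two columns the
    thread mentions without naming.
    """
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames)
        position = fieldnames.index(after) + 1
        # Splice the new names into the header at the requested position.
        fieldnames[position:position] = new_columns

        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            # restval="" leaves the new columns empty in every row.
            writer = csv.DictWriter(dst, fieldnames=fieldnames, restval="")
            writer.writeheader()
            for row in reader:
                writer.writerow(row)
```

Since trade_deficit is the column that follows value_minus_unit_val_x_qty in the request above, the new columns end up between the two, ready to be filled in later.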

gdaudin commented 4 years ago

Though I admit this is unsatisfying...

gdaudin commented 4 years ago

It works now.

gdaudin commented 4 years ago

The bug is back @paulgirard

Traceback (most recent call last):
  File "/Users/guillaumedaudin/Documents/Recherche/Commerce International Français XVIIIe.xls/Balance du commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py", line 59, in <module>
    existingfiles[filepath] = sum((1 for _ in blouf)) - 1
  File "/Users/guillaumedaudin/Documents/Recherche/Commerce International Français XVIIIe.xls/Balance du commerce/Retranscriptions_Commerce_France/toflit18_data_GIT/scripts/split_bdd_centrale_in_sources.py", line 59, in <genexpr>
    existingfiles[filepath] = sum((1 for _ in blouf)) - 1
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/csv.py", line 110, in __next__
    self.fieldnames
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 571: invalid start byte

gdaudin commented 4 years ago

So I replaced r+ with r and it works. Gulp, gulp.
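The thread does not establish why switching from r+ to r made the error go away, so the sketch below only illustrates the change itself: counting rows needs read access only, and "r" requests exactly that. The function name count_rows is hypothetical, and the use of csv.DictReader is inferred from the csv.py frames in the last traceback.

```python
import csv


def count_rows(filepath):
    """Count the data rows of a CSV file opened read-only.

    Hypothetical helper for illustration only.
    """
    # "r" opens the file for reading only; "r+" would also request write
    # access, which is not needed just to count rows.
    with open(filepath, "r", encoding="utf-8", newline="") as f:
        # DictReader consumes the header as field names, so iterating
        # yields data rows only.
        return sum(1 for _ in csv.DictReader(f))
```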

medialab / toflit18_data

Error when aggregating then splitting: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position #24