ncvo / charity-commission-extract

Python utilities for handling the import of data from the Charity Commission data extract
43 stars 21 forks source link

Extract from RegPlusExtract_May_2020 extract_charity.csv and extract_objects.csv corrupt #12

Open ukfreetv opened 4 years ago

ukfreetv commented 4 years ago

Sorry to be a pain.

Using "RegPlusExtract_May_2020.zip"

If you follow the instructions to run the Python scripts to convert the ".bcp" files to ".csv" files, the script corrupts the extract_charity.csv and extract_objects.csv files.

The issue seems to be linefeed or CR in some fields, the code is leaving them as CRLF, which is the record-end marker for CSV.

I've tried code to stick them back together, but this results in some lines missing fields that are unknown.

ukfreetv commented 4 years ago

Tried doing a direct import into MySQL of the BCP file, but similar problems...

mysql> LOAD DATA local INFILE 'extract_charity.bcp' INTO TABLE extract_charity CHARACTER SET BINARY FIELDS TERMINATED BY '@@' LINES TERMINATED BY '@@'; Query OK, 368474 rows affected, 65535 warnings (21.99 sec) Records: 368474 Deleted: 0 Skipped: 0 Warnings: 1105472**

mysql> LOAD DATA local INFILE 'extract_objects.bcp' INTO TABLE extract_objects CHARACTER SET BINARY FIELDS TERMINATED BY '@@' LINES TERMINATED BY '@@'; Query OK, 640671 rows affected, 2703 warnings (8.93 sec) Records: 640671 Deleted: 0 Skipped: 0 Warnings: 2703**

drkane commented 4 years ago

Sorry to be a pain.

Using "RegPlusExtract_May_2020.zip"

If you follow the instructions to run the Python scripts to convert the ".bcp" files to ".csv" files, the script corrupts the extract_charity.csv and extract_objects.csv files.

The issue seems to be linefeed or CR in some fields, the code is leaving them as CRLF, which is the record-end marker for CSV.

I've tried code to stick them back together, but this results in some lines missing fields that are unknown.

Hi @ukfreetv - I've had similar issues worth line feeds in the data before - but you should be able to alter the CSV import function to deal with it, using the same parameters as the output function.

One alternative is using the bcp-reader module which just streams the fields, even with line breaks.