pelias / csv-importer

Import arbitrary data in CSV format to Pelias
MIT License
23 stars 22 forks source link

Issue with double-quotes chars while trying to import a csv file #110

Closed nick-rv closed 1 week ago

nick-rv commented 1 week ago

Describe the bug I am currently evaluating the pelias tool for geocoding. And i am trying to import a CSV dataset containing french adresses.

But the import job interrupts while a double-quote character is found with the following INVALID_OPENING_QUOTE error :

CsvError: Invalid Opening Quote: a quote is found on field "id;id_fantoir;numero;rep;nom_voie;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;type_position;alias;nom_ld;libelle_acheminement;nom_afnor;source_position;source_nom_voie;certification_commune;cad_parcelles" at line 296718, value is "02727_b017_00001;02727_B017;1;;" at Object.parse (/code/pelias/csv-importer/node_modules/csv-parse/dist/cjs/index.cjs:791:21) at Parser._transform (/code/pelias/csv-importer/node_modules/csv-parse/dist/cjs/index.cjs:1333:26) at Parser.Transform._write (node:internal/streams/transform:205:23) at writeOrBuffer (node:internal/streams/writable:391:12) at _write (node:internal/streams/writable:332:10) at Parser.Writable.write (node:internal/streams/writable:336:10) at ReadStream.ondata (node:internal/streams/readable:754:22) at ReadStream.emit (node:events:513:28) at addChunk (node:internal/streams/readable:315:12) at readableAddChunk (node:internal/streams/readable:289:9) Emitted 'error' event on Parser instance at: at Parser.onerror (node:internal/streams/readable:773:14) at Parser.emit (node:events:513:28) at emitErrorNT (node:internal/streams/destroy:157:8) at emitErrorCloseNT (node:internal/streams/destroy:122:3) at processTicksAndRejections (node:internal/process/task_queues:83:21) { code: 'INVALID_OPENING_QUOTE',

However, it seems that the example above is compliant with the CSV format according to the rfc : https://www.ietf.org/rfc/rfc4180.txt image

Steps to Reproduce

  1. Use a csv file with records like the following containing : id;id_fantoir;numero;rep;nom_voie;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;type_position;alias;nom_ld;libelle_acheminement;nom_afnor;source_position;source_nom_voie;certification_commune;cad_parcelles 02727_b017_00001;02727_B017;1;;"Lieu dit ""Entre deux Villes""";02270;02727;Sons-et-Ronchères;;;749035.43;6962499.63;3.680096;49.759545;logement;;;SONS ET RONCHERES;LIEU DIT ENTRE DEUX VILLES ;commune;commune;1;02727000ZB0042
  2. Launch elasticsearch : pelias elastic start
  3. Create the elasticsearch index : pelias elastic create
  4. Launch the import job : pelias import csv

Environment (please complete the following information): Linux debian-like

References

A similar issue seems to have been fixed in pelias/transit with the help of the "relax" parameter from csv-parse : https://github.com/pelias/transit/pull/46 And it seems the parameter looks already in place into the csv-importer/lib/streams/recordStream.js file : 49: relax: true,

But do you think the solution could be to rename the parameter from relax to relax_quotes in the csv-parser instance according to the following ? https://stackoverflow.com/questions/70880341/csv-parse-is-throwing-invalid-opening-quote-a-quote-is-found-inside-a-field-at https://stackoverflow.com/questions/73769717/csverror-invalid-opening-quote-a-quote-is-found-inside-a-field-at-line-9618?rq=3

Thanks a lot for your help!

missinglink commented 1 week ago

We are using a well tested csv library which seems to support this by default: https://csv.js.org/parse/options/escape/

The file seems to be semi-colon delimited rather than comma-delimited, I don't recall if that is detected/supported.

Could you please upload the CSV file or email it to me so I can see what exactly is causing it?

missinglink commented 1 week ago

Ideally we could add a small unit test which triggers this behavior.

nick-rv commented 1 week ago

Could you please upload the CSV file or email it to me so I can see what exactly is causing it?

Sure, here is the download link : https://adresse.data.gouv.fr/data/ban/adresses/latest/csv/adresses-france.csv.gz What catches my attention is that the first double-quote character inside the file seems to cause the error.

Thanks a lot!

missinglink commented 1 week ago

I agree the file seems valid according to a couple of open-source tools written in different languages:

xsv count -d ";" adresses-france.csv
26049046
csvlint --delimiter=';' adresses-france.csv
Warning: not using defaults, may not validate CSV to RFC 4180
file is valid
missinglink commented 1 week ago

Although both tools fail with similar errors to OP when incorrectly specifying the delimiter:

xsv count adresses-france.csv
CSV error: record 86898 (line: 86899, byte: 16594179): found record with 3 fields, but the previous record has 1 fields
/tmp # csvlint adresses-france.csv
Record #86898 has error: wrong number of fields
Record #86899 has error: wrong number of fields
Record #86900 has error: wrong number of fields
Record #86901 has error: wrong number of fields
Record #86902 has error: wrong number of fields
Record #86903 has error: wrong number of fields
Record #86904 has error: wrong number of fields
Record #86905 has error: wrong number of fields
Record #86906 has error: wrong number of fields
Record #86907 has error: wrong number of fields
Record #86908 has error: wrong number of fields
Record #86909 has error: wrong number of fields
Record #86910 has error: wrong number of fields
Record #86911 has error: wrong number of fields
Record #86912 has error: wrong number of fields
Record #258188 has error: bare " in non-quoted-field

unable to parse any further
nick-rv commented 1 week ago

Moreover it seems that the single quote character (apostrophe is frequent in french) brings the same error.

Logs from pelias import csv :

CsvError: Invalid Opening Quote: a quote is found on field "id;nom_lieu_dit;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;source_position;source_nom_voie" at lin e 920, value is "01307_11cqwi;" at Object.parse (/code/pelias/csv-importer/node_modules/csv-parse/dist/cjs/index.cjs:791:21) at Parser._transform (/code/pelias/csv-importer/node_modules/csv-parse/dist/cjs/index.cjs:1333:26) at Parser.Transform._write (node:internal/streams/transform:205:23) at writeOrBuffer (node:internal/streams/writable:391:12) at _write (node:internal/streams/writable:332:10) at Parser.Writable.write (node:internal/streams/writable:336:10) at ReadStream.ondata (node:internal/streams/readable:754:22) at ReadStream.emit (node:events:513:28) at addChunk (node:internal/streams/readable:315:12) at readableAddChunk (node:internal/streams/readable:289:9) Emitted 'error' event on Parser instance at: at Parser.onerror (node:internal/streams/readable:773:14) at Parser.emit (node:events:513:28) at emitErrorNT (node:internal/streams/destroy:157:8) at emitErrorCloseNT (node:internal/streams/destroy:122:3) at processTicksAndRejections (node:internal/process/task_queues:83:21) { code: 'INVALID_OPENING_QUOTE', bytes: 96704, comment_lines: 0, empty_lines: 0, invalid_field_length: 1, lines: 920, records: 918, columns: [ { name: 'id;nom_lieu_dit;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;source_position;source_nom_voie' } ], error: undefined, header: false, index: 0, raw: undefined, column: 'id;nom_lieu_dit;code_postal;code_insee;nom_commune;code_insee_ancienne_commune;nom_ancienne_commune;x;y;lon;lat;source_position;source_nom_voie', quoting: false, field: '01307_11cqwi;' }

Example values:

01004_zdi7me_00001;;1;;Rue des Combattants d'Indochine 1945 1954;01500;01004;Ambérieu-en-Bugey;;;882824.27;6542190.48;5.3607;45.954827;entrée;;;AMBERIEU-EN-BUGEY;RUE C INDOCHINE 1945 1954;commune;commune;0; 01004_zdi7me_00003;;3;;Rue des Combattants d'Indochine 1945 1954;01500;01004;Ambérieu-en-Bugey;;;882814.23;6542181.63;5.360567;45.95475;entrée;;;AMBERIEU-EN-BUGEY;RUE C INDOCHINE 1945 1954;commune;commune;0; 01004_zdi7me_00005;;5;;Rue des Combattants d'Indochine 1945 1954;01500;01004;Ambérieu-en-Bugey;;;882799.02;6542169.29;5.360366;45.954643;segment;;;AMBERIEU-EN-BUGEY;RUE C INDOCHINE 1945 1954;commune;commune;0; 01004_zdi7me_00007;;7;;Rue des Combattants d'Indochine 1945 1954;01500;01004;Ambérieu-en-Bugey;;;882796.33;6542165.98;5.36033;45.954614;entrée;;;AMBERIEU-EN-BUGEY;RUE C INDOCHINE 1945 1954;commune;commune;0;

Do you think that allowing to redefine the field delimiter could allow to overcome this kind of issues ? https://csv.js.org/parse/options/delimiter/

Thanks!

missinglink commented 1 week ago

Ok, so moving forward, if this is an actual bug, it's going to lie within the CSV parser library rather than Pelias, so there's little we can do on our end except modify the config.

I'm still suspecting the semi-colons, I don't find anywhere in the docs saying that is supported, so it's just an assumption at this stage that the library will auto-detect the delimiter, I also didn't find any mention of that in their docs.

So, as a next step, can you please rewrite the semi-colon delimited file as comma delimited and try that? Please report back if the error persists.

Something like this:

zcat adresses-france.csv.gz | xsv input -d ';' | gzip -9 > adresses-france.commas.csv.gz

zcat adresses-france.commas.csv.gz | head -n2
id,id_fantoir,numero,rep,nom_voie,code_postal,code_insee,nom_commune,code_insee_ancienne_commune,nom_ancienne_commune,x,y,lon,lat,type_position,alias,nom_ld,libelle_acheminement,nom_afnor,source_position,source_nom_voie,certification_commune,cad_parcelles
01001_4b50r5_00630,,630,,la Chèvre,01400,01001,L'Abergement-Clémenciat,,,847780.79,6560977.36,4.914282,46.132481,segment,,,L'ABERGEMENT-CLEMENCIAT,LA CHEVRE,inconnue,inconnue,0,
missinglink commented 1 week ago

Do you think that allowing to redefine the field delimiter could allow to overcome this kind of issues ?

Yes possibly, if you're able to resolve the issue by manually rewriting the delimiters then we can consider how we might make that configurable.

missinglink commented 1 week ago

IMO this convention is a bit silly and confusing, it's common in Europe simply because the comma is used instead of the period to separate the Euro from the Cent.

My bank does it to my statements here in Germany and I understand why, but it's confusing that the file extension is still .csv & seems to have been adopted by Government officials without consideration for how the consuming program is supposed to detect the format.

For tab separated files there is a growing use of .tsv yet nothing AFAIK for semi-delim files 🤷‍♂️ 😢

orangejulius commented 1 week ago

Hey folks

I believe the issue here is that delimiters must be specified when configuring the CSV parsing library we are using, as shown in their docs. We don't set the delimiter when configuring, though I think it would be possible.

The only thing that might be tricky is that we would have to add some way in our pelias.json config to both specify multiple CSV files (already possible, though usually a directory is given and all CSV files within the directory are imported), and to set a delimiter for each one. That's a lot of new configuration we don't currently have.

I could see us theoretically adding support for this, though please discuss the potential format with us before doing any work on a PR.

That said, I think the easiest practical way to proceed is to convert your file to comma delimited. Assuming any commas in your actual data are quoted appropriately, everything should work.

nick-rv commented 1 week ago

Thanks for your support @missinglink @orangejulius , once the separator have been switched to commas the import occurred successfully.

As the separator seems to be redefinable when instantiating csv-parser, it may look interesting to bring this attribute configurable. Because i guess that it could ease data indexation among the numerous languages and address types that pelias may be facing.

For example, a french address can usually contain commas to separate elements like the house number, the street name and postal code: "10, Avenue des Champs-Elysées, 75008 Paris"

Thanks again!

missinglink commented 1 week ago

Good to hear you got it working.

Let's close this issue and open a new one to discuss some way of configuring the delimiter.

nick-rv commented 1 week ago

The feature request is open : https://github.com/pelias/csv-importer/issues/111 Thanks!