ukwa / w3act

w3act is an annotation and curation tool for building web archive collections
Apache License 2.0
19 stars 6 forks source link

Replace comma in CSV exports with a tab #681

Closed anjackson closed 1 year ago

anjackson commented 2 years ago

The current export code writes out CSV, but does not handle escaping of commas at all! To fix this properly, a CSV writing library should be used, and all the CSV export code re-written.

Alternatively, while the export code is not well-written in general, it does at least appear to use a specific variable for the CSV SEPARATOR, rather than having commas everywhere:

https://github.com/ukwa/w3act/blob/6ad35fff95854edd847c279904534938ff73295d/app/uk/bl/Const.java#L358-L359

As an interim measure, we could just replace that with a \t tab. That should fix all CSV outputs immediately, as long as everyone who uses those outputs is aware of the change.

min2ha commented 1 year ago

As per RFC 4180:

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

Examples how it should look (as per the CSV spec): "Anglo Nubian Goat Society, The" Bat Ecology and Bioacoustics Lab

Alternatively, escape function seems to work fine

Actually not sure if that's the same issue, but expectation was to fix the bug there the title (if it has comma) been splited and drifted into other column

crarugal commented 1 year ago

testing: https://dev.webarchive.org.uk/act/targets/list, looking at "100 Best Sites" collection: image

Original csv export: image

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

nid | title | field_url | author | field_crawl_frequency | created |   |   -- | -- | -- | -- | -- | -- | -- | -- 2902 | 38 Degrees | https://speakout.38degrees.org.uk/ | https://you.38degrees.org.uk/ | http://www.38degrees.org.uk/ | Nicola Bingham | QUARTERLY | 2013-07-22T15:02:52Z 2963 | Air BnB | https://www.airbnb.co.uk/ | https://www.airbnb.co.uk/sitemaps | Nicola Bingham | MONTHLY | 2013-07-24T13:30:41Z 2984 | Amazon | https://www.amazon.co.uk/ | https://www.amazon.co.uk/gp/site-directory?ie=UTF8&ref_=nav_shopall_fullstore | Nicola Bingham | DAILY | 2013-07-25T13:02:14Z 2904 | Anarchist Federation | http://www.afed.org.uk/ | Nicola Bingham | SIXMONTHLY | 2013-07-22T15:15:47Z 2983 | Argos.co.uk | http://www.argos.co.uk | Nicola Bingham | WEEKLY | 2013-07-25T12:57:13Z 2974 | Artsthread | http://www.artsthread.com/ | Nicola Bingham | QUARTERLY | 2013-07-25T09:30:46Z

Dev csv export: image

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

nid title field_url author field_crawl_frequency created |   |   -- | -- | -- 2902 38 Degrees https://speakout.38degrees.org.uk/ | https://you.38degrees.org.uk/ | http://www.38degrees.org.uk/ Nicola Bingham QUARTERLY 2013-07-22T15:02:52Z |   2963 Air BnB https://www.airbnb.co.uk/ | https://www.airbnb.co.uk/sitemaps Nicola Bingham WEEKLY 2013-07-24T13:30:41Z |   2984 Amazon http://www.amazon.co.uk/ | https://www.amazon.co.uk/gp/site-directory?ie=UTF8&ref_=nav_shopall_fullstore Nicola Bingham DAILY 2013-07-25T13:02:14Z |   2904 Anarchist Federation http://www.afed.org.uk/ Nicola Bingham SIXMONTHLY 2013-07-22T15:15:47Z |   |   2983 Argos.co.uk http://www.argos.co.uk Nicola Bingham WEEKLY 2013-07-25T12:57:13Z |   |   2974 Artsthread http://www.artsthread.com/ Nicola Bingham QUARTERLY 2013-07-25T09:30:46Z |   |  

Tabbed export fails

crarugal commented 1 year ago

Just realised it exported as .csv, which is why it failed to parse as expected. Saving the file as .tsv works: image

<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

nid | title | field_url | author | field_crawl_frequency | created |   -- | -- | -- | -- | -- | -- | -- 2902 | 38 Degrees | https://speakout.38degrees.org.uk/,https://you.38degrees.org.uk/,http://www.38degrees.org.uk/ | Nicola Bingham | QUARTERLY | 2013-07-22T15:02:52Z 2963 | Air BnB | https://www.airbnb.co.uk/,https://www.airbnb.co.uk/sitemaps | Nicola Bingham | WEEKLY | 2013-07-24T13:30:41Z 2984 | Amazon | http://www.amazon.co.uk/,https://www.amazon.co.uk/gp/site-directory?ie=UTF8&ref_=nav_shopall_fullstore | Nicola Bingham | DAILY | 2013-07-25T13:02:14Z 2904 | Anarchist Federation | http://www.afed.org.uk/ | Nicola Bingham | SIXMONTHLY | 2013-07-22T15:15:47Z 2983 | Argos.co.uk | http://www.argos.co.uk | Nicola Bingham | WEEKLY | 2013-07-25T12:57:13Z 2974 | Artsthread | http://www.artsthread.com/ | Nicola Bingham | QUARTERLY | 2013-07-25T09:30:46Z 3002 | Barbican | http://www.barbican.org.uk/,https://www.barbican.org.uk/sitemap.xml?page=1,https://www.barbican.org.uk/sitemap.xml?page=2,https://www.barbican.org.uk/sitemap.xml?page=3 | admin | MONTHLY | 2013-07-25T21:06:20Z

We might have to export as .tsv as default, instead of .csv

nicolabingham commented 1 year ago

Works on dev

crarugal commented 1 year ago

I exported image Which is automatically saved in download as: image

However, this format still causes issues. Trying to import to excel, in original file format:

Comma separated image

Tab separated: image

The previous solution worked well, but needed to be exported as .TSV

crarugal commented 1 year ago

Exporting this collection to test outputs: image

Exporting as CSV: I'm not sure if this is desired behaviour, but opening up in excel, we can see that the data doesn't match the headers: image

Exporting as TSV: this works as expected image

Exporting as JSON: formatted as expected: image

crarugal commented 1 year ago

The exports are really good, the only thing I'm not sure of is the CSV export. It works as expected, but perhaps it can never be parsed as we would hope, due to the multiple secondary seeds that are separated by commas. I think this should still pass as a fix because the TSV and JSON exports are better formats which work. Trying the exports on large collections, such as Brexit, also works perfectly. Thanks, @min2ha

Brexit TSV: image

Brexit JSON: image