unsplash / datasets

šŸŽ 5,400,000+ Unsplash images made available for research and machine learning
https://unsplash.com/data
2.4k stars 117 forks source link

Clarification of embedded newlines in TSV files #29

Closed copiousfreetime closed 4 years ago

copiousfreetime commented 4 years ago

The keywords data file appears to have an embedded newline in one of the records. I just want to clarify if this is expected or not. It looks like the given psql loading instructions do account for newlines in the TSV file, but if folks are processing the file outside of that without using quote-escaping rules they may process the data incorrectly.

To Reproduce

% wc -l *.tsv000
1646598 collections.tsv000
4075505 conversions.tsv000
2689741 keywords.tsv000
25001 photos.tsv000
8436845 total

Load the according to the documented instructions:

 % psql -h localhost -U jeremy -d unsplash_lite -f load-data-client.sql
    COPY 25000
    COPY 2689739 # <-- Hmm.. this one is NOT 1 less than keywords.tsv000 above
    COPY 1646597
    COPY 4075504

Check the db row count

unsplash_lite=# select count(*) from unsplash_keywords;
  count
---------
 2689739
(1 row)

Expected behavior

I initially expected there to be 1 record for each non-header line of TSV, this appears to be an incorrect assumption. It looks like the psql commandline parsed the TSV according to quoted escape rules, so that is good.

I wrote a program to check the keywords file and it reports

% ruby check-tsv.rb keywords.tsv000
Headers: photo_id -- keyword -- ai_service_1_confidence -- ai_service_2_confidence -- suggested_by_user
[1590611 - PF4s20KB678-"fujisan] parts count 2 != 5
[1590612 - mount fuji"-] parts count 4 != 5
lines in file   : 2689741
data lines      : 2689740
unique row count: 2689740

Then looking at the lines around line 1590610 we see:

 % sed -n '1590610,1590615p' keywords.tsv000
PF4s20KB678     night   22.3271160125732                f
PF4s20KB678     "fujisan
mount fuji"                     t
PF4s20KB678     pier    22.6900939941406                f
PF4s20KB678     viaduct 30.6490669250488                f
PF4s20KB678     architecture    33.084938049316399              f

And the db reports that row and the preceding and following rows correctly loaded.

unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%fujisan%';
  photo_id   |  keyword   | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+------------+-------------------------+-------------------------+-------------------
 PF4s20KB678 | fujisan   +|                         |                         | t
             | mount fuji |                         |                         |
(1 row)
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%pier%';
  photo_id   | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
 PF4s20KB678 | pier    |        22.6900939941406 |                         | f
(1 row)

unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%night%';
  photo_id   | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
 PF4s20KB678 | night   |        22.3271160125732 |                         | f

If folks are processing these TSV simplistically without using quote-escaping logic then they may process the files incorrectly. I don't want folks to encounter that. And maybe this points to and upstream data input issue, if users are entering newlines in the keyword input - how are they getting processed in the main app.

We may just want to document that there can be embedded newlines in the TSV files.

Thanks!

TimmyCarbone commented 4 years ago

Thanks a lot for the detailed troubleshooting!

I think we definitely need some cleanup here. I don't think a newline character is to be expected in the keywords although it's understandable how and why it can be there (this is a keyword suggested by a user, so user input).

I'll make sure we do some cleanup and maybe just transform newline characters into spaces for the next version. šŸ‘