metmuseum / openaccess

The Metropolitan Museum of Art's Open Access Initiative
Creative Commons Zero v1.0 Universal
1.18k stars 212 forks source link

Some lines truncated #3

Open ewg118 opened 7 years ago

ewg118 commented 7 years ago

Some lines appear to be truncated (see 39.40.127.61), which is causing problems for viewing the CSV in spreadsheet software. The truncation typically occurs in the dimensions or medium column.

msimos commented 7 years ago

The problem is where one line spans multiple lines. So there is an erroneous line break with in a single CSV entry. I created this perl script to fix the problem and merge the multiple lines in a single line. This works on Linux but I haven't tested this on any other operating system so YMMV.

#!/usr/bin/perl

use strict;
use warnings;

if ($#ARGV != 0) {
     print "\nSpecify filename\n";
         exit;
}

open my $fh, "<", $ARGV[0];
while (my $row = <$fh> ) {
   $row=~ s/\r//g;
   $row=~ s/\n//g;
   if($row =~ /^.*NY\"$/) {
      printf("%s\n",$row);
   } 
   else {
      printf("%s",$row);
   }
}
abetusk commented 7 years ago

@msimos, I haven't investigated too deeply, but I think your script has some problems. I found the second line gets concatenated onto the end of the first line with the header. I suspect that there might be other issues as well. I'm on a Linux system using d82dd833ba7913a127f5449a760e9806a72684ce.

Here's my take (taken from here, AGPLv3):

#!/usr/bin/python

import re
import csv
import sys

if len(sys.argv)<2:
  print "provide file"
  sys.exit(0)

if sys.argv[1] == "-":
  fp = sys.stdin
else:
  fp = open(sys.argv[1])

header_count = 0
overflow = False
cur_count=0
cur_row = []

reader = csv.reader(fp, delimiter=',')
for row in reader:
  if header_count==0:
    header_count = len(row)

  cur_count+=len(row)

  for r in row:
    z = re.sub("\n", "\\\\n", r)
    z = re.sub("\t", "\\\\t", z)
    cur_row.append(z)

  if cur_count!=header_count:
    overflow = True
  else:
    overflow =False
    print "\t".join(cur_row)
    cur_count=0
    cur_row = []

This will change from a comma delimited file to a tab delimited one and should escape new lines and tabs with "C-style" escaping, creating backslashed escaped characters (for example, new lines would become the literal string '\n'). It doesn't handle carriage returns but that could easily be added.

I also had some problems with weird characters showing up in the original file. For example, I see <U+FEFF> as the first character. I'm not sure if this has a big effect but I decided to take them out with the following (taken from an SO answer):

./fmt_lines.py MetObjects.csv > MetObjects-clean.csv
LANG=C sed -i 's/[\d128-\d255]//g' MetObjects-clean.csv
msimos commented 7 years ago

@abetusk I forgot to mention I removed the header since I didn't need it for my use case. So that's why you found the first line concatenated to the second line. I did a line count of your script vs mine and they seem to come out the same (minus the header). Thanks for sharing!

abetusk commented 7 years ago

@msiomos, yours is a lot faster because it doesn't parse the whole line, like mine does. csvtool isn't able to parse any version of the file so it'll take more work to figure out if both files are actually the same. Thank you for sharing!