Open ewg118 opened 7 years ago
The problem is where one line spans multiple lines. So there is an erroneous line break with in a single CSV entry. I created this perl script to fix the problem and merge the multiple lines in a single line. This works on Linux but I haven't tested this on any other operating system so YMMV.
#!/usr/bin/perl
use strict;
use warnings;
if ($#ARGV != 0) {
print "\nSpecify filename\n";
exit;
}
open my $fh, "<", $ARGV[0];
while (my $row = <$fh> ) {
$row=~ s/\r//g;
$row=~ s/\n//g;
if($row =~ /^.*NY\"$/) {
printf("%s\n",$row);
}
else {
printf("%s",$row);
}
}
@msimos, I haven't investigated too deeply, but I think your script has some problems. I found the second line gets concatenated onto the end of the first line with the header. I suspect that there might be other issues as well. I'm on a Linux system using d82dd833ba7913a127f5449a760e9806a72684ce.
Here's my take (taken from here, AGPLv3):
#!/usr/bin/python
import re
import csv
import sys
if len(sys.argv)<2:
print "provide file"
sys.exit(0)
if sys.argv[1] == "-":
fp = sys.stdin
else:
fp = open(sys.argv[1])
header_count = 0
overflow = False
cur_count=0
cur_row = []
reader = csv.reader(fp, delimiter=',')
for row in reader:
if header_count==0:
header_count = len(row)
cur_count+=len(row)
for r in row:
z = re.sub("\n", "\\\\n", r)
z = re.sub("\t", "\\\\t", z)
cur_row.append(z)
if cur_count!=header_count:
overflow = True
else:
overflow =False
print "\t".join(cur_row)
cur_count=0
cur_row = []
This will change from a comma delimited file to a tab delimited one and should escape new lines and tabs with "C-style" escaping, creating backslashed escaped characters (for example, new lines would become the literal string '\n'). It doesn't handle carriage returns but that could easily be added.
I also had some problems with weird characters showing up in the original file. For example, I see <U+FEFF>
as the first character. I'm not sure if this has a big effect but I decided to take them out with the following (taken from an SO answer):
./fmt_lines.py MetObjects.csv > MetObjects-clean.csv
LANG=C sed -i 's/[\d128-\d255]//g' MetObjects-clean.csv
@abetusk I forgot to mention I removed the header since I didn't need it for my use case. So that's why you found the first line concatenated to the second line. I did a line count of your script vs mine and they seem to come out the same (minus the header). Thanks for sharing!
@msiomos, yours is a lot faster because it doesn't parse the whole line, like mine does. csvtool
isn't able to parse any version of the file so it'll take more work to figure out if both files are actually the same. Thank you for sharing!
Some lines appear to be truncated (see 39.40.127.61), which is causing problems for viewing the CSV in spreadsheet software. The truncation typically occurs in the dimensions or medium column.