umd-mith / extremist-files

JSON and CSV for Southern Poverty Law Center's Hate Map

duplicate ids in block list #2

Open jsonstein opened 7 years ago

jsonstein commented 7 years ago

there appear to be seven (7) records which have duplicate entries:

as revealed by the command:

cat splc-blocklist.csv | sort | uniq -d

and as counted by the command:

cat splc-blocklist.csv | sort | uniq -d | wc -l
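To see how often each duplicated id occurs, and to produce a de-duplicated copy, something like the following might work. This is only a sketch: it assumes the file holds one id per line with no embedded whitespace, and it uses a small made-up sample in place of splc-blocklist.csv.

```shell
# Hypothetical sample standing in for splc-blocklist.csv (one id per line).
tmp=$(mktemp)
printf '%s\n' 111 222 111 333 222 111 > "$tmp"

# Count occurrences of each line and keep only those seen more than once.
dupes=$(sort "$tmp" | uniq -c | awk '$1 > 1 {print $2 " appears " $1 " times"}')
echo "$dupes"

# sort -u keeps a single copy of each line, giving a de-duplicated list.
deduped=$(sort -u "$tmp")
echo "$deduped"

rm -f "$tmp"
```

Against the real file, the same pipelines would take splc-blocklist.csv as input instead of the temporary sample.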
edsu commented 7 years ago

Interesting, the usernames are coming directly from our spreadsheet where there are in fact some duplicate names:

curl --silent 'https://docs.google.com/a/umd.edu/spreadsheets/d/1LsJHAdSexX4yoYq_Pgfb7XWZgRmBuCcS-7QEETfHxlA/export?format=csv' | csvcut -c Twitter | sort | uniq -d

?
https://twitter.com/RoperBilly
https://twitter.com/aryan_brother
https://twitter.com/kevin_a_strom
https://twitter.com/nsm88
https://twitter.com/worldnetdaily
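To find where the repeats sit in the spreadsheet, a rough sketch like the one below could report the row numbers for each duplicated value. It assumes the column has already been extracted to one value per line with a header row (as csvcut would produce), treats "?" as a placeholder to ignore, and uses invented sample data; `example_a` is a hypothetical handle.

```shell
# Hypothetical stand-in for the extracted Twitter column (header plus one value per line).
column=$(printf '%s\n' Twitter '?' https://twitter.com/nsm88 '?' \
  https://twitter.com/example_a https://twitter.com/nsm88)

# Record the row numbers for each value, skipping the header and the "?" placeholder,
# then print only the values that were seen more than once.
report=$(printf '%s\n' "$column" | awk '
  NR > 1 && $0 != "?" { rows[$0] = rows[$0] " " NR; n[$0]++ }
  END { for (v in n) if (n[v] > 1) print v ":" rows[v] }')
echo "$report"
```

The row numbers count the header as row 1, so they should line up with the spreadsheet's own row numbering.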
jsonstein commented 7 years ago

you might also find these tools to be useful

"These tools were developed for working with reasonably large data files. Perhaps larger than ideal for direct use in an application like R, but not so big as to necessitate moving to Hadoop or similar distributed compute environments"

and they are quite fast

https://github.com/eBay/tsv-utils-dlang

(I use csv2tsv regularly to clean cruft)

I have also been using string-based data munging exercises as a way to better learn the D programming language, and you may (or may not ;^) find either tsv2json or prettyprintJSON useful from here (must add CC-attrib license statement):

https://github.com/jsonstein/tsv2json

jeffs

Jeff Sonstein Assoc. Prof. (ret'd) College of Computing, R.I.T.

On Oct 27, 2016, at 12:47 PM, Ed Summers notifications@github.com wrote:

Interesting, the usernames are coming directly from our spreadsheet where there are in fact some duplicate names:

curl --silent 'https://docs.google.com/a/umd.edu/spreadsheets/d/1LsJHAdSexX4yoYq_Pgfb7XWZgRmBuCcS-7QEETfHxlA/export?format=csv' | csvcut -c Twitter | sort | uniq -d

?
https://twitter.com/RoperBilly
https://twitter.com/aryan_brother
https://twitter.com/kevin_a_strom
https://twitter.com/nsm88
https://twitter.com/worldnetdaily