Some realistic tabular datasets for testing (CSV)
The datasets are gzipped. Under Linux and macOS, you can decompress them with the gunzip program; macOS users should also be able to just double-click on the files. Windows users can use 7-Zip.
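If you would rather not decompress the files on disk, they can also be read directly. The following is a minimal Python sketch using only the standard library. The function name read_gzipped_csv and the UTF-8 encoding are assumptions made for illustration; the files themselves do not document their encoding.

```python
import csv
import gzip


def read_gzipped_csv(path, limit=None, encoding="utf-8"):
    """Return rows (lists of strings) from a gzipped CSV file.

    `limit` caps the number of rows read, which is handy for a quick look
    at a large file. The default encoding is an assumption.
    """
    rows = []
    # "rt" opens the compressed stream in text mode; newline="" lets the
    # csv module handle line endings itself.
    with gzip.open(path, "rt", newline="", encoding=encoding) as f:
        for i, row in enumerate(csv.reader(f)):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows
```

For example, read_gzipped_csv("census-income.data.gz", limit=5) should return the first five records, assuming the file is comma-separated.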
These data sets have been used in several academic papers.
File: census-income.data.gz 5.7MB
Census-Income is a relatively small data set, with 100 MB of data and 199 523 records. However, it has 42 columns, and one column has a very high relative cardinality (99 800 distinct values).
We include a subset (census-income.data.d241850.csv.gz) made of 4 columns: age, wage per hour, dividends from stocks and a numerical value found in the 25th position of the original data set. The respective cardinalities are 91, 1 240, 1 478 and 99 800.
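Extracting a subset of columns and counting the distinct values per column can be sketched as follows. This is an illustration rather than the script used to build the subset: the helper names project and column_cardinalities are hypothetical, and the zero-based column indices you pass in are up to you.

```python
def project(rows, indices):
    """Keep only the columns at the given zero-based positions."""
    return [[row[i] for i in indices] for row in rows]


def column_cardinalities(rows):
    """Return the number of distinct values found in each column."""
    distinct = None
    for row in rows:
        if distinct is None:
            # One set per column, sized from the first row.
            distinct = [set() for _ in row]
        for seen, value in zip(distinct, row):
            seen.add(value)
    return [len(seen) for seen in distinct] if distinct else []


# Tiny made-up table, only to show the calls.
sample = [["a", "1", "x"], ["b", "1", "x"], ["a", "2", "x"]]
subset = project(sample, [0, 2])
counts = column_cardinalities(sample)
```

Values are compared as strings here, so "1" and "1.0" would count as two distinct values.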
Source:
File: census1881.csv.gz 33MB
Census 1881 comes from the Canadian census of 1881 and has over 4 million records. We converted a publicly available SPSS file (1881 sept2008 SPSS.rar) to a flat file. In the process, we replaced the special values “ditto” and “do.” by the repeated value, and we deleted all commas within values. The column cardinalities are 183, 2 127, 2 795, 8 837, 24 278, 152 365 and 152 882.
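The cleaning step described above might look like the following sketch. It rests on two assumptions that the description does not confirm: that the "repeated value" is the value in the same column of the previous record, and that the markers should be matched case-insensitively. The name clean_rows is hypothetical.

```python
DITTO_MARKERS = {"ditto", "do."}


def clean_rows(rows):
    """Yield rows with ditto markers expanded and commas removed.

    Assumption: a marker stands for the value in the same column of the
    previous (already cleaned) row.
    """
    previous = None
    for row in rows:
        cleaned = []
        for i, value in enumerate(row):
            is_marker = value.strip().lower() in DITTO_MARKERS
            if is_marker and previous is not None and i < len(previous):
                value = previous[i]
            # Remove commas so that values cannot be mistaken for
            # several CSV fields.
            cleaned.append(value.replace(",", ""))
        previous = cleaned
        yield cleaned


# Made-up records, only to show the behaviour.
sample = [["Smith, John", "Ontario"], ["ditto", "do."], ["Roy", "Do."]]
result = list(clean_rows(sample))
```

A marker in the very first row has nothing to repeat and is left unchanged.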
Source:
File: weather_sept_85.csv.gz 15MB
This data set consists of surface synoptic weather reports from land stations for September 1985.
Source:
File: wikileaks-noquotes.csv.gz 5.9MB
The Wikileaks table was created from a public repository published by Google and it contains the non-classified metadata related to leaked diplomatic cables. We extracted 4 columns: year, time, place and descriptive code. It has 1 178 559 records. Our Wikileaks table has column cardinalities 273, 1 440, 3 935 and 4 865.
Source:
File: census-income_srt.csv.gz
File: wikileaks-noquotes_srt.csv.gz
File: weather_sept_85_srt.csv.gz
File: census1881_srt.csv.gz
These files are sorted versions of the tables above. We sorted the tables lexicographically, with the smallest-cardinality column being the primary sort key, the next-smallest-cardinality column being the secondary sort key, and so forth.
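As a rough illustration of this ordering, here is a minimal Python sketch. It assumes that values are compared as strings, that ties in cardinality are broken by column position, and that columns stay in their original positions in the output; none of these details is stated above. The name sort_by_cardinality is hypothetical.

```python
def sort_by_cardinality(rows):
    """Sort rows lexicographically, lowest-cardinality column first."""
    if not rows:
        return []
    width = len(rows[0])
    # Number of distinct values in each column.
    cardinality = [len({row[i] for row in rows}) for i in range(width)]
    # Column indices from smallest to largest cardinality; sorted() is
    # stable, so equal cardinalities keep their original column order.
    order = sorted(range(width), key=lambda i: cardinality[i])
    # The sort key lists the row's values in that column order.
    return sorted(rows, key=lambda row: tuple(row[i] for i in order))


# Made-up table: cardinalities are 3, 2 and 1, so the third column is the
# primary key, the second the secondary key and the first the last key.
sample = [["b", "1", "x"], ["a", "2", "x"], ["c", "1", "x"]]
ordered = sort_by_cardinality(sample)
```

This version holds the whole table in memory, which may be impractical for the largest files.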
References:
If you just want short tabular datasets for machine learning purposes, there are good choices elsewhere, such as the adult data set.
The Web Table Corpora are also interesting.
See Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018.