vega / vega-datasets

Common repository for example datasets used by Vega-related projects

Flights 3m only has 200k rows #607

Open domoritz opened 2 weeks ago

domoritz commented 2 weeks ago

https://github.com/vega/vega-datasets/blob/main/data/flights-3m.csv seems to have only about 200k rows, not the 3 million its name suggests.

wc -l flights-3m.csv
  231084 flights-3m.csv
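
Since `wc -l` counts lines, the figure above presumably includes the header line. A minimal sketch for counting only the data records, assuming the file has a single header row and using only the Python standard library (`count_data_rows` is a hypothetical helper name, not part of any Vega tooling):

```python
import csv


def count_data_rows(path, has_header=True):
    """Count non-empty CSV records, excluding the header row if present."""
    with open(path, newline="", encoding="utf-8") as f:
        # csv.reader yields an empty list for blank lines, so skip those.
        n = sum(1 for row in csv.reader(f) if row)
    return max(n - 1, 0) if has_header else n


# Example call (assumes a local copy of the file):
# count_data_rows("flights-3m.csv")
```

Unlike `wc -l`, this also handles quoted fields that contain newlines, although that may not matter for this file.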

Added in https://github.com/vega/vega-datasets/commit/1e70098e5c15069314a1be82a37c82c0fbb5f66f by @arvind

dsmedia commented 2 weeks ago

Looks like the count in flights_200k may also be off.

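from vega_datasets import data

# Names of the flights dataset accessors in the vega_datasets package
datasets = ['flights_2k', 'flights_5k', 'flights_10k', 'flights_20k', 'flights_200k', 'flights_3m']

for dataset_name in datasets:
    # Calling the accessor loads the dataset; len() gives its row count
    dataset = getattr(data, dataset_name)()
    row_count = len(dataset)
    print(f"{dataset_name}: {row_count} rows")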

Results:

flights_2k: 2000 rows
flights_5k: 5000 rows
flights_10k: 10000 rows
flights_20k: 20000 rows
flights_200k: 231083 rows
flights_3m: 231083 rows
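
To flag mismatches like these automatically, a rough sketch could compare each observed count against the size implied by the dataset name. This assumes the `k` and `m` suffixes stand for thousands and millions of rows; `expected_rows` and `mismatches` are hypothetical helper names, and the counts are copied from the results above.

```python
import re

# Assumption: "k" means thousands of rows and "m" means millions.
_MULTIPLIER = {"k": 1_000, "m": 1_000_000}


def expected_rows(dataset_name):
    """Parse the row count implied by a name such as 'flights_200k'."""
    match = re.search(r"[_-](\d+)([km])$", dataset_name)
    if match is None:
        raise ValueError(f"no size suffix in {dataset_name!r}")
    return int(match.group(1)) * _MULTIPLIER[match.group(2)]


def mismatches(observed):
    """Return {name: (expected, actual)} for datasets whose counts differ."""
    return {
        name: (expected_rows(name), actual)
        for name, actual in observed.items()
        if expected_rows(name) != actual
    }


# Row counts reported in the results above.
observed = {
    "flights_2k": 2000,
    "flights_5k": 5000,
    "flights_10k": 10000,
    "flights_20k": 20000,
    "flights_200k": 231083,
    "flights_3m": 231083,
}

print(mismatches(observed))
# {'flights_200k': (200000, 231083), 'flights_3m': (3000000, 231083)}
```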

Should we regenerate the 3m rows using this script, create a CSV from the 3m parquet file here, or do something else?