This is the official repository of our NeurIPS 2023 Datasets and Benchmarks Track paper LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting. LargeST comprises four sub-datasets, each characterized by a different number of sensors. The largest one is California (CA), which contains a total of 8,600 sensors. We also construct three subsets of CA by selecting three representative areas within the state, forming the sub-datasets of Greater Los Angeles (GLA), Greater Bay Area (GBA), and San Diego (SD). The figure here shows an illustration.
In LargeST we also provide comprehensive metadata for all sensors, as listed below.

| Attribute | Description | Possible Range of Values |
|---|---|---|
| ID | The identifier of a sensor in PeMS | 6- to 9-digit number |
| Lat | The latitude of a sensor | Real number |
| Lng | The longitude of a sensor | Real number |
| District | The district of a sensor in PeMS | 3, 4, 5, 6, 7, 8, 10, 11, 12 |
| County | The county of a sensor in California | String |
| Fwy | The highway where a sensor is located | String starting with 'I', 'US', or 'SR' |
| Lane | The number of lanes where a sensor is located | 1, 2, 3, 4, 5, 6, 7, 8 |
| Type | The type of a sensor | Mainline |
| Direction | The direction of the highway | N, S, E, W |
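For quick exploration, the metadata can be loaded with pandas. The snippet below is a minimal sketch, not part of the official pipeline; `ca_meta.csv` is a placeholder name for the metadata .csv file downloaded from Kaggle.

```python
# Minimal sketch: inspect the sensor metadata with pandas.
# 'ca_meta.csv' is a placeholder for the metadata .csv file from Kaggle.
import pandas as pd

meta = pd.read_csv('data/ca/ca_meta.csv')
print(meta.columns.tolist())                 # ID, Lat, Lng, District, County, Fwy, Lane, Type, Direction
print(meta['District'].value_counts())       # number of sensors per PeMS district
la = meta[meta['County'] == 'Los Angeles']   # subset the sensors in one county
print(len(la))
```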
In this section, we will outline the procedure for preparing the CA dataset, followed by an explanation of how the GLA, GBA, and SD datasets can be derived from CA. Please follow these instructions step by step.
We host the CA dataset on Kaggle: https://www.kaggle.com/datasets/liuxu77/largest. This link contains a total of 7 files. Among them, 5 files in .h5 format contain the raw traffic flow data from 2017 to 2021, 1 file in .csv format provides the metadata for all sensors, and 1 file in .npy format stores the adjacency matrix constructed from road network distances.
If you are using the web user interface, you can download all data from the provided link; the download button is at the upper right corner of the webpage. Then please place the downloaded `archive.zip` file in the `data/ca` folder and unzip it.
If you would like to use the Kaggle API, please follow the instructions here. After setting up the API correctly, you can simply go to the `data/ca` folder and use the command below to download all data.
kaggle datasets download liuxu77/largest
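Alternatively, the same download can be driven from Python through the kaggle package. This is a minimal sketch and assumes your Kaggle credentials (~/.kaggle/kaggle.json) are already configured as described above.

```python
# Minimal sketch: download and unzip the LargeST files into data/ca with the
# kaggle Python package. Assumes Kaggle API credentials are already set up.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files('liuxu77/largest', path='data/ca', unzip=True)
```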
Note that the raw traffic flow data of the CA dataset require additional processing (described in Sections 1.2 and 1.3), while the metadata and adjacency matrix are ready to use.
We provide a Jupyter notebook, `process_ca_his.ipynb`, in the folder `data/ca` to process and generate a cleaned version of the flow data. Please go through this notebook.
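For orientation, the sketch below shows the general kind of cleaning such a notebook performs: reading a raw .h5 file, dropping sensors with too many missing values, and imputing the remaining gaps. The file names, the 50% threshold, and the imputation method are illustrative assumptions; `process_ca_his.ipynb` remains the authoritative procedure.

```python
# Illustrative only -- run process_ca_his.ipynb for the actual pipeline.
# File names, the missing-value threshold, and the imputation are assumptions.
import pandas as pd

raw = pd.read_hdf('data/ca/ca_his_raw_2019.h5')             # hypothetical raw file name
keep = raw.columns[raw.isna().mean() < 0.5]                 # drop sensors with too many gaps
clean = raw[keep].interpolate(limit_direction='both')       # fill the remaining gaps
clean.to_hdf('data/ca/ca_his_2019.h5', key='t', mode='w')   # hypothetical output name
```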
Please go to the `data` folder and use the command below to generate the flow data for model training in our manuscript.
python generate_data_for_training.py --dataset ca --years 2019
The processed data are stored in `data/ca/2019`. We also support using data from multiple years; for example, change the `--years` argument to 2018_2019 to generate two years of data.
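As a rough mental model of what this step produces, the cleaned flow series is cut into sliding input/output windows for supervised training. The toy shapes and window lengths below are illustrative and do not reflect the script's exact on-disk format.

```python
# Schematic of sliding-window sample construction; toy sizes, illustrative only.
import numpy as np

flow = np.random.rand(2016, 100, 1)   # (time steps, sensors, features); the real CA data is far larger
in_len, out_len = 12, 12              # 12 historical steps predicting 12 future steps (assumed lengths)
xs, ys = [], []
for t in range(flow.shape[0] - in_len - out_len + 1):
    xs.append(flow[t:t + in_len])
    ys.append(flow[t + in_len:t + in_len + out_len])
x, y = np.stack(xs), np.stack(ys)
print(x.shape, y.shape)               # (1993, 12, 100, 1) (1993, 12, 100, 1)
```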
We describe the generation of the GLA dataset as an example. Please first go through all the cells in the provided Jupyter notebook `generate_gla_dataset.ipynb` in the folder `data/gla`. Then, use the command below to generate the traffic flow data for model training.
python generate_data_for_training.py --dataset gla --years 2019
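Conceptually, the notebook carves a sub-network out of CA: it selects the sensor IDs belonging to the Greater Los Angeles area and slices the flow data and adjacency matrix down to those sensors. The sketch below is a rough illustration; the selection criterion and file names are placeholders, and the notebook defines the actual region.

```python
# Conceptual sketch of deriving a sub-dataset from CA; the selection criterion
# and file names are placeholders -- generate_gla_dataset.ipynb is authoritative.
import numpy as np
import pandas as pd

meta = pd.read_csv('data/ca/ca_meta.csv')                  # hypothetical metadata file name
sub = meta[meta['District'] == 7]                          # placeholder criterion for the sub-region
idx = sub.index.to_numpy()

adj = np.load('data/ca/ca_rn_adj.npy')                     # hypothetical adjacency file name
np.save('data/gla/gla_rn_adj.npy', adj[np.ix_(idx, idx)])  # adjacency restricted to the subset
```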
We conduct experiments on a computing server with an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30 GHz and 376 GB of RAM, equipped with an NVIDIA RTX A6000 GPU with 48 GB of memory. We adopt PyTorch 1.12 as the default deep learning library. Currently, there are a total of 12 supported baselines in this repository, namely Historical Last (HL), LSTM, DCRNN, AGCRN, STGCN, GWNET, ASTGCN, STTN, STGODE, DSTAGNN, DGCRN, and D2STGNN.
To reproduce the benchmark results in the manuscript, please go to `experiments/baseline_you_want_to_run`, open the provided `run.sh` file, and uncomment the line you would like to execute. Note that you may need to specify the GPU card number for your server. Moreover, we use the flow data from 2019 for model training in our manuscript; if you want to use multiple years of data, please change the `--years` argument to, e.g., 2018_2019.
To run the LSTM baseline, for example, you may execute this command in the terminal:
bash experiments/lstm/run.sh
or directly execute the Python file in the terminal:
python experiments/lstm/main.py --device cuda:2 --dataset SD --years 2019 --model_name lstm --seed 2023 --bs 64
You may first go through the implementations of the various baselines in our repository, which may serve as good references for experimenting with your own model. The detailed steps are described as follows.
1. Add the implementation of your model in the folder `src/models`. To ensure compatibility with the existing framework, it is recommended that your model inherit the BaseModel class (implemented in `src/base/model.py`); see the sketch after this list.
2. If your model can be trained with the provided engine (implemented in `src/base/engine.py`), you can directly use it for training and evaluation. Otherwise, please include a file in the folder `src/engines`.
3. Create a `main.py` file in the `experiments/your_model_name` directory.
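As a starting point, a new model file might look like the sketch below. The constructor arguments and the attributes used (`input_dim`, `output_dim`, `horizon`) are assumptions about what the BaseModel class provides; check `src/base/model.py` for the actual interface.

```python
# Hypothetical sketch of a custom model in src/models/my_model.py.
# BaseModel's constructor arguments and the attributes used below are
# assumptions; adapt them to the actual definitions in src/base/model.py.
import torch.nn as nn

from src.base.model import BaseModel


class MyModel(BaseModel):
    def __init__(self, hidden_dim=64, **args):
        super(MyModel, self).__init__(**args)
        # input_dim, output_dim, and horizon are assumed to come from BaseModel
        self.encoder = nn.GRU(self.input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, self.horizon * self.output_dim)

    def forward(self, x, label=None):
        # x is assumed to have shape (batch, time, nodes, features)
        b, t, n, f = x.shape
        h, _ = self.encoder(x.permute(0, 2, 1, 3).reshape(b * n, t, f))
        out = self.decoder(h[:, -1, :]).reshape(b, n, self.horizon, self.output_dim)
        return out.transpose(1, 2)    # (batch, horizon, nodes, features)
```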
The LargeST benchmark dataset is released under a CC BY-NC 4.0 International License: https://creativecommons.org/licenses/by-nc/4.0. Our code implementation is released under the MIT License: https://opensource.org/licenses/MIT. The license of any specific baseline method used in our codebase should be verified in its official repository. Here we would also like to express our gratitude to the authors of the baselines for releasing their code.
If you find our work useful in your research, please cite:
@inproceedings{liu2023largest,
title={LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting},
author={Liu, Xu and Xia, Yutong and Liang, Yuxuan and Hu, Junfeng and Wang, Yiwei and Bai, Lei and Huang, Chao and Liu, Zhenguang and Hooi, Bryan and Zimmermann, Roger},
booktitle={Advances in Neural Information Processing Systems},
year={2023}
}