microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License
15.26k stars 2.61k forks

Qlib to load CSV minute-level trading data #1775

Open dReamix opened 6 months ago

dReamix commented 6 months ago

Hi there,

I am new to Qlib. I looked up my questions online and asked an LLM, but have found no solution so far.

Here is what I am facing:

I have 1-minute trading data in more than 10 CSV files, each over 500MB. All the files follow the same format: [instrument, time, open, high, low, close, volume, turnover, is_paused]. The 'instrument' column holds the asset code, so a single file contains many stock codes. The 'time' column holds the trading timestamp, e.g. '1/2/2019 9:53:00 AM'.

Problems:

1. All the CSV files are in one folder. I tried running 'python dump_bin.py dump_all --csv_path 'csv file folder path' --qlib_dir 'target file path' --symbol_field_name instrument --date_field_name time --include_fields open,high,low,close,volume,turnover,is_paused'.

The system then returned 'concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.'

Is this caused by running out of memory (are the files too large)?

I ask because when I put only one CSV file in the folder, 'python dump_bin.py' worked, partially.

2. After I 'successfully' ran 'python dump_bin.py', I checked the Qlib data dir. There are 3 folders: calendar, features, and instruments. However, in the instruments folder I only see an 'all.txt' file, and it has only one row: the CSV file name, start date, and end date.

There is a 'day.txt' in the calendar folder, but it only stores day-level dates, e.g. '2019-01-02'; there are no minute timestamps.

I would appreciate it if anyone could share advice!

SunsetWolf commented 5 months ago

I think your CSV files need some preprocessing before they can be converted to bin files. A few things to keep in mind:

1. Split the data by stock code and name each file after its stock code, e.g. SH600000.csv.
2. Convert the time column from 12-hour to 24-hour format, e.g. 2010-12-01 14:34:00.
3. When running dump_bin, use --date_field_name to specify the time column and --symbol_field_name to specify the stock code column, and use --exclude_fields to exclude the stock code column and the time column, because Qlib stores them in its own way.
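The preprocessing described above could look roughly like the sketch below. It is not part of Qlib: the function names (`to_24h`, `split_by_instrument`), the `batch_rows` parameter, and the folder arguments are hypothetical, and it assumes the timestamps are month/day/year with AM/PM, as in '1/2/2019 9:53:00 AM'. It streams each source file row by row with the standard library, so a 500MB file does not have to fit in memory at once.

```python
import csv
from collections import defaultdict
from datetime import datetime
from pathlib import Path

# Assumption: source timestamps look like '1/2/2019 9:53:00 AM' (month/day/year, 12-hour clock).
SRC_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"
OUT_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_24h(raw):
    """Convert a 12-hour timestamp string to 'YYYY-MM-DD HH:MM:SS' (24-hour)."""
    return datetime.strptime(raw.strip(), SRC_TIME_FORMAT).strftime(OUT_TIME_FORMAT)


def _flush(buffers, out_dir, fieldnames):
    """Append buffered rows to one CSV per instrument; write the header only once."""
    for symbol, rows in buffers.items():
        path = out_dir / f"{symbol}.csv"
        is_new = not path.exists()
        with path.open("a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if is_new:
                writer.writeheader()
            writer.writerows(rows)
    buffers.clear()


def split_by_instrument(src_dir, out_dir, symbol_col="instrument",
                        time_col="time", batch_rows=500_000):
    """Split multi-instrument CSVs in src_dir into one CSV per instrument in out_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob("*.csv")):
        buffers = defaultdict(list)
        buffered = 0
        with src.open(newline="") as f:
            reader = csv.DictReader(f)
            fieldnames = reader.fieldnames
            for row in reader:
                row[time_col] = to_24h(row[time_col])
                buffers[row[symbol_col]].append(row)
                buffered += 1
                # Flush periodically so memory use stays bounded.
                if buffered >= batch_rows:
                    _flush(buffers, out_dir, fieldnames)
                    buffered = 0
        _flush(buffers, out_dir, fieldnames)
```

Calling `split_by_instrument("raw_csv", "per_symbol_csv")` (both folder names are placeholders) should leave one file per stock code, which can then be passed to dump_bin via --csv_path. Rows are written in the order they are read, so if the source files are not in chronological order, each output file may still need sorting by time. As far as I know, dump_bin.py also names the calendar file after its --freq option, which defaults to day; that would explain the day.txt in the question, so minute data probably needs --freq 1min as well. Please check this against the script itself before relying on it.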