microsoft / qlib

Qlib is an AI-oriented quantitative investment platform that aims to realize the potential, empower research, and create value using AI technologies in quantitative investment, from exploring ideas to implementing productions. Qlib supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and RL.
https://qlib.readthedocs.io/en/latest/
MIT License

Ticker name "NA" makes the exists_qlib_data function report errors. #1720

Open OzzyXu opened 6 months ago

OzzyXu commented 6 months ago

🐛 Bug Description

The ticker name "NA" in all.txt under /instruments makes the exists_qlib_data function fail, because pandas parses the string "NA" as the float nan (a missing value) instead of keeping it as a string.

To Reproduce

Steps to reproduce the behavior:

  1. Save the attached all.txt under ~/.qlib/qlib_data/us_data/instruments.
  2. Run the following code:

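    import sys

    from qlib.constant import REG_US
    from qlib.utils import exists_qlib_data

    # scripts_dir must point to the scripts directory of the qlib repository
    provider_uri = "~/.qlib/qlib_data/us_data_new"  # target_dir
    if not exists_qlib_data(provider_uri):
        print(f"Qlib data is not found in {provider_uri}")
        sys.path.append(str(scripts_dir))
        from get_data import GetData

        GetData().qlib_data(target_dir=provider_uri, region=REG_US)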

Expected Behavior

The code should run without errors.

Environment

Note: Users can run cd scripts && python collect_info.py all from the project directory to get system information and paste it here directly.

Additional Notes

  1. The bug is caused by incorrect use of pandas.read_csv in the following line of exists_qlib_data in qlib\utils\__init__.py. By default, read_csv treats the string "NA" as a missing value. Refer to the page for more details.
    miss_code = set(pd.read_csv(_instrument, sep="\t", header=None).loc[:, 0].apply(str.lower)) - set(code_names)
  2. The cause of the bug can be further verified by the following code:
    temp = pd.read_csv("all.txt", sep="\t", header=None).loc[:, 0]
    non_string_values = [i for i in temp if not isinstance(i, str)]
    print(non_string_values)  # prints [nan]
  3. The bug can be fixed by passing keep_default_na=False to read_csv:
    temp = pd.read_csv("all.txt", sep="\t", header=None, keep_default_na=False).loc[:, 0]
    non_string_values = [i for i in temp if not isinstance(i, str)]
    print(non_string_values)  # prints []
  4. I can help with the fix. Which tests should I run to make sure the fix does not cause any other issues?
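
For reference, the notes above can be checked without a qlib installation. The following is a minimal sketch, not the qlib implementation: the sample rows and the helper names read_codes and lowercase_codes are made up for illustration, and the helper only mirrors the read_csv call quoted in note 1.

```python
import io

import pandas as pd

# Hypothetical instruments content (ticker, start date, end date), tab-separated.
SAMPLE = "AAPL\t1999-12-31\t2020-11-10\nNA\t2010-01-04\t2020-11-10\n"


def read_codes(text, **read_csv_kwargs):
    """Read the first column the same way the quoted line in exists_qlib_data does."""
    df = pd.read_csv(io.StringIO(text), sep="\t", header=None, **read_csv_kwargs)
    return df.loc[:, 0].tolist()


def lowercase_codes(codes):
    """Lower-case every code; raises TypeError if a code is not a string."""
    return {str.lower(code) for code in codes}


# Default parsing: "NA" is treated as a missing value and becomes a float NaN.
default_codes = read_codes(SAMPLE)
print([type(code).__name__ for code in default_codes])  # ['str', 'float']

# With keep_default_na=False the ticker stays a plain string.
fixed_codes = read_codes(SAMPLE, keep_default_na=False)
print(fixed_codes)  # ['AAPL', 'NA']
print(lowercase_codes(fixed_codes) == {"aapl", "na"})  # True

# The default result cannot be lower-cased, which is the failure reported here.
try:
    lowercase_codes(default_codes)
except TypeError as err:
    print(f"TypeError: {err}")
```

Note that keep_default_na=False disables all of the default missing-value markers for this read, not only "NA", so other ticker-like strings such as "NULL" or "NaN" would also be kept as text.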

SunsetWolf commented 5 months ago

Would you like to create a PR to fix this and become one of the contributors to qlib?

OzzyXu commented 5 months ago

@SunsetWolf Sure. I will double-check whether my fix causes any issues, and if it does not, I will create a PR. I am happy to be a contributor to Qlib and to help with other issues.

SarthakNikhal commented 5 months ago

@OzzyXu Let me know about it. I'd like to help.