spedas / pyspedas

Python-based Space Physics Environment Data Analysis Software
https://pyspedas.readthedocs.io/
MIT License

Bug in pyspedas.kyoto.dst for Dst data downloading #755

Closed HongfanChen closed 6 months ago

HongfanChen commented 6 months ago

The Dst data on the Kyoto website are not space-separated when a Dst value has three or more digits (which is the case for large storms). The current parse_html function uses the str.split() method, so adjacent values run together and are not parsed correctly. It can be fixed by using re.findall. See the modified version below. This may not be the best solution, but it works for me.

import re


def parse_html(html_text, year=None, month=None):
    """
    Parses the HTML content to extract relevant information.

    Parameters
    ----------
    html_text : str
        The HTML content to parse.
    year : str
        The year of the page being parsed. It is concatenated into each timestamp, so it must be a string.
    month : str
        The month of the page being parsed. It is concatenated into each timestamp, so it must be a string.

    Returns
    -------
    tuple
        A tuple (times, data): the timestamps and the corresponding hourly Dst values.
    """
    times = []
    data = []
    # remove all of the HTML before the table
    html_data = html_text[html_text.find("Hourly Equatorial Dst Values") :]
    # remove all of the HTML after the table
    html_data = html_data[: html_data.find("<!-- vvvvv S yyyymm_part3.html vvvvv -->")]
    html_lines = html_data.split("\n")
    data_strs = html_lines[5:]
    # loop over days
    for day_str in data_strs:
        # the first element of hourly_data is the day, the rest are the hourly Dst values
        # original (fails when values are not space separated): hourly_data = day_str.split()
        hourly_data = re.findall(r'[-+]?\d+', day_str)
        if len(hourly_data[1:]) != 24:
            continue
        for idx, dst_value in enumerate(hourly_data[1:]):
            times.append(
                time_double(
                    year + "-" + month + "-" + hourly_data[0] + "/" + str(idx) + ":30"
                )
            )
            data.append(float(dst_value))

    return (times, data)
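
To make the difference between the two tokenizers concrete, here is a minimal sketch on a synthetic row. The values are made up, and the fixed-width layout (a two-character day followed by four-character columns) is an assumption for illustration rather than a confirmed description of the Kyoto page.

```python
import re

# Synthetic row, NOT real Kyoto data: day 17 followed by 24 hourly values,
# each right-aligned in a 4-character column (an assumed layout).
values = [-50, -75, -99, -100, -150, -223] * 4
row = "17" + "".join(f"{v:4d}" for v in values)

# str.split() only breaks on whitespace, so values that fill their whole
# column (e.g. "-100") stay glued to their neighbours.
split_tokens = row.split()

# The regex picks out each optionally signed integer, with or without
# whitespace between them.
regex_tokens = re.findall(r"[-+]?\d+", row)

print(len(split_tokens), len(regex_tokens))
```

With this row, split() produces 13 tokens while the regex produces 25 (the day plus 24 values), so only the regex version passes the length check in the loop above.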

jameswilburlewis commented 6 months ago

Thanks for letting us know and proposing a patch! Could you give us an example time range where the problem occurs, so we can add it to our test suite?

HongfanChen commented 6 months ago

Sure, no problem. Try the 2015-03-17 storm, using the time range ["2015-03-16", "2015-03-19"]. The two lines of data corresponding to Mar 17 and Mar 18 will be missing if the original code is applied. You can check the data on the website as well: https://wdc.kugi.kyoto-u.ac.jp/dst_final/201503/index.html
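
The way whole days go missing can be sketched offline with synthetic rows. This is a rough illustration only: the helper names format_row and kept_days are hypothetical, the values are invented rather than taken from the Kyoto page, and the fixed-width layout is assumed. It mirrors the length check in parse_html, which silently skips any row that does not yield 24 values.

```python
import re


def format_row(day, values):
    """Build a synthetic fixed-width row (assumed layout, not real data)."""
    return f"{day:2d}" + "".join(f"{v:4d}" for v in values)


def kept_days(rows, tokenize):
    """Return the days that survive the 24-value length check."""
    kept = []
    for row in rows:
        tokens = tokenize(row)
        # Same check as in parse_html: rows without 24 values are skipped.
        if len(tokens[1:]) != 24:
            continue
        kept.append(int(tokens[0]))
    return kept


rows = [
    format_row(16, [-20] * 24),                 # quiet day
    format_row(17, [-30] * 12 + [-150] * 12),   # storm begins mid-day
    format_row(18, [-120] * 24),                # storm day
]

days_with_split = kept_days(rows, str.split)
days_with_regex = kept_days(rows, lambda r: re.findall(r"[-+]?\d+", r))

print(days_with_split, days_with_regex)
```

In this sketch the split-based tokenizer keeps only day 16, while the regex-based one keeps days 16, 17 and 18. A check of this kind on the real time range could serve as a regression test.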

jameswilburlewis commented 6 months ago

Fix released in pyspedas 1.5.5, now available on PyPI.