noaa-oar-arl / monetio

The Model and ObservatioN Evaluation Tool I/O package
https://monetio.readthedocs.io
MIT License

AERONET parallel different dates #100

Closed zmoon closed 1 year ago

zmoon commented 1 year ago

Currently you get an extra day of data if you use parallel compared to serial. I made a note about this some time ago: https://github.com/noaa-oar-arl/monetio/blob/9c817659676fa1533d135547fc65fb0b00211f76/monetio/obs/aeronet.py#L154-L155

zmoon commented 1 year ago

Started addressing this and discovered something weird: when a two-day request is split into two one-day requests, we get less data in total than from the single request.

import pandas as pd

from monetio import aeronet

# One request
t = pd.to_datetime(["2019/09/01", "2019/09/02", "2019/09/03"])
df1 = aeronet.add_data(t)
assert not df1.duplicated().any()

# Split
df2a = aeronet.add_data(t[:2])
df2b = aeronet.add_data(t[1:])
assert not df2a.duplicated().any()
assert not df2b.duplicated().any()

assert len(df1) > len(df2a) + len(df2b)

# Which rows are only in the one-request results?
df2 = pd.concat([df2a, df2b], ignore_index=True)
df_all = df1.merge(df2, on=["siteid", "time"], how="left", indicator=True)
assert df_all._merge.value_counts()["right_only"] == 0
df_ = df_all.query("_merge == 'left_only'")
print(len(df_), "rows unique to the single request version")
print(df_.time.min(), "...", df_.time.max())
print(sorted(df_.siteid.unique()))
Reading Aeronet Data...
Reading Aeronet Data...
Reading Aeronet Data...
670 rows unique to the single request version
2019-09-02 00:00:10 ... 2019-09-02 03:47:57
['ARM_SGP', 'Bakersfield', 'CalTech', 'Cascade_Airport', 'Cliff_Creek_1', 'Cliff_Creek_2', 'Cliff_Creek_3', 'Cliff_Creek_4', 'Cliff_Creek_5', 'Cliff_Creek_6', 'Fort_McMurray', 'Fresno_2', 'Grizzly_Bay', 'Kelowna_UAS', 'Kluane_Lake', 'MAXAR_FUTON', 'McCall_AB_Standard', 'McCall_Dragon_1', 'McCall_Dragon_3', 'McCall_Dragon_4', 'McCall_Dragon_5', 'McCall_Dragon_6', 'McCall_Dragon_8', 'Meridian_DEQ', 'Missoula', 'Missoula_Health_Dpt', 'Missoula_Pt_Six', 'Missoula_Waterworks', 'Monterey', 'NASA_Ames', 'NEON_BONA', 'NEON_CLBJ', 'NEON_CVALLA', 'NEON_HEAL', 'NEON_MOAB', 'NEON_NIWO', 'NEON_OAES', 'NEON_ONAQ', 'NEON_SJER', 'NEON_Sterling', 'NEON_TOOL', 'NEON_WOOD', 'NEON_WREF', 'NEON_YELL', 'PNNL', 'Pinehurst_Idaho', 'Railroad_Valley2', 'Red_Mountain_Pass', 'Rexburg_Idaho', 'Rimrock', 'SDSU_IPLab', 'Saturn_Island', 'TABLE_MOUNTAIN_CA', 'Taylor_Ranch_TWRS', 'UACJ_UNAM_ORS', 'Univ_of_Houston', 'Univ_of_Lethbridge', 'Univ_of_Nevada-Reno', 'White_Sands_HELSTF']
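
The comparison above needs network access to the AERONET web service. As a minimal offline sketch of the same technique (a left merge with `indicator=True` to find rows present only in the single-request frame), the following uses synthetic data. The site name `SiteA`, the `aod` column, and the 4-hour gap are made up for illustration and are not AERONET results.

```python
import pandas as pd

# Synthetic stand-in for the single-request result (hypothetical data):
# 48 hourly rows covering 2019-09-01 00:00 through 2019-09-02 23:00.
times = pd.date_range("2019-09-01", periods=48, freq=pd.Timedelta(hours=1))
df1 = pd.DataFrame({"siteid": "SiteA", "time": times, "aod": 0.1})

# Pretend the second split request lost its first 4 hours.
df2a = df1[df1.time < "2019-09-02"]
df2b = df1[df1.time >= "2019-09-02 04:00"]
df2 = pd.concat([df2a, df2b], ignore_index=True)

# A left merge with indicator=True labels each df1 row as "both" or "left_only".
merged = df1.merge(
    df2[["siteid", "time"]], on=["siteid", "time"], how="left", indicator=True
)
missing = merged[merged["_merge"] == "left_only"]

print(len(missing), "rows only in the single-request frame")
print(missing.time.min(), "...", missing.time.max())
```

In this toy case the rows flagged `left_only` all fall at the start of the second day, which is the same pattern as in the real output above.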

Corresponding URLs:

Comparing line counts of the raw responses: 32302 - (15224 + 16415) = 663. This suggests that the problem is in the web service, not the reader. But accounting for the header lines, we would get 669, not 670, which is not quite consistent with the reader results (though maybe my line count is off by one).
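
The line-count arithmetic can be written out as a short sketch. The figure of 6 non-data lines per response is an assumption inferred from the gap between 663 and 669; the thread does not state it directly.

```python
# Raw response line counts quoted in the comment above.
single = 32302
split_a, split_b = 15224, 16415

# ASSUMPTION: each response carries 6 non-data (header) lines.
# Inferred from 669 - 663; not stated in the thread.
header_lines = 6

# Difference in raw line counts.
raw_diff = single - (split_a + split_b)

# Difference in data rows: the split version has two headers, the single one.
data_diff = (single - header_lines) - (
    (split_a - header_lines) + (split_b - header_lines)
)

print(raw_diff)   # 663
print(data_diff)  # 669
```

Because the split version includes one more header than the single request, the data-row difference is the raw difference plus one header's worth of lines.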

zmoon commented 1 year ago

Ilya from AERONET has fixed the above issue. It seems that a few rows were missing from the beginning of each AERONET request result, but this is now fixed.