sunpy / drms

Access HMI, AIA and MDI data with Python from public JSOC DRMS servers
https://docs.sunpy.org/projects/drms/en/stable/
BSD 2-Clause "Simplified" License
22 stars 22 forks source link

Incorrect interpretation of Quality flag unsigned v signed int #37

Open samaloney opened 4 years ago

samaloney commented 4 years ago

So it looks like drms is interpreting the quality flag as unsigned int but the backend service views it as a signed 32 bit int.

from drms.client import Client
c = Client()
res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI]',
                         ['T_OBS', 'TELESCOP', 'INSTRUME', 'QUALITY'])
print(res)
                     T_OBS TELESCOP      INSTRUME     QUALITY
0  2017.09.06_05:36:04_TAI  SDO/HMI  HMI_COMBINED           0
1  2017.09.06_05:48:04_TAI  SDO/HMI  HMI_COMBINED           0
2  2017.09.06_06:00:04_TAI  SDO/HMI  HMI_COMBINED       65536 (0x00010000 from web interface)
3                  MISSING  SDO/HMI       MISSING  3221225472 (0xc0000000 from web interface)

So if I add [? QUAILTY >= 0 ?] I would expect to get the same results?

res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI][? QUALITY >= 0?]',
                         ['T_OBS', 'TELESCOP', 'INSTRUME', 'QUALITY'])
print(res)
                     T_OBS TELESCOP      INSTRUME  QUALITY
0  2017.09.06_05:36:04_TAI  SDO/HMI  HMI_COMBINED        0
1  2017.09.06_05:48:04_TAI  SDO/HMI  HMI_COMBINED        0
2  2017.09.06_06:00:04_TAI  SDO/HMI  HMI_COMBINED    65536

Ok what if I try [? QUAILTY < 0 ?]?

res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI][? QUALITY < 0?]',
                          ['T_OBS', 'TELESCOP', 'INSTRUME', 'QUALITY'])
print(res)
     T_OBS TELESCOP INSTRUME     QUALITY
0  MISSING  SDO/HMI  MISSING  3221225472

The only way I can make sense of this is if Quality is a signed 32 bit int as

q1 = 3221225472
np.frombuffer(q1.to_bytes(32, 'little'), count=1, dtype=np.int32)
array([-1073741824], dtype=int32)

and

q2 = 65536
np.frombuffer(q2.to_bytes(32, 'little'), count=1, dtype=np.int32)
array([65536], dtype=int32)
samaloney commented 4 years ago

Looks like it could be handled with a special case here https://github.com/sunpy/drms/blob/d4ab3b35adba43123dd2b130c05450305f07e5ab/drms/client.py#L672-L678

mbobra commented 4 years ago

Agree; drms treats QUALITY as int64 when it should treat it as int32.

This is the query in the example above:

from drms.client import Client
c = Client()
res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI]', ['T_REC', 'TELESCOP', 'QUALITY'])
res gives the following result (notice that the T_REC value of 2017.09.06_06:12:00_TAI corresponds to a QUALITY value of 3221225472):  T_REC TELESCOP QUALITY
2017.09.06_05:36:00_TAI SDO/HMI 0
2017.09.06_05:48:00_TAI SDO/HMI 0
2017.09.06_06:00:00_TAI SDO/HMI 65536
2017.09.06_06:12:00_TAI SDO/HMI 3221225472

However, as @samaloney pointed out, if we query for records with a QUALITY value less than zero, then the same T_REC shows up:

res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI][? QUALITY < 0 ?]', ['T_REC', 'TELESCOP', 'QUALITY'])
T_REC  TELESCOP QUALITY
2017.09.06_06:12:00_TAI SDO/HMI 3221225472

So QUALITY should be cast as int32:

res['QUALITY'][0].astype('int64')
> 3221225472

res['QUALITY'][0].astype('int32')
> -1073741824
kbg commented 4 years ago

The Python client just converts string representations of integers into actual integers (without any sign conversion). In the case of HMI QUALITY keywords the strings contain (unsigned) hexadecimal integers and need some special handling because Pandas apparently only supports converting strings containing decimal numbers.

The QUALITY keyword for HMI data defines a number of flags that can be represented by an unsigned 32-bit integer, where each bit represents a certain condition. For example: Having the highest bit (0x80000000) set for a record means that there is no data for the record, while having the 6th lowest bit (0x00000020) set means that the image is of poor quality, or having the 9th lowest bit (0x00000100) set means that an eclipse has occurred for at least one Level 1 images.

This means that one can check for missing data with

res['QUALITY'] & 0x80000000 != 0

or for images where an eclipse has occurred with

res['QUALITY'] & 0x00000100 != 0

An extensive list of additional HMI Level 1.5 flags is available at:

http://jsoc.stanford.edu/jsocwiki/Lev1qualBits

Unfortunately JSOC introduced an inconsistent way of querying this keyword by interpreting the QUALITY bit field as signed (two's-complement) 32-bit integers, while returning unsigned hexadecimal numbers as a result. The signed number used in queries does not really have any special meaning, it is still just a combination of bits, where setting the highest bit results in the number being negative, which is why you can select all records with existing images using [? QUALITY >= 0 ?].

@mbobra already pointed out that you could convert the QUALITY using the astype() method. To change this for the whole QUALITY column in the results, you can, for example, write something like:

res['QUALITY'] = res['QUALITY'].astype('int32')

But I'm not sure if it would be a good idea to do this by default. As far as I can see, the only reason for this would be to check with

res['QUALITY'] < 0

and

res['QUALITY'] >= 0

if the QUAL_NODATA bit (0x80000000) is set or unset. But this can also be done using

res['QUALITY'] & 0x80000000 != 0

and

res['QUALITY'] & 0x80000000 == 0

respectively, which makes more sense in my opinion, because the QUALITY keyword is just a combination of flags, and should not be (mis-)interpreted as a number.

Another issue is that this could also introduce some other inconsistencies: The actual results from JSOC are unsigned integers, which you can see by skipping the conversion for the QUALITY keyword:

res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI]',
              key='T_REC, QUALITY', skip_conversion='QUALITY')

=>
                     T_REC     QUALITY
0  2017.09.06_05:36:00_TAI  0x00000000
1  2017.09.06_05:48:00_TAI  0x00000000
2  2017.09.06_06:00:00_TAI  0x00010000
3  2017.09.06_06:12:00_TAI  0xc0000000

Here the last entry is consistent with

hex(3221225472) == '0xc0000000'

while the negative signed integer would also result in a negative hexadecimal number:

hex(-1073741824) == '-0x40000000'

Maybe we could just update the documentation and point out the inconsistency between [? QUALITY < 0 ?] queries and the results returned by JSOC (which, btw, also affects people who use the Lookdata web interface), and perhaps give a short example on how to convert the QUALITY column to signed 32-bit integers.

What do you think?

samaloney commented 4 years ago

Thanks for the quick response @mbobra and @kbg! I probably should have linked to the original issue which that stared all this which was to ignore/skip missing data when trying to download JSOC data using SunPy FIDO client which is build on drms ( https://github.com/sunpy/sunpy/issues/3735). I remembered our previous conversation about using adding [? QUALITY >= 0?] to the query, as far as I knew the meaning of the quality flags is not identical across all series so use what ever flags are needed. For example as far as I can tell from here for AIA missing data flag is bit 5?

Unfortunately JSOC introduced an inconsistent way of querying this keyword by interpreting the QUALITY bit field as signed (two's-complement) 32-bit integers, while returning unsigned hexadecimal numbers as a result. The signed number used in queries does not really have any special meaning, it is still just a combination of bits, where setting the highest bit results in the number being negative, which is why you can select all records with existing images using [? QUALITY >= 0 ?].

Yea this was the source of a lot of the confusion I didn't see how an unsigned 32-bit integer could be less than zero when doing something like [? QUALITY < 0?] so how could it return any results. I think updating the documentation to indicate that the number in such a query is interpreted as a signed 32 bit int with some examples would most definitely help.

What operations are supported inside [? ?] is this documented anywhere? From playing around looks like <, <=, ==, !=, >, >= I've tried QUALITY & 0x80000000 and QUALITY & 0x80000000 == 0 but this doesn't seem to work? If it's not possible to check individual bits I think there are some ambiguities as to whether to use >, < etc.

For example example lets say I want to filter out missing data 0x80000000 or -2147483648 as signed int.

res = c.query(
    'hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI][? QUALITY > -2147483648 ?]',
    ['T_OBS', 'TELESCOP', 'INSTRUME', 'QUALITY']
)

print(res)
                     T_OBS TELESCOP      INSTRUME     QUALITY
0  2017.09.06_05:36:04_TAI  SDO/HMI  HMI_COMBINED           0
1  2017.09.06_05:48:04_TAI  SDO/HMI  HMI_COMBINED           0
2  2017.09.06_06:00:04_TAI  SDO/HMI  HMI_COMBINED       65536
3                  MISSING  SDO/HMI       MISSING  3221225472

Now I know why this happens the additional flag 0x40000000 is set so the combined flags are 0xc0000000 or -1073741824 which is greater than -2147483648. No using -1073741824 works as expected.

res = c.query(
    'hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI][? QUALITY > -1073741824 ?]',
     ['T_OBS', 'TELESCOP', 'INSTRUME', 'QUALITY']
)

print(res)
                     T_OBS TELESCOP      INSTRUME  QUALITY
0  2017.09.06_05:36:04_TAI  SDO/HMI  HMI_COMBINED        0
1  2017.09.06_05:48:04_TAI  SDO/HMI  HMI_COMBINED        0
2  2017.09.06_06:00:04_TAI  SDO/HMI  HMI_COMBINED    65536

As by by definition for HMI [? QUALITY < 0 ?] - some form of missing data [? QUALITY == 0 ?] - no data issues [? QUALITY > 0 ?] - some quality issues but is this the case for all series if not I don't see how one can figure out if you need > or < ahead of time?

samaloney commented 4 years ago

A specific question how to query for HMI data that does not have the QUAL_NODATA (0x80000000) or QUAL_LOWINTERPNUM (0x00010000) bits set but may have other bits set? [? QUALITY < 65536?] doesn't seem to work as it return the missing data and [? QUALITY = 0?] isn't exactly the same.

qval = np.int32(0x00010000)
print(qval)
65536

res = c.query('hmi.B_720s[2017.09.06_05:40:00_TAI-2017.09.06_06:30:00_TAI][? QUALITY < 65536 ?]',
              ['T_OBS', 'TELESCOP', 'INSTRUME', 'QUALITY'])

print(response)

                     T_OBS TELESCOP      INSTRUME     QUALITY
0  2017.09.06_05:36:04_TAI  SDO/HMI  HMI_COMBINED           0
1  2017.09.06_05:48:04_TAI  SDO/HMI  HMI_COMBINED           0
2                  MISSING  SDO/HMI       MISSING  3221225472