sunpy / sunpy-soar

A sunpy plugin for accessing data in the Solar Orbiter Archive (SOAR).
https://docs.sunpy.org/projects/soar/
BSD 2-Clause "Simplified" License
18 stars 12 forks source link

Enchancing Metadata Support for SOAR #118

Closed NucleonGodX closed 4 months ago

NucleonGodX commented 6 months ago

This Pull Request is part of Google Summer of Code (GSoC) 2024, focused on enhancing metadata support for SOAR in sunpy-soar (as discussed in issue #46 ). This initial phase involves integrating attributes from the existing sunpy.net attribute system. Specifically, join methods have been developed, and the Detector attribute has been successfully implemented for remote-sensing instruments.

nabobalis commented 6 months ago

I'll review this in a bit more detail but can you add unit tests to ensure that detector works in a query as well as the construct methods create the expected queries?

NucleonGodX commented 6 months ago

I've added tests, let me know if its fine.

NucleonGodX commented 5 months ago

Based on my research and the insights provided by Eric, I have implemented the wavelength functionality. I believe we should discuss the details of this implementation and further changes needed in this week's meeting.

nabobalis commented 5 months ago

Can you add a changelog entry.

NucleonGodX commented 5 months ago
time = a.Time("2022-03-01 00:00", "2022-03-02 00:00")
level = a.Level(1)
product = a.soar.Product("EUI-FSI174-IMAGE")
detector= a.Detector("FSI")
result = Fido.search(instrument & time & level & product & detector)
print(result)

For some reason using detector= "FSI" in query results in an API error

raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 408 Client Error:  for url: http://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=json&QUERY=SELECT+h1.instrument,%20h1.descriptor,%20h1.level,%20h1.begin_time,%20h1.end_time,%20h1.data_item_id,%20h1.filesize,%20h1.filename,%20h1.soop_name,%20h2.detector,%20h2.wavelength,%20h2.dimension_index+FROM+v_sc_data_item%20AS%20h1%20JOIN%20v_eui_sc_fits%20AS%20h2%20USING%20(data_item_oid)+WHERE+h1.instrument='EUI'+AND+h1.begin_time%3E='2022-03-01+00:00:00'+AND+h1.begin_time%3C='2022-03-02+00:00:00'+AND+h2.dimension_index='1'+AND+h1.level='L1'+AND+h1.descriptor='eui-fsi174-image'+AND+h2.Detector='FSI'

if the query is done without taking detector in it, it gives normal response like for, result = Fido.search(instrument & time & level & product)

Even with an invalid detector in the query we get the normal 0 results output.

And for other detectors also it works perfectly fine for example

time = a.Time("2022-03-01 00:00", "2022-05-02 00:00")
detector= a.Detector("HRI_EUV")
result = Fido.search(instrument & time & level & detector)
22706 Results from the SOARClient:

Instrument     Data product    Level ... Detector Wavelength
                                     ...
---------- ------------------- ----- ... -------- ----------
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       ...                 ...   ... ...      ...        ...
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
       EUI eui-hrieuv174-image    L1 ...  HRI_EUV      174.0
nabobalis commented 5 months ago

So the website says:

<INFO name="QUERY_STATUS" value="ERROR"> Cannot parse query 'SELECT h1.instrument, h1.descriptor, h1.level, h1.begin_time, h1.end_time, h1.data_item_id, h1.filesize, h1.filename, h1.soop_name, h2.detector, h2.wavelength, h2.dimension_index FROM v_sc_data_item AS h1 JOIN v_eui_sc_fits AS h2 USING ' for job '1718213015729OPE': Encountered "<EOF>" at line 1, column 236. Was expecting: "(" ... </INFO>

Could we have formated something incorrectly?

Also we should really try to get this error messages out to the user in the future.

NucleonGodX commented 5 months ago

I think you opened the link generated upto the http://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=json&QUERY=SELECT+h1.instrument,%20h1.descriptor,%20h1.level,%20h1.begin_time,%20h1.end_time,%20h1.data_item_id,%20h1.filesize,%20h1.filename,%20h1.soop_name,%20h2.detector,%20h2.wavelength,%20h2.dimension_index+FROM+v_sc_data_item%20AS%20h1%20JOIN%20v_eui_sc_fits%20AS%20h2%20USING%20(data_item_oid)+WHERE+h1.instrument=

But the entire query link is: http://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=json&QUERY=SELECT+h1.instrument,%20h1.descriptor,%20h1.level,%20h1.begin_time,%20h1.end_time,%20h1.data_item_id,%20h1.filesize,%20h1.filename,%20h1.soop_name,%20h2.detector,%20h2.wavelength,%20h2.dimension_index+FROM+v_sc_data_item%20AS%20h1%20JOIN%20v_eui_sc_fits%20AS%20h2%20USING%20(data_item_oid)+WHERE+h1.instrument=%27EUI%27+AND+h1.begin_time%3E=%272022-03-01+00:00:00%27+AND+h1.begin_time%3C=%272022-03-02+00:00:00%27+AND+h2.dimension_index=%271%27+AND+h1.level=%27L1%27+AND+h2.Detector=%27FSI%27

which to me it gives

<INFO name="QUERY_STATUS" value="ERROR"> Job timeout/aborted. </INFO>
<INFO name="HttpErrorCode" value="408"/>
<INFO name="Action" value="Handling request"/>
nabobalis commented 5 months ago

I think you opened the link generated upto the http://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=json&QUERY=SELECT+h1.instrument,%20h1.descriptor,%20h1.level,%20h1.begin_time,%20h1.end_time,%20h1.data_item_id,%20h1.filesize,%20h1.filename,%20h1.soop_name,%20h2.detector,%20h2.wavelength,%20h2.dimension_index+FROM+v_sc_data_item%20AS%20h1%20JOIN%20v_eui_sc_fits%20AS%20h2%20USING%20(data_item_oid)+WHERE+h1.instrument=

But the entire query link is: http://soar.esac.esa.int/soar-sl-tap/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=json&QUERY=SELECT+h1.instrument,%20h1.descriptor,%20h1.level,%20h1.begin_time,%20h1.end_time,%20h1.data_item_id,%20h1.filesize,%20h1.filename,%20h1.soop_name,%20h2.detector,%20h2.wavelength,%20h2.dimension_index+FROM+v_sc_data_item%20AS%20h1%20JOIN%20v_eui_sc_fits%20AS%20h2%20USING%20(data_item_oid)+WHERE+h1.instrument=%27EUI%27+AND+h1.begin_time%3E=%272022-03-01+00:00:00%27+AND+h1.begin_time%3C=%272022-03-02+00:00:00%27+AND+h2.dimension_index=%271%27+AND+h1.level=%27L1%27+AND+h2.Detector=%27FSI%27

which to me it gives

<INFO name="QUERY_STATUS" value="ERROR"> Job timeout/aborted. </INFO>
<INFO name="HttpErrorCode" value="408"/>
<INFO name="Action" value="Handling request"/>

Ah I just copied the url incorrectly.

@ebuchlin or @hayesla, if we hit a timeout like this, who can we contact to ask about it?

ebuchlin commented 5 months ago

We already had similar non-expected timeout issues for different (simpler) requests, and they have been solved on SOAR side. I tried to write the join condition explicitly (h1.data_item_oid=h2.data_item_oid). If I limit the number of results explicitly (SELECT TOP 10) I get something but it still takes too long (>30s). So I guess we should ask SOAR (I can do it).

herroyalmaj commented 5 months ago

Hi guys, you ask me 😊 Also I'd be really grateful if you could nudge me when there are things missing from the SOAR help pages - they need an overhaul 😢 I'm looking into this timeout, but instead of v_eui_sc_fits.detector (which only gives HRI_LYA, HRI_EUV or FSI), you could use v_sc_data_item.sensor (which contains at least fsi174, fsi304, hrieuv174, hrilya1216)? However, right now, even "SELECT DISTINCT sensor FROM v_sc_data_item WHERE instrument='EUI'" times out, so I'll get onto the development team. Thanks, H

herroyalmaj commented 5 months ago

For example:

SELECT h1.instrument, h1.descriptor, h1.level, h1.begin_time, h1.end_time,
h1.data_item_id, h1.filesize, h1.filename, h1.soop_name, h1.sensor,
h2.wavelength, h2.dimension_index
FROM v_sc_data_item AS h1
JOIN v_eui_sc_fits AS h2 USING (data_item_oid)
WHERE h1.instrument='EUI' AND h1.begin_time>='2022-03-01T00:00:00'
AND h1.begin_time<='2022-03-02T00:00:00' AND h2.dimension_index='1'
AND h1.level='L1' AND h1.sensor LIKE 'FSI%'

gave me the 180 results I was expecting.

herroyalmaj commented 5 months ago

Hi again, I should have mentioned that I'm the Solar Orbiter Archive Scientist. The v_sc_data_item.sensor is indexed, and so works better than v_eui_sc_fits. Also, there is a ticket to fix the lack of end time in the v_eui_sc_fits view. I can prioritise it now I know you need it.

hayesla commented 5 months ago

Hi @herroyalmaj thanks for this info!

Any idea what the timeout issue is?

hayesla commented 5 months ago

for SPICE, the wavemin/wavemax only has the first window (even if there is data in the file from other windows), hence it would be unclear and potentially misleading to a user if we return this information. Hence what we should do for SPICE is that this information is not returned from the table and a user cannot search over these.

ebuchlin commented 5 months ago

Hi again, I should have mentioned that I'm the Solar Orbiter Archive Scientist. The v_sc_data_item.sensor is indexed, and so works better than v_eui_sc_fits. Also, there is a ticket to fix the lack of end time in the v_eui_sc_fits view. I can prioritise it now I know you need it.

Thanks for the advice to use this indexed column. Maybe indexed columns could be indicated in the Tables, Views, and Columns help page? I understand that timeouts can be expected when using WHERE conditions on non-indexed columns.

hayesla commented 4 months ago

@NucleonGodX I think the last thing to do is to not return the PHI wavelengths now until we resolve the issue with the A vs nm. So I think we should just not return it in this PR. And similarly we should not return this for SPICE. Like we discussed on our call yesterday/

So for these both, we should maybe not even return the wavelength column. Let me know if this isnt clear.

Apart from these this PR is good to go!

nabobalis commented 4 months ago

Thanks for the reviews @ebuchlin and @hayesla.

I will run the CI and merge if it passes.

nabobalis commented 4 months ago

Thanks for the first stage of the project @NucleonGodX, you can now rebase the other two PRs to remove the commits that came from this PR.