sunpy / sunpy

SunPy - Python for Solar Physics
http://www.sunpy.org
BSD 2-Clause "Simplified" License

Data for GOES assumes it knows what satellites are available based on date range #3337

Closed Cadair closed 4 years ago

Cadair commented 5 years ago

Currently in XRSClient we hard-code the operational dates of each GOES satellite and use the highest satellite number we can for the query. For some days in the operational range of a satellite, the files might not exist on the server. We should be smarter about how we choose the satellite number.

aringlis commented 5 years ago

So do you suggest that the client should be aware of all the operational satellites that have data available for the chosen date(s), and fall back to another satellite if data cannot be found?
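A minimal sketch of that fallback idea, not what XRSClient does today. The helper names `goes_filename` and `pick_satellite` are made up for illustration, the file name pattern is only assumed from the single download URL quoted further down in this thread, and the `exists` callable stands in for whatever check against the server would really be used:

```python
from datetime import date
from typing import Callable, Iterable, Optional


def goes_filename(sat: int, day: date) -> str:
    """Build a name like go1520100602.fits (pattern assumed, see above)."""
    return f"go{sat:02d}{day:%Y%m%d}.fits"


def pick_satellite(
    day: date,
    candidates: Iterable[int],
    exists: Callable[[str], bool],
) -> Optional[int]:
    """Return the highest-numbered candidate that has a file for `day`.

    Falls back to lower satellite numbers when the preferred one has no
    file, and returns None when no candidate has data.
    """
    for sat in sorted(candidates, reverse=True):
        if exists(goes_filename(sat, day)):
            return sat
    return None


# Made-up server listing, for illustration only.
listing = {"go1420100602.fits"}
chosen = pick_satellite(date(2010, 6, 2), [13, 14, 15], listing.__contains__)
# chosen is 14: GOES-15 is tried first, has no file, so we fall back.
```

Trying the highest number first keeps the current default behaviour when the file is there, and only falls back when it is not.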

Related to this, it is possible to search using the a.goes.SatelliteNumber attribute to manually specify which satellite you want, but it does not seem to work as intended, e.g.:



```
In [69]: result
Out[69]:
<sunpy.net.fido_factory.UnifiedResponse object at 0x7f8c38647eb8>
Results from 1 Provider:

2 Results from the XRSClient:
     Start Time           End Time      Source Instrument Wavelength
       str19               str19         str4     str4       str3
------------------- ------------------- ------ ---------- ----------
2010-06-01 00:00:00 2010-06-01 23:59:59   nasa       goes        nan
2010-06-02 00:00:00 2010-06-02 23:59:59   nasa       goes        nan
```

The above should not work, as it is before GOES-15 data are available. But, the search result gives the impression that results are found. However:

```
In [70]: Fido.fetch(result)
Files Downloaded:   0%|                                                                                      | 0/2 [00:00<?, ?file/s]
Out[70]:
<parfive.results.Results object at 0x7f8c3863aeb8>
[]
Errors:
(error(filepath_partial=<function Downloader.enqueue_file.<locals>.filepath at 0x7f8c4c7467b8>, url='https://umbra.nascom.nasa.gov/goes/fits/2010/go1520100602.fits', exception=FailedDownload()
```

I truncated the error message, but you can see that when you try to actually download the GOES data, it does not exist. The same search and retrieval works correctly when GOES-14 is specified, as it should.

aringlis commented 5 years ago

The hardcoded dates for the GOES satellite operations also do not account for all the times that data are available. For example, 2016 and 2017 both have data available from GOES-13 and GOES-14 as well as GOES-15.

IIRC the dates are sourced from NOAA and represent in some way which spacecraft is 'primary'.

abhijeetmanhas commented 4 years ago

If multiple satellites have data for the same date, what should we do? Should we add one more attr, 'satellite_no' or something similar, with a default satellite number (mentioned in the docs)? If data is not available, we would just give an error message or return empty results.

Suggestions?

hayesla commented 4 years ago

There is already an attr for GOES, a.goes.SatelliteNumber: https://github.com/sunpy/sunpy/blob/master/sunpy/net/dataretriever/attrs/goes.py

If none is given, the client currently returns data for the satellite that was operational at the time, which is hard-coded.

What would be great to see is if you search a time range then data that is available for that range is provided - for example GOES 15 and GOES 13.

hayesla commented 4 years ago

I think NOAA also plans to release the GOES 13 data for the past solar cycle with the already available GOES 15 data.

abhijeetmanhas commented 4 years ago

I found a small hack. Since the scraper's filelist already opens the directory listing (in this case, the year), if we pass satellite_number = r'\d{2}', I get every FITS file for all the GOES satellites that are actually available on that page (which means we don't need to use the _get_goes_sat_num function).

So this is intelligent enough and doesn't need hardcoded dates. @hayesla I will open a PR for it, if that is fine.

2) The other solution is to make multiple scrapers (one for every sat_number) and check each. But this would be very slow, since it means fetching 14 HTML pages again and again. Which one should I implement?
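To make the first option concrete, here is a rough sketch of matching a two-digit wildcard against one directory listing. It uses plain `re` rather than the real Scraper class, the function name `group_by_satellite` is hypothetical, the file name pattern is only assumed from the one URL quoted earlier in this thread, and the listing is made up:

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed pattern: go<NN><YYYYMMDD>.fits, where NN is the satellite number.
GOES_FILE = re.compile(r"go(?P<sat>\d{2})(?P<date>\d{8})\.fits")


def group_by_satellite(filenames: Iterable[str]) -> Dict[int, List[str]]:
    """Group matching file dates by satellite number from a single listing."""
    found = defaultdict(list)
    for name in filenames:
        match = GOES_FILE.fullmatch(name)
        if match:  # anything that is not a GOES FITS file is skipped
            found[int(match.group("sat"))].append(match.group("date"))
    return dict(found)


# Made-up listing of one year directory, for illustration only.
listing = [
    "go1320160101.fits",
    "go1520160101.fits",
    "go1520160102.fits",
    "index.html",
]
available = group_by_satellite(listing)
```

One pass over one page gives every satellite that has files, which is why this avoids both the hardcoded dates and the 14 separate scrapers.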

Cadair commented 4 years ago

> What would be great to see is if you search a time range then data that is available for that range is provided - for example GOES 15 and GOES 13.

If I remember correctly there are two major components to this:

1) Scraper needs to be able to handle matching and returning values in paths which are wildcard types, i.e. "match all the satellite numbers". The values of these fields need to be returned to the caller somehow as categorical data (probably along with time and everything else, which relates to a solution to #3715).
2) We need a way of displaying all this metadata about the URLs to the user of Fido. We don't want to implement this currently, as the user would get duplicate results printed in the results of search() with no way to disambiguate them. For this component we need to allow dataretriever classes to specify what they want to display in their results tables, which is #3321 and is also related to #3368.
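As a rough illustration of the two components together (not the Scraper or Fido API), the sketch below pulls the wildcard value out of each URL and carries it as a column, so rows for the same day no longer look like duplicates. `Row`, `rows_from_urls` and the example.org URLs are all hypothetical, and the URL pattern is assumed from the one URL quoted earlier in this thread:

```python
import re
from typing import Iterable, List, NamedTuple


class Row(NamedTuple):
    date: str       # ISO date parsed from the file name
    satellite: int  # value matched by the wildcard
    url: str


# Assumed pattern: .../go<NN><YYYY><MM><DD>.fits
URL_PATTERN = re.compile(r"/go(\d{2})(\d{4})(\d{2})(\d{2})\.fits$")


def rows_from_urls(urls: Iterable[str]) -> List[Row]:
    """Return one row per matching URL, sorted by date and then satellite."""
    rows = []
    for url in urls:
        match = URL_PATTERN.search(url)
        if match is None:
            continue
        sat, year, month, day = match.groups()
        rows.append(Row(f"{year}-{month}-{day}", int(sat), url))
    return sorted(rows)


# Hypothetical URLs, for illustration only.
rows = rows_from_urls([
    "https://example.org/goes/fits/2016/go1520160101.fits",
    "https://example.org/goes/fits/2016/go1320160101.fits",
])
```

Without the `satellite` column these two rows would print identically in a results table, which is the duplicate problem described in 2).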

So in summary, a proper solution to this issue is tightly coupled with a lot of things, many of which I hope are covered by the scope of the GSOC project idea I wrote up.