ioos / ioosngdac

IOOS National Glider Data Assembly Center (V2)
https://ioos.github.io/ioosngdac/

How to tell if an aggregation was updated #58

Closed kwilcox closed 8 years ago

kwilcox commented 9 years ago

I've found it difficult to determine when data from a glider deployment has become available or been updated. The individual dive files are not available for download (that I could find), and scraping each deployment page looking for new or changed files seemed hacky and prone to break. I fell back to querying the DAP time variable for new timesteps. Not a terrible solution, but it does not allow me to determine what else, if anything, has changed in the aggregation since the last time I checked (for example: an old dive file was manually corrected and re-uploaded).

This has been an issue with the DAP spec for as long as it has been around (no checksums). Since the glider-dac already does some magic before aggregating the dives and making them available through DAP, could a checksum be recomputed for each deployment every time it changes and added as a global attribute? Without this I'm forced to download every glider deployment in the DAC (fairly often) just to be sure that I'm not missing any new or corrected data. With a checksum, someone could very quickly determine if anything in the deployment has changed.

If there is a better way to do what I'm suggesting please let me know, thanks!

kerfoot commented 9 years ago

I typically do this by hitting the RSS feed for each deployment:

http://data.ioos.us/gliders/erddap/rss/UCSC294-20150430T2218.rss

You can get the list of RSS feed urls by hitting this link:

http://data.ioos.us/gliders/erddap/tabledap/allDatasets.json

John
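A minimal sketch of pulling the per-deployment RSS URLs out of the `allDatasets.json` response. It assumes the usual ERDDAP `.json` table layout (`table.columnNames` plus `table.rows`) and that the table carries `datasetID` and `rss` columns; the `sample` payload and the `rss_urls` helper are made up for illustration, and the live response would have many more columns.

```python
import json


def rss_urls(all_datasets):
    """Map datasetID -> RSS URL from a parsed allDatasets.json response."""
    table = all_datasets["table"]
    names = table["columnNames"]
    id_col = names.index("datasetID")
    rss_col = names.index("rss")
    # Skip rows that have no RSS URL at all.
    return {row[id_col]: row[rss_col] for row in table["rows"] if row[rss_col]}


# Hypothetical, trimmed-down response used only for illustration.
sample = json.loads("""
{
  "table": {
    "columnNames": ["datasetID", "rss"],
    "rows": [
      ["UCSC294-20150430T2218",
       "http://data.ioos.us/gliders/erddap/rss/UCSC294-20150430T2218.rss"],
      ["ru24-20130122T1943",
       "http://data.ioos.us/gliders/erddap/rss/ru24-20130122T1943.rss"]
    ]
  }
}
""")

feeds = rss_urls(sample)
```

Against the live server, the same function could be fed `json.load(urllib.request.urlopen(url))` for the `allDatasets.json` URL above.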

kwilcox commented 9 years ago

@kerfoot this is perfect, thanks

kwilcox commented 9 years ago

@kerfoot @lukecampbell @kknee The RSS feeds are showing updated timestamps for every deployment. They are constantly updated even though there is no new data.

Example: http://data.ioos.us/gliders/erddap/rss/ru24-20130122T1943.rss, yet the dataset hasn't been actually updated in years.

kerfoot commented 9 years ago

@kwilcox : The `<description>` tag will tell you. Anything aside from:

The dataset was reloaded.

means the dataset has been modified (data added, removed, etc), not just reloaded.
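A rough sketch of that check in Python, assuming the feed is RSS with an `<item>` whose `<description>` holds the message. The two sample feeds are invented for illustration (a real ERDDAP feed has more elements and may declare an XML namespace, which is why the code compares local tag names only), and `dataset_modified` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RELOADED = "The dataset was reloaded."


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def latest_description(rss_text):
    """Return the description of the first <item>, or None if there is none."""
    root = ET.fromstring(rss_text)
    for elem in root.iter():
        if local_name(elem.tag) == "item":
            for child in elem:
                if local_name(child.tag) == "description":
                    return (child.text or "").strip()
    return None


def dataset_modified(rss_text):
    """True when the latest message is anything other than a plain reload."""
    description = latest_description(rss_text)
    return description is not None and description != RELOADED


# Invented sample feeds, for illustration only.
reloaded_feed = """<rss version="2.0"><channel><title>demo</title>
<item><title>changed</title><description>The dataset was reloaded.</description></item>
</channel></rss>"""

modified_feed = """<rss version="2.0"><channel><title>demo</title>
<item><title>changed</title><description>Some data was added.</description></item>
</channel></rss>"""
```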

kwilcox commented 9 years ago

@BobSimons Is this something you would be willing to make available elsewhere in ERDDAP? Maybe a column in the http://data.ioos.us/gliders/erddap/tabledap/allDatasets.json response called last_modified or last_updated? If I'm missing a way to extract this information from ERDDAP please let me know!

BobSimons commented 9 years ago

In ERDDAP, you should never have to scrape a web page. The information is almost always available in more computer program/script friendly formats.


Since you want the different values for the different trajectories, I think what you want is a request like this (on my ERDDAP):

http://coastwatch.pfeg.noaa.gov/erddap/tabledap/scrippsGliders.htmlTable?trajectory,time&time%3E=now-2days&orderByMax(%22trajectory,time%22)

If you want, you can change the file extension to something more suitable for parsing (e.g., .csv, .htmlTable, .json, .mat, .nc, .tsv, .xhtml).

Note that this request triggers an updateEveryNMillis event. So if you are using that, and a file has changed in the last nMillis, the changed file will be detected. So this request will be accurate as of (at worst) nMillis in the past.

Note that that result doesn't tell you which deployments have changed since some previous point in time -- you have to keep track of the previous values.
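Keeping track of the previous values could look something like the sketch below. It assumes the request is made with the `.json` extension, so the response has the ERDDAP table layout (`table.columnNames`, `table.rows`) with `trajectory` and `time` columns; the sample payload, the trajectory names, and both helper names are hypothetical.

```python
def latest_times(table_json):
    """Map trajectory -> latest time string from an orderByMax .json response."""
    table = table_json["table"]
    names = table["columnNames"]
    traj_col = names.index("trajectory")
    time_col = names.index("time")
    return {row[traj_col]: row[time_col] for row in table["rows"]}


def changed_trajectories(previous, current):
    """Trajectories that are new or whose latest time differs from last check."""
    return sorted(t for t, latest in current.items() if previous.get(t) != latest)


# Hypothetical response and stored state, for illustration only.
response = {
    "table": {
        "columnNames": ["trajectory", "time"],
        "rows": [
            ["glider_a", "2015-08-31T10:00:00Z"],
            ["glider_b", "2015-08-31T12:30:00Z"],
            ["glider_c", "2015-08-30T23:15:00Z"],
        ],
    }
}
previous = {
    "glider_a": "2015-08-31T10:00:00Z",
    "glider_b": "2015-08-31T09:00:00Z",
}

current = latest_times(response)
updated = changed_trajectories(previous, current)
```

This only notices data appended at the end of a trajectory; a correction to an older timestep leaves the maximum time unchanged.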

Does that solve the problem? If not, please tell me why not, so I can look for another answer. (There are other approaches.)


Re: lack of checksums. True. But if you have `<accessibleViaFiles>` set to true for this dataset, there is instantaneous access to the files for the dataset, including lastModified (most important) and size. E.g., this link shows you the recent files in this dataset, sorted by ascending lastModified: http://coastwatch.pfeg.noaa.gov/erddap/files/scrippsGliders/batch3/?C=M;O=A (Yes, you would have to parse this format.) But I suspect the answer above will be easier/better.

Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

kwilcox commented 9 years ago

@BobSimons Thanks for the reply, very informative. Your first solution would work in the case where I was only looking for "new" data being appended to the time variable, but I am also looking to capture updates to the dataset (for example, qa/qc refinements for past timesteps).

Your second approach would work, except the Glider DAC updates the NetCDF files that feed ERDDAP every 30 minutes or so, so the file modified time may change while the MD5 sum of the file is actually identical.
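To illustrate the difference between the two signals, here is a small sketch that compares file contents by MD5 digest instead of by modification time. It assumes local access to the files; `file_md5`, `content_changed`, and the in-memory `known_digests` dictionary are hypothetical names, not part of the Glider DAC or ERDDAP.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 of a file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_changed(path, known_digests):
    """True if the file's bytes differ from the digest recorded last time.

    known_digests maps path -> hex digest and is updated in place, so a file
    seen for the first time counts as changed.
    """
    new_digest = file_md5(path)
    changed = known_digests.get(path) != new_digest
    known_digests[path] = new_digest
    return changed
```

A file that is rewritten with identical bytes gets a new modification time but the same digest, so `content_changed` stays `False` for it.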

Using the RSS feed and checking that "The dataset was reloaded." is not present in the description seems to be my only way to gather the information... unless you have more tricks to teach?

BobSimons commented 9 years ago

You want to know if there have been changes to the underlying source files? Look at the underlying source files.

The aggregation process ignores a lot of information. For example, every source file may have a global attribute of "x", but only 1 value makes it to the aggregated dataset, and it may be fixed in datasets.xml. By their nature, aggregated datasets deal with the aggregation. If you want to know if the content of individual source files has changed, look at the source files.

kwilcox commented 9 years ago

Closing, the RSS approach, while hacky, will work.

kerfoot commented 9 years ago

The only thing I'd add to this approach is that you need to check within the specified ERDDAP dataset scan interval. For example, if a dataset is changed and you hit the RSS feed prior to the next scan time, you will see something other than 'Dataset reloaded'. But, the next time ERDDAP scans and no changes were made to the underlying files, you'll get the 'Dataset reloaded' message.

kwilcox commented 9 years ago

@kerfoot Oh boy. Theoretically someone else could generate the "changed" message every time and I could always see the "dataset reloaded" message and never know anything changed. For now I will just download every dataset every time, but that's lots of extra bandwidth going to and from AWS (your hosting) that could be avoided if something could tell me when the underlying files were actually updated.

BobSimons commented 9 years ago

I think there may be some misunderstanding about the RSS feed. It always shows the latest changed information. A request to RSS doesn't reset that info.

I see how RSS might not give you what you want. So don't use it. Use the subscription system instead and get an email/ping every time there is a change to the dataset. The subscription system is vastly more efficient (since 1 change -> 1 email/ping) than RSS, where you must poll frequently if you want to try to get every message (and even then there is no assurance you won't miss a message).

I'm clearly having a hard time understanding exactly what you want. Your request was vague because I don't know the details of the dataset, how it works, what you are looking for, or what is an "acceptable" way for you to request the info. If a phone call would help, call me.
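For reference, a sketch of building a subscription request URL. This assumes the ERDDAP server has subscriptions enabled and accepts the `subscriptions/add.html` form with `datasetID`, `email`, and an optional `action` URL to ping; those parameter names are from memory of the ERDDAP subscription form and should be checked against the server's own subscriptions page. The email address and webhook URL are placeholders, and `subscription_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def subscription_url(erddap_base, dataset_id, email, action=None):
    """Build an ERDDAP 'add subscription' URL (parameter names are assumed)."""
    params = {"datasetID": dataset_id, "showErrors": "false", "email": email}
    if action:
        # Assumed: a URL that ERDDAP contacts whenever the dataset changes.
        params["action"] = action
    return f"{erddap_base.rstrip('/')}/subscriptions/add.html?{urlencode(params)}"


url = subscription_url(
    "http://data.ioos.us/gliders/erddap/",
    "ru24-20130122T1943",
    "me@example.com",                    # placeholder address
    action="https://example.com/hook",   # placeholder webhook
)
```

As I understand it, the server then sends a confirmation email that has to be acknowledged before the subscription becomes active.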

kwilcox commented 9 years ago

I submitted two PRs to different repos to address this issue: https://github.com/ioos/glider-dac/pull/65, https://github.com/ioos/ioosngdac/pull/86