ioos / registry

Getting data services registered in the IOOS Service Registry
http://ioos.github.io/registry/

Make a decision on catalogRef following in EMMA harvesting #35

Open dpsnowden opened 10 years ago

dpsnowden commented 10 years ago

We have been dealing with the technical debt associated with one decision for quite some time, and it has introduced a whole bunch of extra work on the part of the RAs and the registry/catalog team. I want a discussion that we can all understand and that will end with a decision to either maintain the status quo or change our approach.

The decision, as I understand it, was to force the ncISO/EMMA harvesting process not to follow catalogRef elements in THREDDS catalogs. As I understand it, this decision was made at the EMMA level and is not a technical limitation of ncISO.

I'd like to challenge this decision and have a discussion about the pros and cons of turning off the limitation.

The catalogRef element is important to one of the most attractive features of the THREDDS Data Server, namely the ability to present a unified catalog of services and datasets without having to bother the user or intermediate data managers with the complexity of the underlying organization. A region or other data provider can create a virtual aggregation without having to centralize the data. They can manage the complexity of the organization in one place: the THREDDS catalog.xml file. By requiring them to register sub-catalogs, we are now asking them to manage the complexity in two places: first, in the catalog itself, and second, in the list of sub-catalogs that independently show up in a registration email, in the EMMA collection source table, in the geoportal, in the WAF, etc. This seems to be introducing fragility to a system that should be pretty robust. What am I missing?
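To make the trade-off concrete, here is a minimal sketch (all names and hrefs hypothetical) of the kind of top-level catalog at stake, and what a harvester does or doesn't see depending on whether it follows catalogRef:

```python
import xml.etree.ElementTree as ET

# Hypothetical top-level THREDDS catalog delegating to two sub-catalogs
# via catalogRef elements.
CATALOG = """<?xml version="1.0"?>
<catalog name="RA Top-Level Catalog"
         xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <catalogRef xlink:href="models/catalog.xml" xlink:title="Model Aggregations"/>
  <catalogRef xlink:href="buoys/catalog.xml" xlink:title="Buoy Observations"/>
</catalog>"""

TDS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK = "http://www.w3.org/1999/xlink"

root = ET.fromstring(CATALOG)
# A catalogRef-following harvester resolves each href against the parent
# catalog URL and recurses; one that ignores catalogRef sees no datasets
# here at all.
for ref in root.findall("{%s}catalogRef" % TDS):
    print(ref.get("{%s}title" % XLINK), "->", ref.get("{%s}href" % XLINK))
```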

What can we do about this?

  1. It seems like the main argument for turning off catalogRef following is the risk of many records that are not intended to be harvested. I'm guessing there is a different way to deal with this.
  2. Does anyone have an example of harvesting one or more IOOS THREDDS catalogs where they did follow catalogRef so that we can quantitatively compare the results?

Thoughts?

geoneubie commented 10 years ago
  1. It seems like the main argument for turning off catalogRef following is the risk of many records that are not intended to be harvested. I'm guessing there is a different way to deal with this.

If THREDDS catalogs are well curated, then it solves the problem of loading unintended datasets into the IOOS catalog. This is often not the case.

  2. Does anyone have an example of harvesting one or more IOOS THREDDS catalogs where they did follow catalogRef so that we can quantitatively compare the results?

The second issue is scalability and the length of time needed to process higher volumes of data. While there are some options here, one solution which has been suggested, and which we support, is RAs managing their own ncISO extractions from THREDDS to a WAF, with EMMA focusing on harvesting from RA WAFs.
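A concrete sketch of that RA-side "THREDDS to WAF" job, assuming the TDS has its ISO service enabled and using @kwilcox's thredds_crawler (which comes up again below); the catalog URL and output directory are hypothetical:

```python
import os
import urllib.request

from thredds_crawler.crawl import Crawl

CATALOG = "http://thredds.example.org/thredds/catalog/ioos/catalog.xml"  # hypothetical
WAF_DIR = "/var/www/waf"                                                 # hypothetical

c = Crawl(CATALOG)  # follows catalogRef elements by default
for d in c.datasets:
    iso_urls = [s.get("url") for s in d.services
                if s.get("service", "").lower() == "iso"]
    if not iso_urls:
        continue
    # One ISO record per dataset, filed into the WAF by dataset id.
    out = os.path.join(WAF_DIR, d.id.replace("/", "_") + ".xml")
    urllib.request.urlretrieve(iso_urls[0], out)
```

EMMA would then harvest the resulting WAF directory on whatever schedule the RA advertises.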

daf commented 10 years ago

If you're going to invest the energy to curate a WAF, why wouldn't you invest that energy in curating your THREDDS catalog? It's very similar, and it means one less step in the pipeline from "I have data" to "Other people can find my data". The fewer moving parts here, the less there is to break.

dpsnowden commented 10 years ago

So @geoneubie, your first point about well-curated THREDDS catalogs sounds like something we can work toward with a better definition of "well curated". Is it possible to write down a reasonable description of a well-curated catalog, or is this a fool's errand?

Today, the extent of the guidance we have written down is here: https://github.com/ioos/registry#what-if-my-thredds-catalogs-contain-catalogref-elements. It's not very helpful; all it says is don't submit top-level catalogs, only submit leaf-node catalogs. That's very different from saying: if your catalog looked like this awesome catalog, then we'd harvest from the topmost level and follow all catalogRefs. UAF has done a lot to automate this, but I don't feel the logic that is written into the catalog cleaner has ever been written down.

ebridger commented 10 years ago

@daf Re: why curate a WAF? At NERACOOS we have a catalog advertising two services on a single file: OPeNDAP and ncSOS (e.g., the B01 Met file). For some reason, never determined as far as I know, the harvest-to-EMMA process did not pick up the SOS service. ncISO does pick it up, and using the WAF import to EMMA solved the problem.
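One way to sanity-check this kind of gap is to list every service a crawler actually sees per dataset; if "sos" never shows up, no harvest built on that crawl can register it. A sketch with thredds_crawler (catalog URL hypothetical):

```python
from thredds_crawler.crawl import Crawl

# Print each dataset's advertised services, e.g. ['ODAP', 'ISO', 'SOS'].
c = Crawl("http://thredds.neracoos.example/thredds/catalog.xml")
for d in c.datasets:
    print(d.id, [s.get("service") for s in d.services])
```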

rsignell-usgs commented 10 years ago

@dpsnowden, I agree that ignoring catalogRefs was the wrong way to fix the problem of harvesting too many (or unintended) datasets. When I harvest metadata from THREDDS catalogs for the geoportal server we run here at USGS, I let ncISO follow the <catalogRef> links.

geoneubie commented 10 years ago

@dpsnowden - Well curated, to me, means:

1) Uses aggregations appropriately; individual data files that can be aggregated should be.
2) Has an organized hierarchical structure; in the IOOS context, don't mix data content. If every RA has a top-level folder called IOOS_catalog that contains only datasets for IOOS, that's good.
3) Contains useful titles.

dpsnowden commented 10 years ago

@ebridger see #36 for a discussion more related to your question.

geoneubie commented 10 years ago

@daf Fewer moving parts for you doesn't mean fewer moving parts for others. Things sometimes break at the RA level: webapp containers go down, networks go down, folks introduce unsafe XML characters. All of this has the potential to break, or significantly SLOW down, centralized processing of these distributed datasets. If RAs deliver WAFs, then RAs can choose appropriate timeframes to synchronize with their THREDDS servers. And if RAs are using multiple service providers, e.g. 52North SOS and THREDDS and some new shiny service, then RAs can ensure there are no duplicate records for their metadata, rather than pushing this task to an external centralized process.

geoneubie commented 10 years ago

Friday RANT - in resource-constrained (or even unconstrained) environments, everyone wants to push problems downstream to "make it easy". While automating where we can makes sense, that doesn't make difficult problems go away. As far as I know there are still lots of problems in UAF related to lack of aggregation, poor titles, and the need for greater human curation.

Have a good weekend, I feel better already!

daf commented 10 years ago

@geoneubie I'm actually on both sides of the GeoPortal: I consume from it, but I'm also in charge of managing registrations for all the data my company is responsible for, so when I'm arguing for ease, I'm arguing for myself too.

Your points are certainly valid re: services going down (the same can be said of a WAF going down) and re: RAs being in control of curation, but I'm also not arguing for the removal of the WAF capability - if an RA wants to do that, go for it. I just don't think it should be a requirement when the ability does/should exist to go straight from the source.

rsignell-usgs commented 10 years ago

So we all agree we have a lot of messy THREDDS catalogs. But eliminating catalogRef following doesn't address the problem at all. It doesn't deal with messy data; it just accesses less data.

It's like the blueberries I pulled out of the fridge this morning. There were a bunch of moldy ones. I could make a rule that I'm only going to eat the top layer, but there were moldy ones in the top layer too, just fewer of them. And I'd be giving up all the rest of the good berries for no reason.

Hey @geoneubie, I feel better too! :stuck_out_tongue_winking_eye:

jcothran commented 10 years ago

The thing I like about a WAF-default approach is that it moves the discussion outside of a more specific THREDDS-development-land to a more general document discussion, which might include resources other than THREDDS. I agree it would be nice to list only one WAF or THREDDS endpoint to spider down from, but I have accepted that the noise-to-signal ratio is probably too high for this to work in the practical case.

rsignell-usgs commented 10 years ago

This issue is not about the benefits of a WAF -- I think we all agree on those. And it's not an argument to submit only one THREDDS catalog either.

What we are asking for here is just that ncISO harvest all the datasets represented in the THREDDS catalogs that are submitted. Folks who have top-level catalogs that reference both granule and aggregation datasets would still be encouraged to submit child catalogs that reference only aggregation datasets instead. But folks who have a single master aggregation catalog with catalogRefs to a bunch of other aggregation catalogs could submit that for harvesting.

rsignell-usgs commented 10 years ago

I'd hate to see this discussion just die with no resolution. @dpsnowden, what do you think the next step should be?

geoneubie commented 10 years ago

Complete harvesting of THREDDS can be supported; issues that are likely to arise:

1) Length of time to process. Currently we process IOOS RA providers synchronously; we could change this to run harvests in parallel (see the sketch below), but we still should think about harvest frequency. At the end of the day, we only have 24 hours to harvest and run metrics.
2) Lowering the cost of entry has, in the past, meant less human review of datasets. What steps can we take to retain a more thorough review of catalogs? Too many moldy blueberries and consumers won't buy the basket.
3) We need to schedule harvests when THREDDS servers are operational. It is not uncommon for THREDDS servers to go down as part of a nightly backup or a bounce of Tomcat to deal with memory leaks. We have no way of knowing what these schedules are.
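On point 1, a minimal sketch of running per-provider harvests in parallel rather than one after another; the provider URLs, worker count, and use of thredds_crawler as a stand-in harvest step are all hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

from thredds_crawler.crawl import Crawl

# Hypothetical provider catalogs; in practice this would come from the
# EMMA collection source table.
PROVIDERS = [
    "http://ra-one.example/thredds/catalog.xml",
    "http://ra-two.example/thredds/catalog.xml",
]

def harvest(url):
    """Stand-in for one provider's harvest; returns a dataset count."""
    return len(Crawl(url).datasets)

# Run providers concurrently so one slow or down TDS doesn't consume
# the whole 24-hour harvesting window.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(harvest, url): url for url in PROVIDERS}
    for f in as_completed(futures):
        print(futures[f], "->", f.result(), "datasets")
```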

The advantage of a WAF-centric approach is that it:

1) Enables RAs to schedule harvests at times that make sense for their organization.
2) Distributes the processing load.
3) Reduces the likelihood of duplicate records for the same dataset from multiple services (SOS, ERDDAP, THREDDS).

We are happy to support a decision either way so long as we have a common understanding of the pros and cons.


kwilcox commented 10 years ago

We need to crawl catalogRef elements. They are an integral part of the THREDDS catalog spec.

Can we apply some simple regex rules while crawling THREDDS catalogs? This would avoid harvesting all of the individual members of THREDDS collections. See https://github.com/kwilcox/thredds_crawler#skip for examples. This could be a 90% solution (not a bad thing). THREDDS catalogs with manual directory scans that are not part of a THREDDS collection will still be a problem. To alleviate those, we first need to identify them (arguably, that should already be done). Then we (read: @rsignell-usgs) can contact the THREDDS maintainer(s) and have them put a pre-defined string in the problematic directory name (e.g., "Manual Scan" or "Manual Entry"), and NGDC can avoid crawling datasets that match the predefined regex. Is it a perfect solution? No. But it gets this train moving forward from the station it's been stuck at for a looong time.
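A minimal sketch of that approach (catalog URL hypothetical; Crawl.SKIPS is the library's built-in skip list, documented at the link above, and the extra patterns are the pre-defined strings proposed here):

```python
from thredds_crawler.crawl import Crawl

# Extend the default skip list with the proposed marker strings so that
# manually scanned directories are never crawled.
skips = Crawl.SKIPS + [".*Manual Scan.*", ".*Manual Entry.*"]

c = Crawl("http://thredds.example.org/thredds/catalog.xml", skip=skips)
print(len(c.datasets), "datasets would be harvested")
```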

As long as the RAs know what to expect, I don't see many problems.

1.) Put the harvest schedule here: https://github.com/ioos/registry. It can't be a simple "we start the harvest at 3:00 AM and it finishes whenever"; it needs to be broken down by RA, with specific times when each needs to be sure its THREDDS catalog is alive and kicking.
2.) Document the predefined string(s) that NGDC will ignore when crawling datasets and catalogRefs.
3.) Celebrate.

I'd be happy to crawl any THREDDS catalogs (following catalogRefs) for NGDC before any of this is implemented to get a general idea of how many additional datasets NGDC should expect to encounter.

dpsnowden commented 10 years ago

It looks like we're making headway. There seems to be general consensus that having each RA manage a WAF of metadata that they generate, curate, and keep up to date is the ideal solution. But this is not always practical, and even if every RA and federal data provider is willing to do this, there is still a need for more practical guidance on how to do it. I added a short outline of things that I feel are unclear to the main wiki landing page: https://github.com/ioos/registry/wiki

Can we create a wiki page called "Managing Your Own WAF" and list the steps to do this? I know @ebridger has circulated a script or two, and I believe @kwilcox has too. Can someone write up a short HOWTO? @robragsdale, can you collect these scripts and short write-ups and manage the wiki page? All: if you have any info related to this, please send it to Rob or point him to it.

Regardless of how this evolves, I think that augmenting our harvesting strategy requires some conversation about #36 too. Any takers there?

geoneubie commented 10 years ago

@kwilcox In the past we discussed using harvest=true in the THREDDS metadata; I think this would still be preferable to a regex.
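For reference, the THREDDS catalog schema allows a harvest attribute on dataset elements; a harvester honoring it might look like this sketch (catalog fragment hypothetical):

```python
import xml.etree.ElementTree as ET

TDS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical catalog fragment: only the first dataset opts in.
XML = """<catalog xmlns="%s">
  <dataset name="SST Aggregation" ID="sst-agg" harvest="true"/>
  <dataset name="Scratch Files" ID="scratch"/>
</catalog>""" % TDS

root = ET.fromstring(XML)
# Keep only datasets that explicitly opt in with harvest="true".
wanted = [d.get("ID") for d in root.iter("{%s}dataset" % TDS)
          if d.get("harvest") == "true"]
print(wanted)  # ['sst-agg']
```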

dpsnowden commented 10 years ago

Is it possible to turn off the catalogRef limitation on a case-by-case basis? The following CDIP catalog has hundreds of datasets that @robragsdale has been attempting to register by hand. I think it would qualify as one of @geoneubie's well-curated catalogs, and it may be worth following its catalogRef entries, at least as a trial.

http://thredds.cdip.ucsd.edu/thredds/catalog/cdip/archive/catalog.xml
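A trial along those lines could start with a dry-run crawl (thredds_crawler again; the count is unknown until it is actually run):

```python
from thredds_crawler.crawl import Crawl

# Dry-run crawl of the CDIP archive catalog above, following
# catalogRefs, to see how many datasets would be registered.
c = Crawl("http://thredds.cdip.ucsd.edu/thredds/catalog/cdip/archive/catalog.xml")
print(len(c.datasets), "datasets found")
for d in c.datasets[:5]:
    print(d.id, "-", d.name)
```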

geoneubie commented 10 years ago

@dpsnowden Yes, we'll take a look at it in our next sprint, which begins on Thursday.

Dave

robragsdale commented 10 years ago

@dneufeld-ngdc Did you take a look at the CDIP catalog in your mid-July sprint?
If you have any information that can be added to the wiki page, please send it to me, @robragsdale. @dneufeld-ngdc: @kwilcox commented that harvesting times should be posted in detail. Can you break those down for posting?

dneufeld-ngdc commented 10 years ago

Rob,

Unfortunately no.

On the plus side, we added a new developer on Aug 1 and have two more starting Sep 1.

The story is in the backlog (DOCMAN-615), and I'm cc'ing Anna and Dave Fischman so that they can help keep this on our radar and address it soon!

Thanks for your patience, Dave
