martenson / public-galaxy-servers

Machine-readable list of public Galaxy servers & utilities, visualized:
https://stats.galaxyproject.eu/d/000000020/public-galaxy-servers?orgId=1
MIT License

Programmatically Generate this list from the Public Galaxy Server List page #10

Closed: tnabtaf closed this issue 11 months ago

tnabtaf commented 7 years ago

@martin, @erasche, @bebatut,

Since we migrated the public Galaxy server list on the hub from a monolithic web page to a directory-based approach, I think it would be easy to programmatically generate this CSV from that directory structure (a rough sketch follows the list of current columns below). Here's how:

Current columns

name

This is title in the server page metadata

url

This is url in the server page metadata. Would require checking that every one of these actually points to the server. (I think they do - I'll be visiting every page anyway.)

support

As far as I can tell, these are all email addresses. These do not exist in the current metadata, although sometimes they are in the User Support section of the page content.

Are these all supposed to be a single email address? Are there other options we could do here, like a semicolon separated list of emails, or a URL?

See email_contacts below.

location

This is a standard two-letter country code.
See home_country_code below.

tags

I was thinking about adding tags to the server pages and I asked @dannon to look into metalsmith support for tags, but I also told him it was an unimaginably low priority. We can support tags in the page metadata before we do anything with them in the hub. Some of the tags are already on the pages, but with a different name:

server_group: "general"

There are three groups: general, domain, and tool-publishing. general maps to genomics, and tool-publishing maps to tools. Those two are easy.

Domain-specific tags like phage aren't currently supported in the hub.

See tags below.
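To make the mapping concrete, here is a rough, hypothetical sketch (not the hub's actual build code) of turning one server page's frontmatter into a CSV row. It assumes the pages keep title, url, and server_group in their YAML frontmatter and eventually grow the email_contacts and home_country_code fields proposed below:

```python
# Hypothetical sketch (not the hub's actual build code): map one server
# page's YAML frontmatter onto the current CSV columns.
import yaml  # PyYAML

# server_group -> tag translation described above. 'domain' servers would
# instead carry their own domain-specific tags (e.g. phage) once the hub
# supports them; for now we fall back to the raw group name.
GROUP_TO_TAG = {
    "general": "genomics",
    "tool-publishing": "tools",
}

def frontmatter(path):
    """Return the YAML frontmatter of a hub server page as a dict."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    # Assumes the usual '---' delimited frontmatter at the top of the file.
    _, fm, _ = text.split("---", 2)
    return yaml.safe_load(fm)

def to_csv_row(meta, info_page_url):
    """Translate page metadata into the columns discussed above."""
    group = meta.get("server_group", "")
    return {
        "name": meta["title"],
        "url": meta["url"],
        # 'support' and 'location' do not exist in the metadata yet; the
        # proposed email_contacts / home_country_code fields would feed them.
        "support": ";".join(meta.get("email_contacts", [])),
        "location": meta.get("home_country_code", ""),
        "tags": GROUP_TO_TAG.get(group, group),
        "info_page": info_page_url,
    }
```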

Proposed new columns and metadata

info_page, in CSV

URL of the server's information page on the hub.

email_contacts, in Hub

Copied from support in CSV.

home_country_code, in Hub

Copied from location in CSV.

But ...

Country codes are not as informative as country names

Displaying "DK" in the hub is not informative. But, country names are ambiguous and 5 names can map to one country.

What say ye?

More location?

@bebatut and I have discussed having Event locations be free-form text, but specific enough that we could pass the string to a mapping service and get back a geolocation.

Should we do that with location, or is country all we'll ever care about (or all we care about now :-)?

I'm OK with country code for now.

I just don't want to display it, and it's easy to change this programmatically later if we want to go there.

tags, in Hub

Initially populated from tags in CSV. Combined with server_group when updating tags in CSV.

Mixed Model

We don't have to go fully one way or the other. We could use a mixed model where the file can be both programmatically and manually updated. The program would read in the CSV first, and then update information in place. It would report on any updates it did, and on anything that's in the CSV, but not in the Hub.

Differences would be reconciled before the new CSV is pushed.
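A minimal sketch of that mixed model, assuming the CSV is keyed by name and the hub pages have already been parsed into dicts using the same column names (both assumptions, not the actual tooling):

```python
# Minimal sketch of the mixed model: read the existing CSV, update rows in
# place from the hub, and report anything in the CSV that the hub lacks.
# Column names and the use of 'name' as the key are assumptions.
import csv

def reconcile(csv_path, hub_rows, out_path):
    """hub_rows: iterable of dicts using the same column names as the CSV."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        fieldnames = list(reader.fieldnames or [])
        existing = {row["name"]: row for row in reader}

    hub_by_name = {row["name"]: row for row in hub_rows}
    for row in hub_by_name.values():
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # e.g. the proposed info_page column

    for name, hub_row in hub_by_name.items():
        old = existing.get(name, {})
        changed = {k: (old.get(k), v) for k, v in hub_row.items() if old.get(k) != v}
        if changed:
            print(f"updated {name}: {changed}")
        existing[name] = {**old, **hub_row}

    for name in sorted(set(existing) - set(hub_by_name)):
        print(f"in the CSV but not in the hub: {name}")

    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(existing.values())
```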

hexylena commented 7 years ago

programmatically generate this CSV

We definitely need to reconcile the multiple different sources, each with a different set of Galaxies.

([rant] Personally I strongly prefer one giant yaml file to a directory of small ones which require parsing. The sheer thought of splitting nicely structured data (which could have been validated against a schema) into a structured + unstructured component mixed together in markdown, and then splitting that into 1000 tiny directories is giving me a headache.)

I'd be happy if we did the right thing in this repo and then hub could consume it + post-process it in whatever way it wants to.

Alternatively if hub is open to some changes, then we could use that as the source of all truth. I can make a PR there to refactor their server list into something easier for everyone else to consume.

Event locations be free form text, but be specific enough that we could pass the string to a mapping service, and it would return some geolocation.

I'd argue strongly against this. That ties us to a geocoding service and makes the data literally unusable without it. Using ISO country / administrative division codes is a much better solution since it doesn't require making requests to a third-party service for what could be static data. (I volunteer to manually do the geocoding for whatever each user enters for their instance, if users aren't happy to look up the appropriate code themselves.)

Support - Are these all supposed to be a single email address? Are there other options we could do here, like a semicolon separated list of emails, or a URL?

A list of emails/URLs would probably be good.

tags

Would love these.

mixed model

I think the KISS principle applies here. There should be one and only one authoritative source for this data, and manual updates / anything beyond the absolute minimum amount of scripting should probably be avoided.

Displaying "DK" in the hub is not informative. But, country names are ambiguous and 5 names can map to one country.

I'm sure we can just pull the list of ISO 3166-1 alpha-2 codes and decode these into more "human friendly" names? Or maybe I'm misunderstanding the comment.

Should we do that with location, or is country all we'll ever care about (or all we care about now :-)?

We could switch to ISO 3166-2 and that'd be fine, but it's more work for the admin / author. I'd personally like it since it'd make the map page even more attractive / detailed. I don't care that much myself, and I'm not sure how much the end user cares. This is probably more important for the GTN map than it is for "where are the servers".

tnabtaf commented 7 years ago

@erasche, thanks for the comments.

([rant] Personally I strongly prefer one giant yaml file to a directory of small ones which require parsing. The sheer thought of splitting nicely structured data (which could have been validated against a schema) into a structured + unstructured component mixed together in markdown, and then splitting that into 1000 tiny directories is giving me a headache.)

The 1000 tiny directories is the metalsmith way. I won't defend that, or fight that.

I did think about putting everything on the server pages in YAML and started down that path, but the yaml markup quickly became unreadable, so I backed off of that. Everything is still tightly structured in the Markdown content (for now :-), so we could reconsider that decision. Keep in mind that the more information we move into YAML, the less friendly editing becomes for non-YAML experts.

I dunno, maybe if I had a clean example YAML file to go off of then people could just copy that when adding a new server. (Eric, if you feel so inclined, pick a random server page, and convert the whole thing to YAML. I'm game if I have a template to work from - and, for now, I could still make this change programmatically.)

I'd be happy if we did the right thing in this repo and then hub could consume it + post-process it in whatever way it wants to.

I need to get used to that idea. I doubt that CSV is robust enough to do this, but if we had a YAML template, it would be doable. It takes this information out of the Hub repo, and generally I'd be against that. However, this list seems important enough to justify that.

Alternatively if hub is open to some changes, then we could use that as the source of all truth. I can make a PR there to refactor their server list into something easier for everyone else to consume.

I think we are. Send me that template. :-)

Event locations be free form text, but be specific enough that we could pass the string to a mapping service, and it would return some geolocation.

I'd argue strongly against this.

I figured you might, and I withdraw all suggestions beyond the two-letter country code. I'll make figuring out how to translate that to a country name a low priority.

mixed model

I think KISS principle applies here.

I'm good with that too. I was just worried about the multiple sources that were given as sources for this file.

hexylena commented 7 years ago

First, apologies for the snippiness in the first comment. I spent the majority of my day writing schemas to validate YAML files, which fed some of my biases / strong opinions, as I found numerous issues in our data.

Keep in mind that the more information we move into YAML, the less friendly editing becomes for non-YAML experts.

Yeah, I always struggle with this. I'd personally rather the data be strongly validated and programmatically accessible than human editable. But that's not a useful opinion, given that it doesn't match the audience who will be editing these.

template

Here are two examples. It doesn't have to be anything fancy; sticking markdown in blocks is really OK from my perspective (or we could use the existing schema with link_text and link_url if that works better for other reasons). I.e., it's the metadata contributors already have to write, but with the other stuff stuck in there too, so we can actually parse it instead of having to parse out the markdown.

---
cpt_server:
  title: "Center for Phage Technology (CPT)"
  url: "https://cpt.tamu.edu/galaxy-public/"
  server_group: "domain"
  server_links: 
    - "[Center for Phage Technology (CPT) Galaxy Server](https://cpt.tamu.edu/galaxy-pub/)"
    - "[CPT Home Page](https://cpt.tamu.edu/)"
  summary: "Phage biology and automated annotation. "
  image: "/src/public-galaxy-servers/CPTLogo.png"
  comments: |
     * Server includes many genbank and gff3 processing tools, largely focused on annotation of phages.
  user_support:
    - "[FAQ](https://cpt.tamu.edu/galaxy-faq-ever-needed-a-question-answered/)"
    - "[Email: Cory Maughmer](mailto:cory.maughmer@tamu.edu)"
  quotas:
    unregistered: 50 MB
    registered: 10 GB
    notes: The administrator can increase your quota on request.
  sponsors: 
    - "[Center for Phage Technology (CPT)](https://cpt.tamu.edu/)"
    - "Texas A&M University"

gvl_qld:
  title: "GVL QLD"
  url: "http://galaxy-qld.genome.edu.au/"
  server_group: "general"
  server_links: 
    - "[Genomics Virtual Lab GVL-QLD](http://galaxy-qld.genome.edu.au/)"
  summary: "General purpose Galaxy based on the [Genomics Virtual Lab platform](https://genome.edu.au/). "
  image: "/src/public-galaxy-servers/GenomicsVirtualLab300.png"
  comments: | 
    * Has 16 virtual CPUs.
  user_support: 
    - "[GVL Help](https://www.gvl.org.au/)"
    - "Follow tutorials at [GVL Learn](https://www.gvl.org.au/) and use [Galaxy Tut](http://galaxy-tut.genome.edu.au/)"
  quotas: 
    unregistered: 5 GB
    registered: 100 GB
    notes: |
      * University of Queensland and collaborators: 1TB
      * Other Australian Researchers: 600GB (make sure you register with your Institute email address)
  sponsors:
    -  "[Genomics Virtual Lab](https://genome.edu.au/)"
    -  "[University of Queensland Research Computing Centre](http://www.rcc.uq.edu.au/)"

And for fun here's a kwalify/pykwalify schema which can be used to validate that data.

---
type: map
mapping:
    "=":
        type: map
        mapping:
            title:
                type: str
                required: true
            url: # Could use a validator on url syntax.
                type: str
                required: true
            server_group: # Ensure no one has typo'd the server_group
                type: str
                required: true
                enum: ['general', 'domain', 'tool-publishing']
            server_links: # Could validate against link_text/link_url as well.
                type: seq
                sequence:
                    - type: str
            summary:
                type: str
            image:
                type: str
            comments:
                type: str
            user_support:
                type: seq
                sequence: # Could validate against link_text/link_url as well.
                    - type: str
            quotas:
                type: map
                mapping:
                    # We could make this more advanced with regex /
                    # requiring specification in MB or GB and making it
                    # an integer.
                    unregistered:
                        type: str
                    registered:
                        type: str
                    notes:
                        type: str
            sponsors:
                type: seq
                sequence:
                    - type: str
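If we went this route, validation could be wired into CI with something like the following sketch using pykwalify's Python API (the file names here are assumptions):

```python
# Sketch: validate the servers YAML against the schema above with pykwalify.
# The file names (servers.yaml, schema.yaml) are assumptions.
from pykwalify.core import Core

core = Core(source_file="servers.yaml", schema_files=["schema.yaml"])
core.validate(raise_exception=True)  # raises SchemaError describing any violations
```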

I need to get used to that idea. I doubt that CSV is robust enough to do this, but if we had a YAML template, it would be doable. It takes this information out of the Hub repo, and generally I'd be against that. However, this list seems important enough to justify that.

Oh, it definitely isn't. If we did it here, I'd insist on converting the CSV to YAML. The CSV renders nicely in GitHub, but who cares? GitHub doesn't even give you a table-based editor for it.

I'll make figuring out how to translate that to a country name a low priority

They don't change very often; we could really just hardcode the list for lookup. They're 2-letter ISO country codes here because that's what the world map plugin uses in Grafana (https://grafana.com/plugins/grafana-worldmap-panel). We could easily switch to lat/lon or 3-letter country codes. Anything else I'd have to post-process into country codes, which I could live with if need be.
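Since the codes rarely change, a hardcoded lookup table (or a small library such as pycountry) would cover this without any third-party service. A sketch of the idea, with only a few entries shown:

```python
# Sketch: decode ISO 3166-1 alpha-2 codes into display names for the hub.
# Only a few entries shown; the full list would be hardcoded once, or taken
# from a library such as pycountry.
COUNTRY_NAMES = {
    "AU": "Australia",
    "DE": "Germany",
    "DK": "Denmark",
    "US": "United States",
}

def country_name(code):
    # Fall back to the raw code if we have not mapped it yet.
    return COUNTRY_NAMES.get(code.upper(), code)

print(country_name("DK"))  # -> Denmark
```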

tnabtaf commented 6 years ago

@erasche: I think this would be workable. At the time I tried to do this, I didn't know how to get markdown in the YAML to work. Now, I know how to do this.

If this can wait until the last week of October/first week of November, then I can work on the translation in the hub. I can also go through and add Citations sections at that time (a manual process).

hexylena commented 6 years ago

Oh! I didn't realise that was the issue, that metalsmith has so much difficulty with things like markdown in YAML. I'm really used to Jekyll and other systems in which this is really normal / well supported.

Again, I'm sorry; I don't mean to be difficult. I have strong preferences, but these should be tempered with others' opinions. Don't let me go pushing things on y'all just because they make sense to me (and not necessarily to the silent majority).

Of course it can wait; zero rush on any of this.

dannon commented 6 years ago

I didn't realise that was the issue, that metalsmith has so much difficulty with things like markdown in YAML. I'm really used to Jekyll and other systems in which this is really normal / well supported.

Just have to push back on the perceived failure of metalsmith here; it's really not a fair judgement. It's perfectly well-supported and normal in metalsmith, too -- we just have the build pipeline set up to not automatically attempt to convert data in yaml fields because that's not the common case for us. The markdown is generally in the markdown, the yaml data is in the frontmatter. For when we do have markdown in the yaml frontmatter, @tnabtaf now knows how to do it, when we want to do it.

hexylena commented 6 years ago

Is there no equivalent concept of site data like there is in jekyll / hugo / other static site generators? Just a folder where you dump yaml files that are used in templates, etc.?

dannon commented 6 years ago

Sure. Keep in mind that metalsmith is very DIY, maximum flexibility to do basically anything you want.

We use yaml files for the menu, for example (https://github.com/galaxyproject/galaxy-hub/blob/master/src/config/menu.yaml). Which is loaded here: https://github.com/galaxyproject/galaxy-hub/blob/master/build.js#L190

But basically all of the other data on the site is in per-object yaml frontmatter of composite markdown files. This makes it way easier to deal with individual content items, instead of digging through large comprehensive yaml blobs.

hexylena commented 6 years ago

Cool, thanks. That's good to know.

tnabtaf commented 6 years ago

Note for future reference: @tnabtaf's ignorance of most things doesn't say a thing about most things.

I'm trying to catch up! :-)

tnabtaf commented 6 years ago

Hi All,

I haven't forgotten about this and I plan to start working on it in about 10 days. I thought I would update the thread. @jxtx had a conversation with someone about adding domain tags to the server descriptions.

I'm all for this if we can identify ontologies that cover our bases. As I see it, there are three general domains:

  1. Parts of the tree of life: Whale Shark! Viruses, etc.
  2. Disciplines: Genomics, Computational Chemistry, Social Science, etc.
  3. Methodologies: RNA-Seq, machine learning, CLIP-Seq, etc.

I haven't worked with ontologies for years, but I'll do some research when I get to this. My vague plan is to identify ontologies that cover these three areas and use their terms as tags in the server page metadata.

Tag IDs and text would be displayed and would link to a URL.

If you see better ways to do this, please post here by, say, November 8.

hexylena commented 6 years ago

:+1: ontologies

martenson commented 11 months ago

done, thanks to y'all