legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance
Other
1 stars 0 forks source link

Cercis/canadensis/annotations/ISC453364.gnm1.ann1.HZJM/cerca.ISC453364.gnm1.ann1.HZJM.gene_models_main.gff3.gz #169

Closed adf-ncgr closed 1 month ago

adf-ncgr commented 1 year ago

not only does it not have AHRD, it doesn't have gene records (mRNA records are at the top). Not sure if anyone cares enough about this one given that we have a newer genome version that supplants it.

StevenCannon-USDA commented 1 year ago

@adf-ncgr - yeah, this one is here just for historical purposes; was used in the prominent Greissmann et al. 2018 paper. @sammyjava - is there a way of flagging noncompliant collections to be excluded from the Mines? Alternatively, we could move things like this to "private" - which sort of invites the question "If there is a gnm2, where is gnm1"?

sammyjava commented 1 year ago

@adf-ncgr - yeah, this one is here just for historical purposes; was used in the prominent Greissmann et al. 2018 paper. @sammyjava - is there a way of flagging noncompliant collections to be excluded from the Mines? Alternatively, we could move things like this to "private" - which sort of invites the question "If there is a gnm2, where is gnm1"?

Well the policy has been, and I think should continue to be, that we try to have complete, valid collections in the public Datastore. Having a gene-less gnm1 is likely to cause more harm than good. People like genes. So I'd request that we move incomplete collections like this to private. The answer to "Where is gnm1?" is "It's broken."

But in terms of "is there a way" - any collection that fails mine validation doesn't get loaded. So if I require that the gff contain genes, this annotation collection would get skipped with an error message. At which point I normally create an issue here asking you or Andrew to fix it.

Allowing incomplete collections in the public Datastore leads to Bad Places. Like not caring whether the collections are complete or validate. I'm already having to hound people about collections.

I had accepted that some annotations didn't include AHRD as a thing that exists, but now with gene search running off LegumeMine I think we should address those deficiencies as well.

So, I plan to add "Has genes" and "genes have GO terms" to the annotation GFF validation. That keeps the push on fixing that kind of stuff, which is how we've been getting the Datastore into shape.

StevenCannon-USDA commented 1 year ago

@sammyjava I buy that. In the case of Chamaecrista: that's in the process of growing up into a proper assembly + annotation now. The first iteration was pretty bad, and arguably the world will be marginally better when it is replaced. (Fun fact: the tissue for the new genome came from my garden, ultimately from seed we brought with us from Minnesota.) I'll move the incomplete Chamaecrista and Cercis assemblies and annotations to the side later today.

adf-ncgr commented 1 year ago

While I am totally on board with deprecating old collections so we don't have to upgrade them whenever we make new requirements, I do have some hesitation about just squirreling them away back in private where no one outside the group can access them at all. Maybe we could (eventually) have some sort of archival location (or perhaps just use the "annex") that we could redirect their URLs to? Think of the poor grad students who spent years working to refute Greissmann et al., told by reviewers that it's clear they wrote their paper using ChatGPT due to the presence of hallucinated ISC453364.gnm1.ann1 ! Just the reproducible angel on my shoulder talking here...

sammyjava commented 1 year ago

I think having a publicly-accessible place for incomplete/archival collections is fine. Then it's clear that they're incomplete, buyer beware, etc. That's something to bring up with @nathanweeks and others, I guess.

nathanweeks commented 1 year ago

Putting deprecated/archival collections in the annex where nobody can stumble upon them, but where they can still be accessed, sounds fine, provided there's a way to redirect to the new location? (Thinking out loud: perhaps mv'ing the collection directory to the annex, creating a symbolic link in its place---so old URLs continue to function---and configure h5ai to not list symbolic links, so a newcomer can't stumble upon it and inadvertently use it?)

adf-ncgr commented 1 year ago

Thanks, @nathanweeks that sounds like an elegant and relatively simple solution to me. One additional thing we'll probably want to do is to remove the metadata for these collections from the github repo (the metadata will of course remain with the collections in their new locations).

sammyjava commented 1 year ago

Thats key for me. If it's not in datastore-metadata then I don't load it. I pull the metadata and then load the collections that go with them. Conversely, collections with valid metadata should be considered to be valid collections.

nathanweeks commented 1 year ago

Thanks, @nathanweeks that sounds like an elegant and relatively simple solution to me. One additional thing we'll probably want to do is to remove the metadata for these collections from the github repo (the metadata will of course remain with the collections in their new locations).

Good point. I seems the addition of a replacement symbolic link to the datastore-metadata repo could cause technical difficulties, so the symbolic link could just exist in the datastore file system.

nathanweeks commented 1 year ago

I'm not seeing a built-in mechanism in the h5ai options.json to prevent displaying symlinks. A couple options:

  1. Create the symlink, and explicitly specify it as a hidden file in the h5ai options.json,
  2. Instead of using a symlink, specify an HTTP redirect (or similar) in an httpd configuration file.
StevenCannon-USDA commented 1 year ago

Thinking about this over the weekend, I am having my doubts about the idea of moving collections to the side and putting symlinks in their place. We did something like this during the big DS restructuring in 2021 (creating symlinks in the old location), but most of the links went stale pretty quickly for various reasons (primarily other renaming and moves I think).

I'd like to float another idea: adding something like a "no_load" flag to the metadata or the collection. This could be an optional boolean in the README, or a hidden "dot" file (.no_load). The collections would remain where they are, but would just be skipped over during Mine loading.

adf-ncgr commented 1 year ago

Something like that seems reasonable to me; though maybe instead of a boolean it could have some explanatory text for humans but it would just be the presence/absence of the attribute that loaders care about?

sammyjava commented 1 year ago

Do we really need to show stuff that's NOT in the Datastore on the Datastore web browser? This seems like adding an unnecessary layer of complexity. Why not just have an archive browser? If it's not in the Datastore, look in the archive. Seems like a solution in search of an actual problem to me.

sammyjava commented 1 year ago

I also wonder if a user gets redirected to the archive from the "valid" Datastore that they have any idea they've gone to a place where annotations lack genes. :)

sammyjava commented 1 year ago

Thinking about this over the weekend, I am having my doubts about the idea of moving collections to the side and putting symlinks in their place. We did something like this during the big DS restructuring in 2021 (creating symlinks in the old location), but most of the links went stale pretty quickly for various reasons (primarily other renaming and moves I think).

I'd like to float another idea: adding something like a "no_load" flag to the metadata or the collection. This could be an optional boolean in the README, or a hidden "dot" file (.no_load). The collections would remain where they are, but would just be skipped over during Mine loading.

No need. I don't load collections that fail validation. And I'm forcing gene_models_main.gff3 to contain genes and GO terms from now on. No need to flag it (which is unreliable given our track record). So from now on, even if I list ISC453364 in project.xml, the annotations won't get loaded. (The genome will, if it passes muster.)

But as I've said, I'd just as soon not have an incomplete collection like that in the Datastore at all. I don't think it meets the expected standard of the LIS Datastore, which is that an annotation contains gene models.

sammyjava commented 1 year ago

Something like that seems reasonable to me; though maybe instead of a boolean it could have some explanatory text for humans but it would just be the presence/absence of the attribute that loaders care about?

Seriously: we need to stop imagining that people that put data in the LIS Datastore are careful and dot all the i's and cross all the t's. Almost 100% of the time I have to fix a new README. Let us not add more tasks for our curators that will be ignored or applied incorrectly!

adf-ncgr commented 1 year ago

I don't think this is something that would typically be part of a new README. This is for the case where we've decided an old dataset is not worth trying to pull into conformity with the standards as they exist now. Any of the proposed solutions involve someone taking an action on that dataset, the question is just what that action should be:

Bear in mind that this isn't just about what goes into the mines, there are other things like GCV that could certainly load datasets without functional characterizations (the Note attributes that started this thread). This is about a mechanism for telling all our internal systems "don't bother" and external users "don't expect too much".

StevenCannon-USDA commented 1 year ago

Regarding the second bullet ("move it to archive (only people who already know it exists can see)") - is there a URL path for accessing data in the annex? That would, I think, give a reasonable fall-back option -- i.e. we move a deprecated or deficient collection there; and if someone says "hey - where'd it go", we can give them the address.

adf-ncgr commented 1 year ago

Yes, the folders are not browseable because not indexed but if you know a URL under: https://data.legumeinfo.org/data/annex/ you can access the data from it. The thing I mildly dislike about this approach is that I'd guess most of our users are more likely to grumble and just not ask us about it. But I can live with that.

sammyjava commented 1 year ago

Well, FWIW I'm out of this conversation because I'll continue to rely on my "own" validation to include collections in the mines. If stuff is in the LIS Datastore that doesn't pass muster for mine loading it doesn't get loaded. (And I'll likely post an issue to datastore-issues.)

StevenCannon-USDA commented 1 year ago

(Feel free to ignore, @sammyjava) I have moved the Cercis gnm1 assembly and annotations over to "annex". In doing so, another possible low-tech solution occurs to me (to alert the rare user of the change): to leave the directory in place, with a "dataset_moved.txt" file. The problem: currently, directories are not browsable in the annex; only full paths to files. @nathanweeks - is it possible to configure the annex to permit users to see terminal directories - for example, /usr/local/www/data/annex/Cercis/canadensis/annotations/ISC453364.gnm1.ann1.HZJM

sammyjava commented 1 year ago

I like it! Don't forget to commit/push datastore-metadata.

    deleted:    Cercis/canadensis/annotations/ISC453364.gnm1.ann1.HZJM/BUSCO/cerca.ISC453364.gnm1.B05Z.busco.fabales_odb10.short_summary.json
    deleted:    Cercis/canadensis/annotations/ISC453364.gnm1.ann1.HZJM/CHECKSUM.ISC453364.gnm1.ann1.HZJM.md5
    deleted:    Cercis/canadensis/annotations/ISC453364.gnm1.ann1.HZJM/README.ISC453364.gnm1.ann1.HZJM.yml
    deleted:    Cercis/canadensis/genomes/ISC453364.gnm1.B05Z/BUSCO/cerca.ISC453364.gnm1.B05Z.busco.fabales_odb10.short_summary.json
    deleted:    Cercis/canadensis/genomes/ISC453364.gnm1.B05Z/CHECKSUM.ISC453364.gnm1.B05Z.md5
    deleted:    Cercis/canadensis/genomes/ISC453364.gnm1.B05Z/README.ISC453364.gnm1.B05Z.yml
StevenCannon-USDA commented 1 year ago

@sammyjava - done. I'll hold off on moving Chamaecrista until we have the new annotations from JGI. Leaving the issue open until we've determined possibilities for (limited) browsing in the annex.

adf-ncgr commented 1 year ago

I'd be fine with just making the annex fully browsable (possibly just via simple apache indexing and excluded from the h5ai magic).

nathanweeks commented 1 year ago

@nathanweeks - is it possible to configure the annex to permit users to see terminal directories - for example, /usr/local/www/data/annex/Cercis/canadensis/annotations/ISC453364.gnm1.ann1.HZJM

Replacing a collection with a stub seems acceptable. I'll note that without an HTTP redirect, any existing URLs to specific files will break---but that might be a feature rather than a bug.

I'd be fine with just making the annex fully browsable (possibly just via simple apache indexing and excluded from the h5ai magic).

If that's acceptable, that's probably simpler to implement than making terminal directories visible (which should also be doable). In either case, it would probably be advisable to have some kind of text (e.g., for a file specified in a HeaderName directive) so users know they're off the beaten path.

StevenCannon-USDA commented 1 year ago

I am also OK with making the annex fully browsable.

nathanweeks commented 1 year ago

I am also OK with making the annex fully browsable.

Done. Default ReadmeName is README.html; default HeaderName is HEADER.html.

adf-ncgr commented 1 month ago

I think there's nothing more to do on this issue, but feel free to reopen if that's an error in judgment.