legumeinfo / graphql-server

A GraphQL server that supports the Legume Information System and related biodata websites
Apache License 2.0
2 stars 3 forks source link

5.1.0.3 model changes #67

Closed sammyjava closed 11 months ago

sammyjava commented 1 year ago

This is the PR to update to the 5.1.0.3 mine model. I am about to promote LegumeMine 5.1.0.3 to production, so I'd like to merge these updates into main when I do that.

The changes conform to the mine changes listed in https://github.com/legumeinfo/mine-issues/issues/109

Testing can be done against the dev MiniMine or dev LegumeMine.

sammyjava commented 1 year ago

No changes are required on jekyll-legumeinfo or web-components, it's all back-end stuff. Just have to make sure the graphql-server is hitting the mine of the same version. (Here comes the need to enforce that with the web-properties query.)

alancleary commented 1 year ago

@sammyjava Can you explain the distinction between a gene and a pan-gene? I'm mostly wondering if the panGenes field in the Gene type will only have a value if the gene is a pan-gene. If this the case, then it might make sense to make gene and pan-gene distinct types in the GraphQL API, where only the pan-gene type has the panGenes field.

alancleary commented 1 year ago

Just have to make sure the graphql-server is hitting the mine of the same version. (Here comes the need to enforce that with the web-properties query.)

I was working on this last month and then got distracted. I'll try and circle back on it for this release.

sammyjava commented 1 year ago

@sammyjava Can you explain the distinction between a gene and a pan-gene? I'm mostly wondering if the panGenes field in the Gene type will only have a value if the gene is a pan-gene. If this the case, then it might make sense to make gene and pan-gene distinct types in the GraphQL API, where only the pan-gene type has the panGenes field.

There isn't anything called a "pan-gene" in the mine, the pan-gene sets are just buckets of genes, akin to gene families, as Steven described on Friday. So they're pretty minimal, at least so far. The Gene.panGenes attribute is just a list of genes that are "like this gene" as Steven has defined, intended for his pan-genes web component that I think Simon is coding.

alancleary commented 1 year ago

@sammyjava Can you explain the distinction between a gene and a pan-gene? I'm mostly wondering if the panGenes field in the Gene type will only have a value if the gene is a pan-gene. If this the case, then it might make sense to make gene and pan-gene distinct types in the GraphQL API, where only the pan-gene type has the panGenes field.

There isn't anything called a "pan-gene" in the mine, the pan-gene sets are just buckets of genes, akin to gene families, as Steven described on Friday. So they're pretty minimal, at least so far. The Gene.panGenes attribute is just a list of genes that are "like this gene" as Steven has defined, intended for his pan-genes web component that I think Simon is coding.

I understand all of that. My question is are pan-genes the only genes that are going to have values for the panGenes field of the Gene type?

sammyjava commented 1 year ago

Oh, as for participation: ALL the genes in PhaseolusMine are in a phaseolus pan-gene set.

sammyjava commented 1 year ago

Granted, some of them are singletons so their panGenes list will be empty. I don't think that's a problem.

sammyjava commented 1 year ago

And looking at the data, there are only 10 Phaseolus genes that are single members of a pan-gene set. Out of 224,811 genes. So it's exceedingly rare!

alancleary commented 1 year ago

Oh, as for participation: ALL the genes in PhaseolusMine are in a phaseolus pan-gene set.

Sorry, I don't think I'm being clear in my question. I want to know if there is a distinction between a gene and a pan-gene. My understanding is that there is a difference, which means some genes (i.e. not pan-genes) won't have a value for the panGenes field of the Gene type. Edit: And the gene's in a pan-gene's panGenes field will not be pan-genes.

sammyjava commented 1 year ago

Oh, as for participation: ALL the genes in PhaseolusMine are in a phaseolus pan-gene set.

Sorry, I don't think I'm being clear in my question. I want to know if there is a distinction between a gene and a pan-gene. My understanding is that there is a difference, which means some genes (i.e. not pan-genes) won't have a value for the panGenes field of the Gene type.

None that I know of once Steven gets pan-gene sets built for all the genera. Right now it's

 Arachis.pan1.4LN9
 Cicer.pan1.SV8C
 Glycine.pan3.YWTW
 Medicago.pan1.XXQ6
 Phaseolus.pan1.X2PC
 Vigna.pan1.X2PC

But honestly, I missed the memo describing what you call "pan-genes." All I know is that pan-gene sets are collections of genes. There are, of course, the "reference" or "canonical" gene sequences associated with each set, but those aren't stored in the mines. If they were, they would certainly be a different object. I'm just loading the buckets with their identifiers.

StevenCannon-USDA commented 1 year ago

@alancleary - This is correct: "some genes (i.e. not pan-genes) won't have a value for the panGenes field of the Gene type." We will only calculate pan-gene sets for genera that have several (probably >=4) annotations. On the other hand, I expect that all genes (even from sparsely represented genera) will be assignable to a gene family.

To put it another way: a pan-gene set is a special type of gene family, aka orthogroup. We'll have orthogroups at two taxonomic depths: one at the genus level (where there is sufficient species coverage) and one at the level of the legume plant families.

sammyjava commented 1 year ago

Hrm, OK. I don't see any issue with having empty panGene lists, though. And clearly for Phaseolus, that happens for only 10 out of 224,811 Phaseolus genes, which seems hardly worth motivating a new Type. Plus, there is nothing called a PanGene in the mine. So it would be a GraphQL creation.

sammyjava commented 1 year ago

Hrm, OK. I don't see any issue with having empty panGene lists, though. And clearly for Phaseolus, that happens for only 10 out of 224,811 Phaseolus genes, which seems hardly worth motivating a new Type. Plus, there is nothing called a PanGene in the mine. So it would be a GraphQL creation.

OHHH Steven's talking about genera that don't have a lot of annotations. Fine. That's on the Datastore (and research), not anything biological. ALL genera have enough species, they just haven't been sequenced and annotated yet. As far as I know!

sammyjava commented 1 year ago

Basically, a "pan-gene" is "a gene from a popular genus like Glycine or Phaseolus with lots of annotations. Not a lonely wallflower like Aeschynomene". Not a biological definition as far as I can tell. Correct, @StevenCannon-USDA ? Or is there something biological about Aeschynomene preventing it from having pan-gene sets?

alancleary commented 1 year ago

Thanks for clarifying @StevenCannon-USDA. I think the follow-up question is do we want pan-genes to be treated like any other gene in LIS? For example, when you perform a gene search should pan-genes be include in the results, or should the results be limited to non-pan-genes? In the latter case, there could be a pan-gene column akin to the gene family column describing what pan-gene a gene is a part of.

sammyjava commented 1 year ago

We could treat them just like we treat gene families. There are genes that are not in a gene family as well. In fact, PhaseolusMine has 11,741 genes that are not assigned to a gene family. We don't have a special name for the genes that are assigned to gene families.

sammyjava commented 1 year ago

But keep in mind that we need the linkout service to provide the list of genes for a pan-gene set if we are to add the pan-gene set to the gene search results. Just as we should have a gene family linkout (which I think is in the works, right @adf-ncgr , if not already in existence?) Correction: I guess the Funnotate gene linkout is essentially a gene family linkout. Rather than pointing to the LegumeMine page for the gene family, which is what I was thinking of for gene families and pan-gene sets.

I think for consistency it would make sense to have linkouts for every column in the gene results table. So a gene family linkout would include a link to the tree and perhaps the mine page; a pan-gene set linkout would include the link to the mine page and any other pan-gene set visualizations we invent.

StevenCannon-USDA commented 1 year ago

@alancleary "do we want pan-genes to be treated like any other gene in LIS?" - My inclination is "no" - I think in agreement with @sammyjava: "treat them just like we treat gene families". The only reason I hesitate is that the gene families that I've recently calculated do use the pan-genes as substrates, taking the same role as genes from e.g. Arabidopsis. Nevertheless, we still have genes, some of which are members of pan-gene buckets and most of which are also assigned to gene families. In other words: genes and buckets (of two types).

adf-ncgr commented 1 year ago

yes, a gene family linkout service is in the works. I think it makes sense to treat the pan-gene analogously to gene family, ie not return them in gene searches (except insofar as genes will indicate their membership in a pan-gene collection as they do for families).

sammyjava commented 1 year ago

OH WOW I finally understand @alancleary 's question! The idea being that we'd EXTEND the search to include all the genes that share pan-gene sets with the genes that get hit by the search! No, I think that would be really confusing, especially on things like identifier searches where it would make no sense. Best to stick with the current gene family practice. With the addition of a pan-gene set linkout service.

alancleary commented 1 year ago

OK, so it sounds like pan-genes should have their own type in the GraphQL API. @sammyjava Will anything discussed here change the InterMine model or will we just make the distinction in GraphQL?

sammyjava commented 1 year ago

OK, so it sounds like pan-genes should have their own type in the GraphQL API. @sammyjava Will anything discussed here change the InterMine model or will we just make the distinction in GraphQL?

I completely don't understand why you'd create a new type for genes that happen to be in PanGeneSets. There is already a PanGeneSet type that contains its Genes. Having a new type of Gene makes no sense to me, just as we don't have another type of Gene that happens to be a member of a GeneFamily. I'm lost. I think you need to take this over.

alancleary commented 1 year ago

I completely don't understand why you'd create a new type for genes that happen to be in PanGeneSets. Having a new type of Gene makes no sense to me

Someone please correct me if I'm wrong: My understanding is that each pan-gene you load into InterMine creates a pan-gene set AND a gene as a representative of the genes in the set. As Steven indicated, these representative genes are the only genes that will have a value for the new panGenes field in the GraphQL Gene type (i.e. the genes the pan-gene represents) and they should be excluded from all regular gene stuff, such as the gene search. So to me it sounds like the API should distinguish these pan-genes (i.e. representatives of a pan-gene set) as different from non-pan-genes (i.e. the genes that may be in a pan-gene set) so users don't have to make the distinction themselves. Am I wrong in my understanding of this?

adf-ncgr commented 1 year ago

@alancleary I don't think we are making anything new for the gene that is the representative/exemplar of a pangene set. The pangene set could refer to it as an exalted member of the set, but we're not going to represent it twice. I think what Steven indicated before (if I'm understanding what you're referring to now) is that not all regular gene will have memberships in pangene sets, not that we'll treat exemplars in a special way.

NB: I'm a little confused myself about what is "the new panGenes field in the GraphQL Gene type" - @sammyjava said earlier "The Gene.panGenes attribute is just a list of genes that are "like this gene" as Steven has defined, intended for his pan-genes web component that I think Simon is coding." but I don't get why we would not just reference the pangene set as we do for gene families (which seems to be what we're mostly agreeing to within this thread- treat pangene set membership just like gene family membership).

alancleary commented 1 year ago

Thanks for the clarification @adf-ncgr. I think the panGenes field is what's been confusing me. Having such a field makes sense to me if there's special genes that represent a pan-gene set. But if you're just trying to get similar genes by way of pan-gene sets then I think the gene->panGeneSets->Genes pattern makes more sense. It's also (somewhat) consistent with how we're already handling gene family assignments: gene->geneFamilyAssignments->geneFamily->genes.

sammyjava commented 1 year ago

Thanks for the clarification @adf-ncgr. I think the panGenes field is what's been confusing me. Having such a field makes sense to me if there's special genes that represent a pan-gene set. But if you're just trying to get similar genes by way of pan-gene sets then I think the gene->panGeneSets->Genes pattern makes more sense. It's also (somewhat) consistent with how we're already handling gene family assignments: gene->geneFamilyAssignments->geneFamily->genes.

True, and that's what I had before I recently added the Gene.panGenes type to aid in the pan-genes search that Simon is working on so that a Gene's fellow pan-genes could be resolved by the Gene resolver within GraphQL. (Recall that that web component says "given a gene, give me its pangenes".) My impression is that we don't want to code graphql type resolution into the web components, we want that all done by GraphQL. Maybe I'm overthinking that constraint.

Right now we're NOT getting the genes in the same gene family as a query gene in a web component. So this is new and different.

I don't like having Gene.panGenes either, and I don't have that collection in the mine model. But if we only resolve types of types in GraphQL, not in web components, I'm not sure how else we do it. If we're willing to write resolution code on the web-component side (given a query gene, get its pan-gene set(s), then get the genes in those pan-gene set(s), excluding the query gene) then we definitely do not need Gene.panGenes. I'm all for that, if that's ok, and will happily yank Gene.panGenes out! :)

alancleary commented 1 year ago

True, and that's what I had before I recently added the Gene.panGenes type to aid in the pan-genes search that Simon is working on so that a Gene's fellow pan-genes could be resolved by the Gene resolver within GraphQL. (Recall that that web component says "given a gene, give me its pangenes".) My impression is that we don't want to code graphql type resolution into the web components, we want that all done by GraphQL. Maybe I'm overthinking that constraint.

Right now we're NOT getting the genes in the same gene family as a query gene in a web component. So this is new and different.

I don't like having Gene.panGenes either, and I don't have that collection in the mine model. But if we only resolve types of types in GraphQL, not in web components, I'm not sure how else we do it. If we're willing to write resolution code on the web-component side (given a query gene, get its pan-gene set(s), then get the genes in those pan-gene set(s), excluding the query gene) then we definitely do not need Gene.panGenes. I'm all for that, if that's ok, and will happily yank Gene.panGenes out! :)

We're not going to bend the GraphQL API to accommodate a single use case. If we were using SQL would we add a foreign key to a table every time we wanted data that wasn't one JOIN away? You can get the pan-genes of a gene by way of a gene's panGeneSets field, so if Simon wants pan-genes in his gene search results then he can write a shim to make the structure of the data match the component. Please yank the panGenes field from the Gene type.

sammyjava commented 1 year ago

We're not going to bend the GraphQL API to accommodate a single use case. If we were using SQL would we add a foreign key to a table every time we wanted data that wasn't one JOIN away? You can get the pan-genes of a gene by way of a gene's panGeneSets field, so if Simon wants pan-genes in his gene search results then he can write a shim to make the structure of the data match the component. Please yank the panGenes field from the Gene type.

Good point! Will yank! (And I was overthinking the GraphQL-does-everything approach.) I guess a good rule of thumb is that if I create a new type under a type that does not exist in the mine, there has to be a fundamental reason for doing so, not simply to provide a list of results for a particular web component.

sammyjava commented 1 year ago

OK, Gene.panGenes has been removed along with associated code in the latest commit to this branch. Feels good.

sammyjava commented 1 year ago

Also note that MiniMine is now 5.1.0.3 in anticipation of this merge. (As well as a few other mines as I chug along, but not LegumeMine.)

alancleary commented 1 year ago

Thanks for the updates and resolving the merge conflict @sammyjava. I have a paper deadline I need to focus on for a couple hours. I'll finish reviewing this PR after that.

sammyjava commented 1 year ago

Thanks, I'll merge to main when LegumeMine 5.1.0.3 gets promoted, which will probably not be until I get back.

sammyjava commented 11 months ago

The PR that keeps on giving! I've committed the Query.traits update that adds trait search on studyType ("GWAS" or "QTLStudy" only), publicationId (a plain DOI like "10.1007/s00122-006-0217-2" or a PMID like "3416125"), or author which is CONTAINS matched against Author.name in the mine, so best to suggest a last name like "Cannon" and accept that Steven and Ethy will both show up in the results. :) (Although Steven is remarkably consistent as "Steven Cannon" while Ethy has three versions in LegumeMine: "Ethalinda K.S. Cannon", "Ethalinda K. S. Cannon", and "Ethalinda K S Cannon".) Author names are a PITA, which is why ORCID was invented.

sammyjava commented 11 months ago

@That-Thing forgot to mention that you're all set for updating your web component to NOT send genus or species to the graphql query. You do have Trait.qtlStudy.organism.species and Trait.qtlStudy.organism.species if you want to display those (there are a few soja GWAS in GlycineMine).

That-Thing commented 11 months ago

Should I do some sort of filtering locally or should I just disable the fields for now?

sammyjava commented 11 months ago

Should I do some sort of filtering locally or should I just disable the fields for now?

If you're hitting GlycineMine you'll only get Glycine results, and I don't think folks care much if they are max or soja. I'd implement the single-genus tweak that you did on the gene search to show only Glycine in the genus selector for GlycineMine (useful for PeanutBase as well where he'll show Arachis) and freeze "-- any --" on the species selector. Then with 5.1.0.4 you can just unfreeze that selector. @alancleary may have a different take on this but I think this is what we were saying yesterday.

sammyjava commented 11 months ago

OK @alancleary I'd like to promote GlycineMine and LegumeMine 5.1.0.3 to production now. My understanding is that to get the production graphql-server updated, which is running off the Docker code, we need to tag a new release here after merging this PR to main. It's been pretty heavily tested against MiniMine 5.1.0.3, including the recent addition of the trait search parameters.

I would presume the new tag would be 1.1.0 since it includes new/changed types and new/changed API methods? I'd think we'd bump that level when we support a new mine version, since those are, by definition, data model changes (I don't version bug fixes or web app updates running off the same data model).

Let me know if I can go ahead and merge this into main; I presume a tag is created separately, as usual. Or do so yourself.

sammyjava commented 11 months ago

By the way, these mine updates and graphql-server update are non-breaking in terms of what we're running on the jekyll-legumeinfo site, and I think that's likely but not guaranteed to generally be the case. They are required for the trait search web component, which is built against GlycineMine 5.1.0.3 and graphql-server:5.1.0.3-model-updates, but that's not in production yet.

I guess another task is to notify the "others" that a new graphql-server tag has been released so they update theirs, namely the ones being run for jekyll PeanutBase (@svengato ?) and the one being run for jekyll SoyBase (@That-Thing ?).

alancleary commented 11 months ago

@sammyjava You've added 15 commits since my last review so I'd like to give it one last review before merging. Shouldn't take too long since I only need to review and test the files that have changed since the last review.

I agree we should tag a release after merging. There's a few steps to that:

  1. Update the dependencies. The dependabot PR backlog is not comprehensive and doesn't include transitive dependencies, and merging them one at a time tends to break things so the best way to do this is manually with npm in a separate branch.
  2. Update the base image version in the Dockerfile (it looks like we should update to the most recent node LTS version) and do a test build and run of the Docker image on you local machine so you know if something is going to go off the rails before trying to do an automated build.
  3. Bump the version number in package.json. I think you're right that 1.1.0 is the logical next version.
  4. Tag a release on GitHub. It's nice when the version bump is the last commit before tagging a release so it's clear what triggered the automation in the repository's Actions log

This is more thinking out loud; the list I just outlined should probably be put somewhere as a Pre-Release Checklist.

sammyjava commented 11 months ago

This is more thinking out loud; the list I just outlined should probably be put somewhere as a Pre-Release Checklist.

The wiki on this repo seems like a good place for that. Let me know if you'd like me to start that list and you can edit it.

alancleary commented 11 months ago

This is more thinking out loud; the list I just outlined should probably be put somewhere as a Pre-Release Checklist.

The wiki on this repo seems like a good place for that. Let me know if you'd like me to start that list and you can edit it.

I think making it a PR template makes sense. You'll be making a PR for the release anyway so it puts all the things the PR should include right in front of your face.

sammyjava commented 11 months ago

OK, you probably saw that I started one in the pull-request-template branch, just copying your out-loud thinking.

alancleary commented 11 months ago

OK, you probably saw that I started one in the pull-request-template branch, just copying your out-loud thinking.

Thanks. An email I just sent reminded me that updating the compatible intermine version, if necessary, should be on the checklist.

sammyjava commented 11 months ago

@That-Thing @svengato @adf-ncgr The graphql-scripts modifications for the 5.1.0.3 mine model are now in the main branch. I've deleted the 5.1.0.3-model-changes branch to avoid confusion.

sammyjava commented 11 months ago

@That-Thing @svengato @adf-ncgr And Alan should be tagging a new release to make it official. I guess that's the change that indicates the servers need to be updated.

StevenCannon-USDA commented 11 months ago

I notice that the linkout service from the Glycine gene search is failing: https://dev.soybase.org/tools/search/gene.html Might that be related to this update? I suspect not, but you're the ones who would know about the linkout service. Let me know if I should open a separate issue.

adf-ncgr commented 11 months ago

@StevenCannon-USDA I just tried it on the page you sent (using example gene Glyma.13G357700) and it seemed to be working fine? can you check again and describe in more detail how to reproduce if it still seems to be a problem?

alancleary commented 11 months ago

Can we move this conversation to a separate issue? This PR is already pretty muddled up!

StevenCannon-USDA commented 11 months ago

@adf-ncgr @alancleary The linkout service from that page works fine now; maybe someone fixed something or maybe it was a transient glitch. Sorry for the tangent.