pombase / canto

The PomBase community curation tool
https://curation.pombase.org

Next steps for GAF export #2255

Closed ValWood closed 3 years ago

ValWood commented 4 years ago

https://github.com/pombase/canto/issues/2098

I'm assuming the DB namespace and assigned_by are fixed. The file seems fine to me. At some point fairly soon we will need to generate GPAD for submission, but that hasn't happened yet, so I think we can continue with the GAF for now.

  1. has_qualifier(PBHQ:002), currently in column 17, should be in the qualifier column (4) and should resolve to either NOT, contributes_to or colocalizes_with. Hopefully we will not have used colocalizes_with.

  2. Taxon column. For most GO terms only one taxon ID is required. However, for GO terms in the "GO:0044419 interspecies interaction between organisms" branch (i.e. this term and ALL descendants), any annotation should have the annotated gene taxon PLUS the taxon of the species that the interaction is with.

  3. Create a permanent location to export the file to regularly (i.e. monthly) so GO can pick it up

  4. Find out if GO want the TAS (we might filter those for submission)

  5. Ensure that only annotations from approved sessions are exported

  6. Sanity check the annotation after submission (GO rules will report any errors)

jseager7 commented 4 years ago

@kimrutherford Expanding on the above, I think you'll need to look at the following tasks:

  1. Move has_qualifier extensions from column 16 (Annotation Extension) into column 4 (Qualifier) of the GAF export, and replace the PBHQ ontology term IDs with their labels / values, as described in the GO documentation. (I would put an example here, but I don't know how to get the labels for these PBHQ terms.)

  2. For GO:0044419 'interspecies interaction between organisms' and its children, include the taxon IDs for both species of the interaction in the Taxon column (13), separated by a pipe. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction (see GO documentation).

  3. Add an option to canto_export.pl to only export approved sessions when using the gaf export mode, analogous to the --dump-approved option for the canto-json export type.
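As a rough illustration of task 1, here is a minimal Python sketch (not Canto's actual Perl exporter) that moves has_qualifier() extensions from column 16 into column 4 of a 17-column GAF row. The function name `move_qualifiers` and the `pbhq_labels` mapping are hypothetical; the real PBHQ labels are not given in this thread, so the caller has to supply them.

```python
import re

# Matches a has_qualifier() extension such as "has_qualifier(PBHQ:002)".
HAS_QUALIFIER = re.compile(r"has_qualifier\((PBHQ:\d+)\)")


def move_qualifiers(columns, pbhq_labels):
    """Return a copy of a 17-column GAF row with has_qualifier() extensions
    moved from column 16 (index 15) into column 4 (index 3).

    pbhq_labels maps PBHQ term IDs to GAF qualifiers. It is an assumption of
    this sketch: the actual ID-to-label pairs must come from the PBHQ ontology.
    """
    row = list(columns)
    qualifiers = [q for q in row[3].split("|") if q]
    kept = []
    # Comma-separated extension parts belong to the same annotation.
    for part in row[15].split(","):
        part = part.strip()
        if not part:
            continue
        match = HAS_QUALIFIER.fullmatch(part)
        if match:
            qualifiers.append(pbhq_labels[match.group(1)])
        else:
            kept.append(part)
    row[3] = "|".join(qualifiers)   # multiple qualifiers are pipe-separated
    row[15] = ",".join(kept)
    return row
```

An unknown PBHQ ID raises a KeyError here, which seems preferable to silently exporting an unresolved term ID.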

I can look at finding a permanent location to host the export file. One of our GitHub repositories might be a good option.

I'm not sure if we need to contact GO directly to find out if they need the Traceable Author Statement evidence code.

jseager7 commented 4 years ago

Also, the issue with wrong database name in the DB and Assigned By columns should be fixed now, since I added the correct configuration in our canto_deploy file:

export:
  gene_association_fields:
    db: "PHI-base"
    assigned_by: "PHI-base"

ValWood commented 4 years ago

@pgaudet @vanaukenk

We are mainly curating pathogen host interactions, but we do the GO annotation if the experiments support it. However, sometimes we add TAS for GO terms from the introductions so we can fill annotation gaps. Should we include these in the submission, or would you rather only take the experimental annotations?

pgaudet commented 4 years ago

I would say we favor EXP but TAS are still accepted ? With the idea that if a curator annotates a gene, TAS can gradually be replaced.

@vanaukenk do you agree ?

vanaukenk commented 4 years ago

Yes, I agree - EXP is most desirable, but if TAS annotations are filling in annotation gaps, we would certainly take them.

I would not make TAS annotations from introductions just for the sake of making more annotations, though.

ValWood commented 4 years ago

For these it would either be TAS or nothing. So although we are looking at these specific genes, we are not necessarily curating the papers which characterized the gene's role (e.g. catalytic activity) for one of the species. However, the role is usually clearly stated by the authors in the intro.

We can't go on to curate these papers, because the grant is aimed at pathogen-host interaction phenotypes (and we decided it would be easy to do GO for these at the same time, because it is built into our tool).

The advantage for GO is that these papers will provide a link to the experimental papers. We aren't consistently checking if the annotation is already made in GO, so there could be redundancy; maybe we should do this. Often these things are unannotated when I have looked, though.

It might be best if we only submit the experimental annotations for now. Later if/when GO applies filtering for TAS we could submit and then they would only be visible if there is an annotation gap.
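Submitting only the experimental annotations could be done as a post-processing step on the exported file. The sketch below is an assumption about how that might look, not an existing Canto option: it keeps header lines and any row whose evidence code (column 7) is one of the standard GO experimental codes, so TAS rows are dropped.

```python
# GO experimental evidence codes (high-throughput codes are left out here).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimental_only(gaf_lines):
    """Yield GAF header lines plus rows with an experimental evidence code."""
    for line in gaf_lines:
        if line.startswith("!"):
            # Header/comment lines are passed through unchanged.
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        # Column 7 (index 6) holds the evidence code.
        if len(columns) > 6 and columns[6] in EXPERIMENTAL_CODES:
            yield line
```

Because it is a generator over lines, it can be fed an open file handle and written straight back out.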

@jseager7 for PHI-base more generally, you could import the GO annotations for the genes of interest into PHI-base.

kimrutherford commented 4 years ago

any annotation should have the annotated gene taxon PLUS the taxon of the species that the interaction is with.

I'm going to need help with that. Are we exporting interactions or GO annotations? I'm missing something here.

ValWood commented 4 years ago

sorry I wasn't very clear

From http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/#annotation-extension-column-16:

Taxon (column 13): Taxonomic identifier(s). For cardinality 1, the ID of the species encoding the gene product. For cardinality 2, to be used only in conjunction with terms that have the biological process term multi-organism process or the cellular component term host cell as an ancestor. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction.

Here "interaction" refers to "pathogen host interaction" ("GO:0044419 interspecies interaction between organisms"), not to a physical interaction.

The GO docs should probably be updated to make this clearer? @pgaudet @vanaukenk
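To make the cardinality rule concrete, here is a small Python sketch of how the Taxon column value could be built. The helper name `taxon_column` is hypothetical; 4896 is the S. pombe taxon ID that appears in the sample rows in this thread, and 9606 (Homo sapiens) is used only as a stand-in for the second organism.

```python
def taxon_column(gene_taxon_id, other_taxon_id=None):
    """Build the GAF Taxon column (13).

    Cardinality 1: the taxon of the organism encoding the gene product.
    Cardinality 2: that taxon, a pipe, then the other organism's taxon.
    """
    ids = [gene_taxon_id]
    if other_taxon_id is not None:
        ids.append(other_taxon_id)
    # int() rejects anything that is not a whole number.
    return "|".join(f"taxon:{int(taxon_id)}" for taxon_id in ids)


print(taxon_column(4896))        # single-organism annotation
print(taxon_column(4896, 9606))  # annotation in the GO:0044419 branch
```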

ValWood commented 4 years ago

@vanaukenk Note, our GPAD ticket will be implemented over the next couple of months ;) https://github.com/pombase/pombase-chado/issues/276 (Opened on Dec 11 2013!)

ValWood commented 4 years ago

In fact it's even older; it came from the previous. Have we really been talking about GPAD for a decade?

jseager7 commented 4 years ago

Here "interaction" refers to "pathogen host interaction" "GO:0044419 interspecies interaction between organisms" not to a physical interaction.

@ValWood Does Canto allow GO annotations that refer to two organisms? This implies GO annotations to a metagenotype, but which annotation type should we use for this, if not physical interaction?

ValWood commented 4 years ago

Ah good point.

It would be used for any GO annotation in this branch GO:0044419. I was thinking that we could infer it, but we can't, because we often have multiple host/pathogens per session.

So for example, if we were annotating a symbiont gene A to GO:0140423 effector-mediated suppression of pattern-triggered immunity signaling, we would select the plant 'host species'.

Although this isn't enforced by GO, these GO annotations will be infinitely more useful if the pathogen process is coupled to the host in which it is residing. Without this, all we know is that it is "some plant", because PTI is a plant-specific process. Depending on the term we might not even be able to know if the host is a plant or a human!

The easiest way to include this is using an annotation extension, which should then require minimal coding (may even be configurable?).

with_host_species
with_symbiont_species

depending on whether the gene product being annotated is symbiont or host.

Then allow selection of the species by populating the field with the session species list. Instead of reporting this as an annotation extension in column 17, use it to populate the taxon column.
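The idea of using the extension to populate the taxon column, rather than exporting it as an extension, could look roughly like the Python sketch below. It assumes the extension value is a bare NCBI taxon ID, e.g. with_host_species(222), and that extensions live in index 15 of a 17-column row; both are assumptions of this example, and the function name is hypothetical.

```python
import re

# Assumed extension syntax: with_host_species(<taxon id>) or
# with_symbiont_species(<taxon id>).
SPECIES_EXT = re.compile(r"with_(?:host|symbiont)_species\((\d+)\)")


def apply_species_extension(columns):
    """Return a copy of a GAF row where a species extension has been turned
    into the second taxon of the Taxon column (index 12)."""
    row = list(columns)
    other_taxon = None
    kept = []
    for part in row[15].split(","):
        part = part.strip()
        if not part:
            continue
        match = SPECIES_EXT.fullmatch(part)
        if match:
            other_taxon = match.group(1)
        else:
            kept.append(part)   # unrelated extensions stay in place
    if other_taxon is not None:
        gene_taxon = row[12].split("|")[0]
        row[12] = f"{gene_taxon}|taxon:{other_taxon}"
    row[15] = ",".join(kept)
    return row
```

Rows without a species extension come back unchanged, so the function can be applied to every annotation in the export.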

jseager7 commented 4 years ago

Okay, the one problem I can see with the above is that annotation extensions don't currently support a Taxon ID as a range type, so that would need changing. In fact, we might need additional restrictions so that only pathogen taxon IDs are suggested for with_symbiont_species and only host taxon IDs are listed for with_host_species. That implies two new range types, which I'll tentatively name PathogenID and HostID.

We could work around this in the short term by setting the range type to free text, and typing in the taxon ID manually.

Here's what the config might eventually look like, assuming we add the new range types:

| domain ID | subset relation | extension relation | range ID | Canto display text | Help text | cardinality | role |
|---|---|---|---|---|---|---|---|
| GO:0044419 | is_a | with_symbiont_species | PathogenID | with symbiont species | | 0,1 | user |
| GO:0044419 | is_a | with_host_species | HostID | with host species | | 0,1 | user |

jseager7 commented 4 years ago

What I've said above also assumes you actually want to add the pathogen or host (on the other side of the interaction) as an organism to the session. I'm assuming that it's not a problem for you to add an organism solely to have its taxon ID available for a GO annotation extension (meaning the organism won't necessarily be used for genotypes, metagenotypes, etc).

If you'd rather not do that, then maybe free text would be a better option? Maybe with some constraints to keep the value numeric, at least.
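If free text is used, a constraint that keeps the value numeric might be as simple as the following Python sketch. It only checks the shape of the input (a positive whole number); it does not confirm that the ID exists in the NCBI Taxonomy, and the function name is hypothetical.

```python
import re

# A positive integer with no sign, decimal point or leading zero.
TAXON_ID = re.compile(r"[1-9][0-9]*")


def parse_taxon_id(text):
    """Return the taxon ID as an int, or raise ValueError if the free-text
    value does not look like an NCBI taxonomy ID."""
    value = text.strip()
    if not TAXON_ID.fullmatch(value):
        raise ValueError(f"not a valid taxon ID: {text!r}")
    return int(value)
```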

ValWood commented 4 years ago

Here's what the config might eventually look like, assuming we add the new range types:

Config constrained for host or pathogen as above would be perfect.

What I've said above also assumes you actually want to add the pathogen or host (on the other side of the interaction) as an organism to the session.

It won't be a problem to add an organism solely for use in the extension in GO, but in 90% of cases it will already be in the session for the pathogen host interaction phenotypes.

jseager7 commented 4 years ago

@kimrutherford How difficult is it going to be to add new range types for pathogen and host taxon IDs, as described above?

kimrutherford commented 4 years ago

How difficult is it going to be to add new range types for pathogen and host taxon IDs, as described above?

It's a bit of work as the extension constraint code is complicated. It handles a lot of cases already.

A quick fix is to use "Number" as the Range ID in the configuration.

jseager7 commented 4 years ago

@ValWood are you okay with just using unvalidated numbers in the short term?

Also, can you think of a way to make it clear to the curator that we're expecting an NCBI taxonomy ID, and not some other ID scheme? Currently, I can't see any way for Canto to indicate anything beyond the data type it expects. I've noticed that the annotation extension configuration has a slot for help text that seems currently unused – might this help?

Otherwise, we could be verbose and name the annotation extension unambiguously, for example 'with symbiont species (NCBI Taxonomy ID)'.

ValWood commented 4 years ago

Also, can you think of a way to make it clear to the curator that we're expecting an NCBI taxonomy ID, and not some other ID scheme? Currently, I can't see any way for Canto to indicate anything beyond the data type it expects. I've noticed that the annotation extension configuration has a slot for help text that seems currently unused – might this help?

Can't we just configure the list to show the species names already in the session and convert to taxon ID behind the scenes?

jseager7 commented 4 years ago

Can't we just configure the list to show the species names already in the session and convert to taxon ID behind the scenes?

That would be best, but Kim was saying that fetching the organism names was going to be more difficult than having the curator simply enter a number.
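For reference, the "convert behind the scenes" step itself is a plain lookup once the session's organisms are available; the hard part Kim describes is getting that list into the extension code. A minimal Python sketch, assuming the session organisms arrive as a name-to-taxon-ID mapping (a hypothetical structure, not Canto's data model):

```python
def taxon_id_for(scientific_name, session_organisms):
    """Look up the taxon ID for a species name chosen by the curator.

    session_organisms is assumed to map scientific names to NCBI taxon IDs.
    """
    lookup = {name.casefold(): taxon_id
              for name, taxon_id in session_organisms.items()}
    key = scientific_name.strip().casefold()
    if key not in lookup:
        raise ValueError(f"organism not in session: {scientific_name!r}")
    return lookup[key]


# Example session contents (illustrative only).
organisms = {"Schizosaccharomyces pombe": 4896, "Homo sapiens": 9606}
print(taxon_id_for("Homo sapiens", organisms))
```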

ValWood commented 4 years ago

Ah I see. I think @kimrutherford said this is more complicated. It's OK, we can just wait until it can be implemented. It isn't mandatory (although it probably should be), and it will be quick for me to go into the completed sessions and add it once available.

kimrutherford commented 4 years ago

'with symbiont species (NCBI Taxonomy ID)'.

That seems fine to me. Later on we can implement a drop-down of host or pathogen organisms. But this would work for now. Any large typos will be noticed when the Canto data is loaded into PHI-base.

I've just added TaxonID, HostTaxonID and PathogenTaxonID as aliases for Number in the extension configuration so we can start using them where it makes sense.

kimrutherford commented 4 years ago

Add an option to canto_export.pl to only export approved sessions when using the gaf export mode, analogous to the --dump-approved option for the canto-json export type.

I've changed the GAF exporter to obey the --dump-approved flag so now you can do:

./script/canto_export.pl gaf --annotation-type=molecular_function --dump-approved > output.gaf

kimrutherford commented 4 years ago

has_qualifier(PBHQ:002) currently in column 17 should be in the qualifier column (4) and should resolve (to either NOT, contributes_to or colocalizes_with)

That's done now. I've exported the GO annotations from approved sessions in the pombe Canto as a test. Please let me know if you spot anything dodgy:

https://send.firefox.com/download/acaf81db57c043d1/#JOcpMp1V6h5p8IE057j2JQ

ValWood commented 4 years ago

Looks good to me! I did find an issue, but I will open a ticket for that (it identified an existing problem!)

ValWood commented 4 years ago

It is an issue here. Basically, don't export "residue(4-63)" annotations; these aren't GO compliant.

mf.gaf:PomBase SPBC27B12.02 mis19 GO:0005515 PMID:24774534 IPI PomBase:SPCC970.12 F centromere protein Mis19/Eic1 eic1|kis1|SPBC30B4.10 protein taxon:4896 20140605 PomBase residue(4-63)

There are only a few of them.
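One way to leave these out is to drop any row whose extension column contains a residue() relation. The Python sketch below assumes rows are already split into 17 columns with extensions at index 15; it is an illustration, not the exporter's actual code.

```python
import re

# residue(...) at the start of the column or after a ',' or '|' separator.
RESIDUE_EXT = re.compile(r"(?:^|[,|])\s*residue\(")


def drop_residue_annotations(rows):
    """Return only the rows that have no residue() annotation extension."""
    return [row for row in rows if not RESIDUE_EXT.search(row[15])]
```

Dropping the whole row, rather than just stripping the extension, avoids exporting an annotation that is less specific than what the curator recorded. Whether that is the right trade-off is a curation decision.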

ValWood commented 4 years ago

Also

a) there is a blank line in the file.

b) you are using 'protein' for the annotated object:

mf.gaf:PomBase SPNCRNA.1709 SPNCRNA.1709 GO:0030566 PMID:28432181 IMP F U2 snRNA-specific guide RNA proteintaxon:4896 20170707 PomBase part_of(GO:0031120)

which should be ncRNA for RNAs

c) In the line above there does not seem to be a space between protein and taxon? (but when I grep for "proteintaxon" I don't get anything so maybe it is fine?)

Otherwise all the content seems fine to me.
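Checks like (a) and (c) are easy to automate before submission. The following Python sketch reports blank lines, rows that do not have the 17 tab-separated columns of GAF 2.1 (which is what a missing tab between two fields would cause), and taxon columns without the taxon: prefix. It is a hypothetical pre-flight check, not a replacement for the GO rules.

```python
def check_gaf(lines):
    """Return a list of (line number, problem) tuples for a GAF 2.1 file."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.startswith("!"):
            continue  # header or comment line
        if not text.strip():
            problems.append((number, "blank line"))
            continue
        columns = text.split("\t")
        if len(columns) != 17:
            problems.append(
                (number, f"expected 17 columns, found {len(columns)}"))
            continue
        if not columns[12].startswith("taxon:"):
            problems.append(
                (number, "taxon column does not start with 'taxon:'"))
    return problems
```

Running it on the real tab-separated file would show whether "proteintaxon" is a genuine missing tab or just an artefact of how the line was pasted.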

kimrutherford commented 4 years ago

Basically, don't export "residue(4-63)" annotations, these aren't GO compliant.

Will you be using residue() in PHI-Canto sessions?

ValWood commented 4 years ago

nope

ValWood commented 4 years ago

I don't think so anyway...

kimrutherford commented 4 years ago

I think none of the problems you found will be an issue for PHI-Canto? In that case I think we can leave this as low priority.

kimrutherford commented 3 years ago

I think the next step is to export the GO annotation from some approved sessions and submit a GAF file to GO. Is there anything that needs fixing before trying that?

jseager7 commented 3 years ago

Is there anything that needs fixing before trying that?

@kimrutherford I think we still need to include more taxon IDs for annotations which are children of GO:0044419. I've copied the relevant sections from the above comments below:

Taxon column. For most GO terms only one taxon ID is required. However, for GO terms in the "GO:0044419 interspecies interaction between organisms" branch (i.e. this term and ALL descendants), any annotation should have the annotated gene taxon PLUS the taxon of the species that the interaction is with.

For GO:0044419 'interspecies interaction between organisms' and its children, include the taxon IDs for both species of the interaction in the Taxon column (13), separated by a pipe. The first taxon ID should be that of the organism encoding the gene or gene product, and the taxon ID after the pipe should be that of the other organism in the interaction (see GO documentation).

ValWood commented 3 years ago

I think we might need an option in the GO workflow (species drop-down) to select the additional taxon for the host or pathogen if there is more than one host/pathogen in the session...

jseager7 commented 3 years ago

Just noticed an earlier comment where I described the configuration for the annotation extensions: https://github.com/pombase/canto/issues/2255#issuecomment-606510923, so it looks like it's up to me to implement this now.

I think the initial implementation will have curators enter the taxon IDs as numbers (see here), and later on we can add the drop-down of species.

ValWood commented 3 years ago

Does the data pulled in with the UniProt entry store the taxon ID? If not, this should be possible to add, since we get species from here? Then a species drop-down could easily be converted to a taxon ID.

jseager7 commented 3 years ago

It's possible, but more difficult than just entering numbers. There's previous discussion about this starting here: https://github.com/pombase/canto/issues/2255#issuecomment-607703367. If you think that entering taxon IDs manually is going to be too awkward, then we can wait until we've implemented a species drop-down here before enabling the extensions.

ValWood commented 3 years ago

It will be awkward. It will always involve going to look up the taxon ID (and that isn't as easy as it should be).

This isn't urgent because dual taxon isn't mandatory in the GO file for these terms (although I am arguing for it to be so!).

So we can wait.

jseager7 commented 3 years ago

I've started work on adding drop-downs to select organisms from the session by their scientific names. I'll let you know when it's ready to test.

ValWood commented 3 years ago

are you trying to speak to us @tberardini ? ;)

tberardini commented 3 years ago

Sorry about that - early morning glitches - too much multi-tasking.

jseager7 commented 3 years ago

The changes are finished now (thanks to help from Kim Rutherford), so Canto now supports selecting organisms in the session by their scientific name. This isn't visible on the PHI-Canto servers yet, since (I think) it needs the ontologies to be reloaded first, but here's how it should look:

image

The next step will be to export these additional taxon IDs as part of the GAF file.

ValWood commented 3 years ago

Ooh, I see some PHI-base GO annotation in GOA! https://www.ebi.ac.uk/QuickGO/annotations?goUsage=descendants&goUsageRelationships=is_a,part_of,occurs_in&goId=GO:0140415

jseager7 commented 3 years ago

I've opened new issues on the relevant trackers (linked above) to cover the remaining syntax errors in the GAF file. After they're all fixed we can submit again.

jseager7 commented 3 years ago

All the GAF issues have been fixed and we've contacted the EBI to get our GO annotations loaded into the GOA database.