japonicusdb / japonicus-config

Configuration for JaponicusDB

mapping GO annotation for japonicus genes with paralogs #32

Closed ValWood closed 3 years ago

ValWood commented 3 years ago

We can use all of the GO annotation on the orthologs for Tables 1&3 here:

https://github.com/japonicusdb/japonicus-config/issues/27

Even though they are many to many etc.

Instead, if we have a file equivalent to

pombe-embl/goa-load-fixes/filtered_mappings

I will add the Pombase IDs to be blocked to that file.

In fact, I have a question. Are we using this file of filters for japonicus? (We should be.) And filtered_GO_IDs?

ValWood commented 3 years ago

Note: Once this is done I will ask people to look at the slims.

kimrutherford commented 3 years ago

> In fact, I have a question. Are we using this file of filters for japonicus? (We should be.) And filtered_GO_IDs?

We are, sort of. When I created the JaponicusDB load script, I made copies of the pombe files because I didn't know whether japonicus needed exactly the same IDs. So currently we're using these two files, which are just copies of what was in the pombe files 2 months ago: https://github.com/japonicusdb/japonicus-config/tree/main/goa-load-fixes

Should I switch to using the pombe versions, or are you going to need japonicus-specific IDs in these files?

ValWood commented 3 years ago

Is it possible to use the PomBase versions? I think I will be able to filter things which only need filtering for japonicus (if we come across them) using the PomBase:identifier in the "with" field without affecting the PomBase annotations.

I will test this with the functionally diverged proteins I identified, where we don't want the PomBase GO annotation to be transferred.

We can have a rethink if this ever doesn't work, but I think there should always be a way for me to suppress mappings uniquely from either system. This could change if data is ever PAINTED (because there will be no unique ID to filter if I only want to suppress an annotation from one species).

If that occurs we could add supplementary species-specific filter files.

So yes, please use the current PomBase files.
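To illustrate the idea of suppressing a mapping through the identifier in the "with" field, here is a minimal Python sketch. It is not the actual load code: the function names, the row layout (plain lists holding the first eight GAF columns) and the identifiers are all made up for the example.

```python
from typing import Iterable, List, Set

WITH_COLUMN = 7  # GAF column 8 ("With/From"), counted from zero


def with_ids(row: List[str]) -> Set[str]:
    """Return the identifiers listed in a row's With/From column."""
    field = row[WITH_COLUMN]
    # Entries may be separated by "|" or ","; treat both the same here.
    parts = field.replace(",", "|").split("|")
    return {part.strip() for part in parts if part.strip()}


def drop_blocked(rows: Iterable[List[str]], blocked: Set[str]) -> List[List[str]]:
    """Keep only rows whose With/From column names no blocked identifier."""
    return [row for row in rows if not (with_ids(row) & blocked)]


# Made-up rows, truncated to the first eight GAF columns.
rows = [
    ["JaponicusDB", "GENE_A", "a1", "", "GO:EXAMPLE1", "REF:1", "ISO", "PomBase:EXAMPLE.01"],
    ["JaponicusDB", "GENE_B", "b1", "", "GO:EXAMPLE2", "REF:1", "ISO", "PomBase:EXAMPLE.02"],
]
kept = drop_blocked(rows, {"PomBase:EXAMPLE.01"})
print([row[1] for row in kept])  # prints ['GENE_B']
```

Because the blocked identifier is matched only against the "with" field of the transferred annotation, the source annotation itself is left untouched, which is the property relied on above.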

kimrutherford commented 3 years ago

I changed to using the PomBase versions of the filter files (in pombe-embl/goa-load-fixes) a few days ago.

kimrutherford commented 3 years ago

> We can use all of the GO annotation on the orthologs for Tables 1&3 here

I'll have to think about how to do that. Currently all the orthologs are loaded into Chado, and then there is a processing step that uses the orthologs in Chado to map the annotation to japonicus. We might need to give the orthologs from Tables 1&3 their own reference/PubMed ID so that the annotation mapping code knows which orthologs to use.

ValWood commented 3 years ago

We don't need to use the info in tables 2&3. You can map across all the data from the ortholog, whatever the mapping type (many to many etc.). I will filter any that we don't want to see using the pombe-embl/goa-load-fixes file.

We can discuss this on the PomBase call if it isn't clear, but there were so few annotations that we did not want to infer that it seemed a shame for you to have to implement a complicated mapping system when we already have a system in place to filter things we don't want.

ValWood commented 3 years ago

I tested this yesterday on http://japonicusdb.kmr.nz/gene/SJAG_00333, where I wanted to filter the annotations from P32464, so I added this ID to the pombe-embl/goa-load-fixes/filtered_mappings file.

It isn't blocked yet, but I may have added it too late, or perhaps it did not parse because it is not in the form Database:ID?

kimrutherford commented 3 years ago

> It isn't blocked yet, but I may have added it too late, or perhaps it did not parse because it is not in the form Database:ID?

Yep, it needs to be: UniProtKB:P32464

and add any replacements to the ???? file.

We don't have a manual GAF file for loading yet.
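A small check like the following could catch bare accessions before a load. This is only a sketch: it assumes one identifier per line with optional `#` comments, which may not match the real filtered_mappings layout, and `check_filter_entries` is a hypothetical helper, not part of the load scripts.

```python
import re
from typing import Iterable, List, Tuple

# A prefix starting with a letter, then ":", then a non-empty local ID.
PREFIXED_ID = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*:\S+$")


def check_filter_entries(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split filter entries into (valid Database:ID entries, rejected entries)."""
    valid: List[str] = []
    rejected: List[str] = []
    for raw in lines:
        entry = raw.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments (assumed conventions)
        if PREFIXED_ID.match(entry):
            valid.append(entry)
        else:
            rejected.append(entry)
    return valid, rejected


valid, rejected = check_filter_entries(["UniProtKB:P32464", "P32464", "", "# note"])
print(valid)     # prints ['UniProtKB:P32464']
print(rejected)  # prints ['P32464']
```

Reporting the rejected entries rather than silently ignoring them would make a missing prefix visible straight away.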

ValWood commented 3 years ago

Durr, I was confused by the fact that the prefix wasn't on the web page, but we never display that! Anyway, I will check it works with this example and use the same method to suppress off-target GO annotations after import.

If there is any annotation to keep (i.e., I don't want to filter ALL annotations for a particular ID, only a subset), I will make the appropriate annotation manually (this is what I have been doing with filters for years with PomBase). It is easy to manage this way.

kimrutherford commented 3 years ago

> You can map across all the data from the ortholog

Just to make sure: does that include the Compara orthologs?

ValWood commented 3 years ago

For any of the orthologs we are displaying (rhind/GO/manual), all of the GO data can be transferred from the fission yeast orthologs (using the same redundancy filtering pipeline).

I will check that the slims make sense, and filter any erroneous transfers for the multi-gene families.

kimrutherford commented 3 years ago

I've had a go at this. It seems to be working correctly - all the annotation is transferred and then duplicates and less specific terms are removed. Let me know if you spot any problems. I'll leave this open for now.
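As a rough illustration of "transfer everything, then remove duplicates and less specific terms", here is a minimal Python sketch. It is not the actual processing step: the term names and the `ancestors` map are toy data, and the real pipeline may take other things into account that are ignored here.

```python
from typing import Dict, Iterable, Set, Tuple

Annotation = Tuple[str, str]  # (gene, term)


def remove_redundant(
    annotations: Iterable[Annotation],
    ancestors: Dict[str, Set[str]],
) -> Set[Annotation]:
    """Drop duplicate annotations and any term that is an ancestor of
    another term annotated to the same gene."""
    by_gene: Dict[str, Set[str]] = {}
    for gene, term in annotations:
        by_gene.setdefault(gene, set()).add(term)  # the set removes duplicates

    kept: Set[Annotation] = set()
    for gene, terms in by_gene.items():
        # Every term already implied by a more specific term on this gene.
        implied: Set[str] = set()
        for term in terms:
            implied |= ancestors.get(term, set())
        kept |= {(gene, term) for term in terms - implied}
    return kept


# Toy hierarchy: T:specific is a child of T:general, which is a child of T:root.
ancestors = {
    "T:specific": {"T:general", "T:root"},
    "T:general": {"T:root"},
}
transferred = [
    ("g1", "T:specific"),
    ("g1", "T:general"),   # less specific than T:specific on the same gene
    ("g1", "T:specific"),  # exact duplicate
    ("g2", "T:general"),
]
print(sorted(remove_redundant(transferred, ancestors)))
# prints [('g1', 'T:specific'), ('g2', 'T:general')]
```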

ValWood commented 3 years ago

Yes, a good improvement and everything looks good. The numbers of genes mapped to the slim are improved:

[Image: Screenshot 2021-07-13 at 12 33 49]

and getting closer to PomBase:

[Image: Screenshot 2021-07-13 at 12 34 36]

I would like to know how many GO annotations we have (after filtering for redundancy and the filters).

I would also like to know how many GO annotations we had (after filtering for redundancy and the filters) before we did any annotation transfer from PomBase orthologs.

I need this for the paper, as I'd like to say how much non-redundant GO annotation there was before we added value by inference from PomBase. I can't get this number unless you do a load without any inference. (It isn't the same as the number in the file from GOA, as that is pre-filtering and is likely to be larger than the number we end up with after adding value, which would be confusing!)

It might also be good to say how many genes mapped to slim categories before and after adding value, i.e., the equivalent of the slim-mapping gene counts that I posted above. This is an easy way to show the improvement in specificity and coverage.
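The before/after comparison could be tabulated along these lines. This is a hypothetical Python sketch only: it assumes the annotations from a load without inference and from a load with inference are available as (gene, term) pairs, and that a `term_to_slim` map from terms to slim categories exists. All names and data below are made up.

```python
from typing import Dict, Iterable, Set, Tuple

Annotation = Tuple[str, str]  # (gene, term)


def slim_coverage(
    annotations: Iterable[Annotation],
    term_to_slim: Dict[str, Set[str]],
) -> Dict[str, int]:
    """Count the distinct genes that map to each slim category."""
    genes_per_slim: Dict[str, Set[str]] = {}
    for gene, term in annotations:
        for slim in term_to_slim.get(term, set()):
            genes_per_slim.setdefault(slim, set()).add(gene)
    return {slim: len(genes) for slim, genes in genes_per_slim.items()}


def compare_coverage(
    before: Iterable[Annotation],
    after: Iterable[Annotation],
    term_to_slim: Dict[str, Set[str]],
) -> Dict[str, Tuple[int, int]]:
    """Return {slim category: (genes before, genes after)}."""
    b = slim_coverage(before, term_to_slim)
    a = slim_coverage(after, term_to_slim)
    return {slim: (b.get(slim, 0), a.get(slim, 0)) for slim in sorted(set(b) | set(a))}


# Toy data: two slim categories and three genes.
term_to_slim = {"T:1": {"slim_x"}, "T:2": {"slim_x", "slim_y"}}
before = [("g1", "T:1")]
after = [("g1", "T:1"), ("g2", "T:2"), ("g3", "T:2")]

print(len(set(before)), len(set(after)))  # prints 1 3
print(compare_coverage(before, after, term_to_slim))
# prints {'slim_x': (1, 3), 'slim_y': (0, 2)}
```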

There are only a few genes which do not get GO processes where I expected they would. I'll open a curation tracker ticket for these and work through them.

ValWood commented 3 years ago

If an annotation propagated from an ortholog is incorrect and comes from GOA, it can be filtered using the filtered_mappings file, e.g., https://www.uniprot.org/uniprot/B6JVC6

If an incorrect annotation is propagated from PomBase (i.e., post filtering), I need to override it with a NOT annotation (and add any replacements in the manual GAF).

So, a mechanism exists, no action required. Phew!