pombase / pombase-chado

PomBase code for accessing Chado
MIT License

Make an interactions file for BioGRID after each release #411

Open pombase-admin opened 9 years ago

pombase-admin commented 9 years ago

This should be automated somehow. All new interaction annotations will come from Canto. That may help as those annotations always have dates attached.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

Add a query after each load that reports duplicates for symmetrical interaction types including binding annotations + IPI + with.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

As a first step I'm adding a check for duplicate interactions. I must be misunderstanding things as there seem to be a lot of duplicates: https://www.dropbox.com/s/qz3n89tsiasoz9c/chado-load-warnings-2015-02-17.txt?dl=0

In that file - "already exists: ..." is a straightforward duplicate.

"already exists for symmetrical relation: ..." means the relation is symmetrical and there's a duplicate with the bait/prey (subject/object) swapped around.
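For illustration, here is a minimal Python sketch of how the two kinds of warning could be told apart. It is not the loader's actual code: the tuple layout, the function name `find_duplicates` and the contents of `SYMMETRICAL_TYPES` are all assumptions, and the real check runs against Chado.

```python
from typing import Iterable, List, Tuple

# (subject, object, interaction type, reference); layout assumed for illustration
Interaction = Tuple[str, str, str, str]

# Placeholder membership: which types count as symmetrical is an assumption here
SYMMETRICAL_TYPES = {"Positive Genetic", "Negative Genetic"}


def find_duplicates(interactions: Iterable[Interaction]) -> List[str]:
    """Return warning lines in the style of the load warnings file."""
    seen = set()
    warnings = []
    for subj, obj, int_type, ref in interactions:
        key = (subj, obj, int_type, ref)
        swapped = (obj, subj, int_type, ref)
        if key in seen:
            # Identical annotation seen before: a straightforward duplicate
            warnings.append(f"already exists: {subj} {int_type} {obj} {ref}")
        elif int_type in SYMMETRICAL_TYPES and swapped in seen:
            # Same annotation with subject and object swapped
            warnings.append(
                f"already exists for symmetrical relation: {subj} {int_type} {obj} {ref}"
            )
        seen.add(key)
    return warnings
```

A swapped pair is only reported for types in the symmetrical set, so an asymmetrical type recorded in both directions produces no warning.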

I'll check for the binding/with duplicates separately.

(I thought we had another ticket for this but I can't find it)

Original comment by: kimrutherford

pombase-admin commented 9 years ago

Also an interactions file gets created after each load containing only interactions added since the previous release: http://curation.pombase.org/dumps/latest_build/exports/pombase-interactions-since-v49-2015-02-02.gz

Which will help for next time.
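
A rough sketch of the "since the previous release" filter, in Python for illustration only. It assumes each annotation is a dict with an ISO-format `date` field (Canto annotations always have dates attached); the field name, the function name and the choice to include the cutoff day itself are assumptions, not what the export code does.

```python
from datetime import date
from typing import Dict, Iterable, List


def interactions_since(rows: Iterable[Dict[str, str]], cutoff: date) -> List[Dict[str, str]]:
    """Keep rows whose annotation date is on or after the cutoff date."""
    return [row for row in rows if date.fromisoformat(row["date"]) >= cutoff]
```

Calling `interactions_since(rows, date(2015, 2, 2))` would keep only the annotations dated 2015-02-02 or later.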

Original comment by: kimrutherford

pombase-admin commented 9 years ago

Something does seem a little odd here.

PMID:22681890 is Colm Ryan's paper, and we didn't curate that one as far as I know. All of the interactions should be BioGRID-derived. Which means either we are duplicating these annotations somehow, or they are providing them in duplicate.

We can discuss later....

Original comment by: ValWood

pombase-admin commented 9 years ago

There are 4563 annotations in this file, but omitting PMID:22681890 reduces that to 499, so if we find out what is causing this one we'll be 90% of the way there...

Original comment by: ValWood

pombase-admin commented 9 years ago

BioGRID often fix things. Will corrections made subsequently get picked up if we do it this way?

Original comment by: ValWood

pombase-admin commented 9 years ago

PMID:22681890 is Colm Ryan's paper, and we didn't curate that one as far as I know

Yep, all those come from BioGRID. It looks like they are providing duplicates. As a test I searched for SPBC317.01 in the BioGRID data file and got: https://www.dropbox.com/s/516xz0ul2vi29kc/mbx2_interactions.txt?dl=0

In that example each Positive Genetic interaction is listed in both directions. The Negative Genetic interactions are in just one direction. Does that make sense?

Here it is at the source: http://thebiogrid.org/276879/summary/schizosaccharomyces-pombe/mbx2.html

Original comment by: kimrutherford

pombase-admin commented 9 years ago

BioGRID often fix things. Will corrections made subsequently get picked up if we do it this way?

The load script downloads the latest BioGRID release whenever there is a new version.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

E-mailed BioGRID for clarification....

Original comment by: ValWood

pombase-admin commented 9 years ago

I've added a new database check that looks at the interactions.

http://curation.pombase.org/dumps/builds/pombase-build-2015-02-21-v2-l1/logs/log.2015-02-22-20-58-07.chado_checks

Lines like "already exists: ..." are cases where an annotation is duplicated.

Lines like "missing annotation for: ..." are for missing reciprocal annotations.

The missing reciprocals can be ignored for now as I'll be fixing the load code soon to add the reciprocals automatically.
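
As a sketch of what adding the reciprocals at load time could look like (Python for illustration; the tuple layout and the function name are assumptions, and the set of symmetrical types is passed in rather than guessed):

```python
from typing import List, Set, Tuple

# (subject, object, interaction type, reference); layout assumed for illustration
Interaction = Tuple[str, str, str, str]


def add_reciprocals(
    interactions: List[Interaction], symmetrical_types: Set[str]
) -> List[Interaction]:
    """Return the input plus any missing reciprocal for symmetrical types."""
    existing = set(interactions)
    result = list(interactions)
    for subj, obj, int_type, ref in interactions:
        reciprocal = (obj, subj, int_type, ref)
        # Only symmetrical types get a reciprocal, and only if it is not already there
        if int_type in symmetrical_types and reciprocal not in existing:
            result.append(reciprocal)
            existing.add(reciprocal)
    return result
```

Running the result through the same function again adds nothing, so the step is safe to repeat on every load.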

Original comment by: kimrutherford

pombase-admin commented 9 years ago

So in both cases I presume we don't need to do anything, except to filter out "already exists" when we submit to BioGRID?

Although I am now confused because you say "The missing reciprocals can be ignored for now as I'll be fixing the load code soon to add the reciprocals automatically".

I had assumed that we were only adding the reciprocals for GO protein binding as Mark already does the inferences for BioGRID. I am happy for it to be done this way for both though if it is easier/more consistent.

Am I correct that curators don't need to do anything here?

Original comment by: ValWood

pombase-admin commented 9 years ago

I think we need to look at the "already exists" ones because in those cases there are two identical annotations. Usually one is from Canto and one is from BioGRID.

An example is for byr4: http://www.pombase.org/spombe/result/SPAC222.10c#interactionPhysical

We have this annotation twice: forms complex with spg1 GTPase Spg1 Reconstituted Complex Furge KA et al. (1998)

I could change the loader just to ignore the Canto ones and keep the BioGRID ones in Chado. I think that's better than dropping the BioGRID annotations because if you make a new, duplicate annotation in Canto we don't want it in Chado because we don't want to send it to BioGRID when we send them an update. Does that make sense?

I had assumed that we were only adding the reciprocals for GO protein binding as Mark already does the inferences for BioGRID. I am happy for it to be done this way for both though if it is easier/more consistent.

Sorry, I should have made a comment about that. Mark and I had a chat and decided to put the reciprocal for the symmetrical interactions in Chado when loading. That will make it consistent with the GO protein binding case.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

I'm a little confused still (but less so).

There are 425 already exists annotations.

I can see that we could easily make the opposite annotation for a symmetrical evidence code. However, a lot of these are for asymmetrical codes. This implies one of us has curated in the wrong direction?

But I just checked, and in this session the annotations only seem to appear in the correct direction: http://curation.pombase.org/pombe/curs/4650423a1b7a3d16

Original comment by: ValWood

pombase-admin commented 9 years ago

Ah I see this is just a straight duplicate. Otherwise it would be "already exists for symmetrical relation".

Ok this just means we curated it and BioGRID did too. That's a shame, but it won't happen so often with frequent updates and when BioGRID have access to the list of papers curated.

We need to delete these, but I don't really want to lose the community attribution. I wonder if we could somehow merge? I.e. class it as a BioGRID annotation, but if it was created in duplicate by a member of the fission yeast community (essentially annotation confirmed), keep the curator attribution within PomBase so that their name will still be attached in their sessions? (No need to export the attribution.)

?

Original comment by: ValWood

pombase-admin commented 9 years ago

Ah I see this is just a straight duplicate. Otherwise it would be "already exists for symmetrical relation".

Yep! Sorry I wasn't clear about that.

I wonder if we could somehow merge?

We can do that by keeping the annotation source as "BioGRID" (as opposed to "PomBase"), but add the Canto details (date, author and session ID). I've made a ticket about that: https://sourceforge.net/p/pombase/chado/452/
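
To make the merge idea concrete, here is a hedged Python sketch. The `Annotation` class, its field names and `merge_duplicates` are hypothetical and do not reflect the Chado schema; the only thing taken from the discussion is the rule itself: keep the source as "BioGRID" and carry over the Canto date, author and session ID.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    gene_a: str
    gene_b: str
    interaction_type: str
    reference: str
    source: str  # "BioGRID" or "PomBase"
    canto_date: Optional[str] = None
    canto_curator: Optional[str] = None
    canto_session: Optional[str] = None

    def key(self) -> Tuple[str, str, str, str]:
        return (self.gene_a, self.gene_b, self.interaction_type, self.reference)


def merge_duplicates(annotations: List[Annotation]) -> List[Annotation]:
    """Collapse identical annotations, preferring BioGRID as the source."""
    merged: Dict[Tuple[str, str, str, str], Annotation] = {}
    for ann in annotations:
        current = merged.get(ann.key())
        if current is None:
            merged[ann.key()] = ann
        elif current.source == "BioGRID" or ann.source == "BioGRID":
            biogrid, other = (current, ann) if current.source == "BioGRID" else (ann, current)
            # Keep the BioGRID annotation but attach the Canto details from the duplicate
            merged[ann.key()] = replace(
                biogrid,
                canto_date=biogrid.canto_date or other.canto_date,
                canto_curator=biogrid.canto_curator or other.canto_curator,
                canto_session=biogrid.canto_session or other.canto_session,
            )
        # If neither copy is from BioGRID, the first one seen is kept (arbitrary choice)
    return list(merged.values())
```

The Canto fields stay inside the merged record only, so an export to BioGRID can simply leave them out.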

Original comment by: kimrutherford

pombase-admin commented 9 years ago

The reciprocal annotations are now created automatically. I'm running a full load to test.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

Have we heard back from BioGRID about how they handle symmetrical relations?

Original comment by: kimrutherford

pombase-admin commented 9 years ago

Do you mean asymmetric ones (i.e. Colm's paper)? I am still waiting for a reply on that.

v

Original comment by: ValWood

pombase-admin commented 9 years ago

Do you mean asymmetric ones (i.e. Colm's paper)? I am still waiting for a reply on that.

Probably then we should send them what we think is right and they can let us know if it doesn't work for them.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

This is what Val got from Jennifer Rust, which is probably enough to be getting on with:

Hi Val, Yes, we are using a "spoke" model to enter information so we only report interactions in a single direction and more specific guidelines on how we define the bait and the hit in an interaction can be found here: Direction of interactions (Bait/Hit) http://wiki.thebiogrid.org/doku.php/curation_guide:direction_of_interactions.

The annotations are not automatically reversed in our system. For instance if bait x:hit y are shown to interact by Affinity Capture Western and our curators capture that our system will not automatically add a bait y:hit x interaction by Affinity Capture Western. There will only be a single interaction for x and y that the user can see by querying either x or y.

It sounds like there may be a unique issue going on for the specific paper you mentioned PMID:22681890. Usually on datasets that large we talk with the researcher directly and automatically upload the data they provide so hopefully the curator was in touch with Colm Ryan and can give me some more info on why the interactions were added this way. I will look into it and get back to you ASAP.

Original comment by: kimrutherford

pombase-admin commented 9 years ago

We may need to change the file format. From Jennifer Rust:

I am attaching a copy of the file we ask users to fill out when they submit interactions directly to us. This file is formatted for easy upload into our system and if your export script produced a file with this format it might streamline the process of upload so that we could eventually automate it. It is not much different from the files you have sent previously but there are some columns in this file that are not in the data files you sent. For example, there is a phenotype column that must be populated for genetic interactions (we currently use the YPO) although we are working to expand the ontologies we can use.

The file she sent is now in Dropbox:

Dropbox/pombase/Chado/interactions/BioGRID-data-submission-spreadsheet.xls https://www.dropbox.com/s/vs5fapbjndrnfov/BioGRID-data-submission-spreadsheet.xls?dl=0

Original comment by: kimrutherford

pombase-admin commented 9 years ago

We should go ahead with the next exchange with the format you are working on (unless it is very quick to implement).

The phenotype column will be necessarily blank for the foreseeable future. They will need to fill this in....

I envisage that once we have multigene phenotypes up and running, we can somehow add a step to collect the BIOGRID evidence (if it cannot be inferred, from the combination of allele type and phenotype term) and dump the GI input section to reduce duplication.

Original comment by: ValWood

kimrutherford commented 5 years ago

Since we don't make releases any more we'll need to decide when to send updates. The original plan was to send the interactions since the last release, each time we make a release.

ValWood commented 5 years ago

Probably every month or every 2 months would be good longer term.

Antonialock commented 5 years ago

I just received this email from Jennifer at BioGRID:

"Also, I know you already have a request in for a file covering data from April 2017 until now from your developer. I'm not sure what the timeline for that looks like but we are meeting on Thursday Feb 21st and if there is any chance I could have the data file before that meeting it would be a great opportunity for me to keep our programmers primed to get it in quickly and hopefully reduce the chances that urgent projects will pop up and they will have to back burner regular data uploads to handle those."

ValWood commented 5 years ago

Also Li-Lin's help desk ticket (you can pass that to Midori, he has a number of issues, all of which we know about, but one is multi-gene HTP data, we have been punting that)

Antonialock commented 5 years ago

wrt Li-Lin, I was going to suggest he liaise with BioGRID to get the genetic interactions in, and then he can also advise them on the annotations he is not happy with them displaying. I was also going to point him to the spreadsheet FYPO submission format, but am I correct in thinking that it doesn't support double mutants (and beyond)? @mah11

ValWood commented 5 years ago

He is really asking why we don't have a system in place to stop duplications. (I thought this was an issue he had already reported and that BioGRID had been informed.)

Yes, the spreadsheet format does not support double mutants yet... We might need to collect the interactions in BioGRID only and then curate the multi-allele phenotypes later.

kimrutherford commented 5 years ago

I can make a file of interactions since April 2017 but we still need a plan for making a file at each release. I'll try to do that today.

kimrutherford commented 5 years ago

I can make a file of interactions since April 2017 but we still need a plan for making a file at each release.

I've made the file. There are 892 new interactions by PomBase since April 2017: interactions-since-2017-04-01.tsv.gz

Antonia could reply to Jennifer with this file?

Antonialock commented 5 years ago

Cheers Kim, I have passed the link on to Jennifer.

ValWood commented 5 years ago

This ticket is very long. @Antonialock could you check if there is still anything to do in this ticket other than establish regular updates to BioGRID? If so, can we open new tickets for any different tasks?

What is still left to do with the regular releases? I'd like us to send BioGRID an update because the community curated so many genetic interactions recently. I don't want BioGRID to duplicate the effort.

Antonialock commented 5 years ago

I sent the interaction file to Jennifer (interactions-since-2017-04-01.tsv.gz). The only thing outstanding is to establish regular updates, as far as I can see.

ValWood commented 5 years ago

Yes, BioGRID announced inclusion on Twitter yesterday!

ValWood commented 4 years ago

OK, BioGRID requested a new update file. Kim, can you look at this in the next few weeks? We can discuss anything related to this on the next calls. Thanks

Val

ValWood commented 1 year ago

At some point soon I will close this ticket and replace it with an updated ticket to describe what needs to happen wrt our new data. We probably need a meeting with BioGRID about this to find out precisely how they want the additional curation. Do you export a json file for BioGRID?

kimrutherford commented 1 year ago

Do you export a json file for BioGRID?

We export a TSV file with the columns BioGRID likes.
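
For anyone picking this up later, a minimal sketch of writing a gzipped TSV like `interactions-since-2017-04-01.tsv.gz`, in Python for illustration. The column names in `COLUMNS` are placeholders, not the real list (that should come from BioGRID's submission spreadsheet), and `write_interactions_tsv` is a hypothetical helper rather than the actual export code.

```python
import csv
import gzip
from typing import Dict, Iterable, List

# Placeholder column names; the real ones should follow BioGRID's submission spreadsheet
COLUMNS = ["bait", "hit", "experiment_type", "pubmed_id", "phenotype"]


def write_interactions_tsv(
    path: str, rows: Iterable[Dict[str, str]], columns: List[str] = COLUMNS
) -> int:
    """Write rows to a gzipped TSV with a header line; return the row count."""
    count = 0
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=columns,
            delimiter="\t",
            restval="",             # missing columns (e.g. phenotype) are left blank
            extrasaction="ignore",  # fields not in the column list are dropped
            lineterminator="\n",
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Leaving missing columns blank matches the earlier point that the phenotype column will be empty for now and filled in on the BioGRID side.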