pantherdb / fullgo_paint_update

Update of Panther and PAINT DBs with monthly GO release data

Prevent leaf-specific IBA propagation if NOT qualifiers do not match #39

Open dustine32 opened 4 years ago

dustine32 commented 4 years ago

As explained in https://github.com/pantherdb/fullgo_paint_update/issues/30#issuecomment-549941317 I'll need to implement a check in the IBA generation script that blocks an IBA annotation to a leaf if that specific leaf has an experimental annotation with a conflicting qualifier. For now, I'll only check "NOT" vs "no qualifier" conflicts. I believe matching other qualifiers like "contributes_to" is still under discussion.
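As a rough illustration of the check described above (not the actual IBA generation script), the following sketch filters a candidate IBA GAF against a GAF of experimental annotations. The function name `filter_conflicting_ibas`, the two-file layout, the list of evidence codes treated as experimental, and the requirement that the GO term match exactly are all assumptions made for the example; only the "NOT" vs "no qualifier" rule comes from this ticket.

```sh
#!/bin/sh
# Hypothetical helper: filter_conflicting_ibas EXP_GAF IBA_GAF
# Prints the IBA lines that survive the NOT-vs-no-qualifier check.
# GAF columns used: 1=DB, 2=object ID, 4=qualifier, 5=GO ID, 7=evidence code.
filter_conflicting_ibas() {
    awk -F '\t' '
        # Return "NOT" if the pipe-separated qualifier field contains NOT, else "POS".
        function polarity(qual,    n, i, parts) {
            n = split(qual, parts, "[|]")
            for (i = 1; i <= n; i++) if (parts[i] == "NOT") return "NOT"
            return "POS"
        }
        # Header/comment lines: keep those of the IBA file, ignore the others.
        /^!/ { if (FILENAME != ARGV[1]) print; next }
        # First file: remember gene + term + polarity of experimental annotations.
        # The evidence code list is an assumption, not taken from the ticket.
        FILENAME == ARGV[1] {
            if ($7 ~ /^(EXP|IDA|IPI|IMP|IGI|IEP)$/)
                conflict[$1 ":" $2, $5, polarity($4)] = 1
            next
        }
        # Second file: drop an IBA when the same gene has an experimental
        # annotation to the same term with the opposite polarity.
        {
            opposite = (polarity($4) == "NOT") ? "POS" : "NOT"
            if (!(($1 ":" $2, $5, opposite) in conflict)) print
        }
    ' "$1" "$2"
}
```

Because the lookup is keyed on the individual gene, an IBD on an ancestor node would still propagate to every other leaf; only the leaf carrying the conflicting experimental annotation loses its IBA line. Other qualifiers such as "contributes_to" are deliberately ignored in this sketch.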

An example case is shown here:

image

The IBD on PTN000185192 is still valid and can be used to propagate to its other descendant leaf sequences, but the experimental NOT IGI annotation on PomBase:SPAC1B3.15c should block IBA propagation to this leaf.

Related tickets: https://github.com/geneontology/paint/issues/54 https://github.com/geneontology/go-annotation/issues/2378

dustine32 commented 4 years ago

@pgaudet I've implemented the IBA block for PAINT vs. exp NOT qualifier conflicts but have not yet pushed any new IBA files. I did a test run and generated a before/after report tracking IBA count differences.

Would you be able to spot-check this report for any unintended effects? What works for me is plugging the PTHR family and GO term into AmiGO and then looking for the NOT. In the meantime, I'm working on getting the actual list of to-be-dropped IBA lines (there are 269).

dustine32 commented 4 years ago

(Taking notes for myself)

For testing, I generated two sets of IBA GAFs (before and after code change) and ran these commands to get all dropped IBAs:

$ cat 2019-11-20_fullgo_test/IBA_GAFs/* > 2019-11-20_fullgo_test/all_IBAs
$ cat 2019-11-20_fullgo_test/preupdate_data/IBA_GAFs/* > 2019-11-20_fullgo_test/preupdate_data/all_IBAs
$ diff -u 2019-11-20_fullgo_test/preupdate_data/all_IBAs 2019-11-20_fullgo_test/all_IBAs | grep -E "^\-" > 2019-11-20_fullgo_test/dropped_IBAs_raw
$ grep -v "Created on" 2019-11-20_fullgo_test/dropped_IBAs_raw | grep -v "2019-11-20_fullgo_test" | sed 's/^-//' > 2019-11-20_fullgo_test/dropped_IBAs
$ wc -l 2019-11-20_fullgo_test/dropped_IBAs
324

So 324 IBAs appear to have been dropped by this code change. However, this number doesn't line up with the report, which says 269 lines were dropped. Spot-checking some of the lines whose IBD PTNs are not in the report (e.g. PTN001998491), I notice that these lines are present in both the before and after IBA files, with no difference as far as I can tell (I tried several diff options and looked for hidden characters). Guessing diff is playing tricks on me or something.

I can xref the report's IBD nodes to filter out lines that shouldn't be there.
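One possible explanation for lines showing up as dropped while being present in both files is ordering: `diff` compares line sequences, so a line that merely moved is reported once as removed and once as added. A minimal sketch of an order-independent comparison follows, assuming the goal is the set of annotation lines present before but absent after. The function name `dropped_lines` is hypothetical, and whether ordering is the actual cause here is not confirmed in this thread.

```sh
#!/bin/sh
# Hypothetical helper: dropped_lines BEFORE AFTER
# Prints lines that occur in BEFORE but nowhere in AFTER, ignoring line order.
dropped_lines() {
    awk '
        /^!/ { next }                                # skip GAF header/comment lines
        FILENAME == ARGV[1] { after[$0] = 1; next }  # first file read: the AFTER set
        !($0 in after)                               # second file: print lines missing from AFTER
    ' "$2" "$1"
}
```

For example, `dropped_lines 2019-11-20_fullgo_test/preupdate_data/all_IBAs 2019-11-20_fullgo_test/all_IBAs | wc -l` would count the dropped lines without the `grep -v` and `sed` cleanup. Since this is a set comparison, a line that appears twice before and once after is not reported.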

pgaudet commented 4 years ago

Hi @dustine32

Do you mean that this script gets rid of the inferred NOT IBA here (from PTHR13271)?

image

I also checked PTHR10024 - it also seems OK.

Probably the way to be sure is if you exported the GAF for each of the impacted families - is that 'easy'?

Thanks, Pascale

dustine32 commented 4 years ago

@pgaudet Yep, that inferred NOT IBA should be removed by the code change due to its conflict with that positive IDA.

That's a great idea about just getting the GAFs for the impacted families. That might also clear up the weirdness I'm seeing trying to get an accurate diff of dropped lines.

dustine32 commented 4 years ago

@pgaudet Finally, I've got an accurate list of dropped IBAs for you to look at, though I combined your idea of only outputting the impacted families with my earlier diff and grep approach.

Basically, outputting all IBA GAFs for the IBD PTNs in the before/after report and then applying the diff/grep commands above gets me to the expected 269 count. This GAF file is uploaded to the Google Drive for your downloading convenience.
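The step of restricting the IBA GAFs to the IBD PTNs from the report could be sketched as below. This is only an illustration: the function name `filter_by_ibd_ptn` and the one-PTN-per-line list file are hypothetical, and the sketch assumes the IBD node appears in the With/From column (column 8) as `PANTHER:PTN...`, as in the SETD3 line in this comment.

```sh
#!/bin/sh
# Hypothetical helper: filter_by_ibd_ptn PTN_LIST GAF
# PTN_LIST holds one PTN ID per line (e.g. taken from the before/after report).
# Prints GAF lines whose With/From column references one of those PTNs.
filter_by_ibd_ptn() {
    awk -F '\t' '
        # First file: build the set of wanted PANTHER node references.
        FILENAME == ARGV[1] { if ($1 != "") wanted["PANTHER:" $1] = 1; next }
        /^!/ { next }                                # skip GAF header/comment lines
        {
            # With/From is pipe-separated; keep the line if any entry is wanted.
            n = split($8, refs, "[|]")
            for (i = 1; i <= n; i++) if (refs[i] in wanted) { print; break }
        }
    ' "$1" "$2"
}
```

Running this on both the before and after concatenated files, then comparing the two outputs, would limit the comparison to the families named in the report.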

For your PTHR13271 peptidyl-lysine trimethylation (GO:0018023) example, only one IBA was shown as dropped:

UniProtKB       Q86TU7  SETD3           GO:0018023      PMID:21873635   IBA     PANTHER:PTN000998435|ZFIN:ZDB-GENE-030131-9137  P       Histone-lysine N-methyltransferase setd3        UniProtKB:Q86TU7|PTN002491248   protein taxon:9606      20170228        GO_Central

But this one is positive (no NOT qualifier). I actually answered your question earlier without knowing the gene that the IBA in question was for, so... is this (UniProtKB:Q86TU7) your card (gene)?