pantherdb / fullgo_paint_update

Update of Panther and PAINT DBs with monthly GO release data

Report stats comparing PAINT release versions #43

Open dustine32 opened 4 years ago

dustine32 commented 4 years ago

Add some stats to the standard update pipeline reports comparing changes between two versions of the PAINT release (i.e. the IBD file and the set of IBA GAFs). Ideally, the parameters should just be two dates corresponding to before and after releases (e.g. 2020-01-31 and 2020-03-26).

We already have two reports yet to be committed to this repo:

  1. A simple SQL query to count IBDs created between the two parameter dates, split out by curator. Example result of comparing 2020-01-31 vs 2020-03-26:

     | name            | count |
     |-----------------|-------|
     | Pascale Gaudet  | 97    |
     | Huaiyu Mi       | 978   |
     | Marc Feuermann  | 2153  |
     | Michael Kesling | 884   |
     | Total           | 4112  |

  2. A Python script that works only with the contents of our monthly releases posted on our FTP server. It compares the sets of IBDs in the two IBD.gaf files and cross-references them to IBAs through the PANTHER:PTN identifier in each IBA's with/from column.

Further description of the stats the Python script calculates:

  1. Added IBDs - Given two IBD/IBA sets, "before" and "after", find the IBDs in "after" that aren't in "before".
  2. Obsoleted IBDs - Conversely, find the IBDs in "before" that aren't in "after".
  3. Added IBAs - In the "after" set of IBA GAFs, count all IBAs that reference the PTN and term of an IBD in Added IBDs.
  4. Obsoleted IBAs - In the "before" set of IBA GAFs, count all IBAs that reference the PTN and term of an IBD in Obsoleted IBDs.
  5. Net IBA change = Added IBAs - Obsoleted IBAs
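
The five stats above can be sketched roughly as follows. This is not the uncommitted script itself, just a minimal illustration of the set logic. It assumes standard tab-separated GAF columns (GO term in column 5, evidence code in column 7, with/from in column 8) and, as a guess about the IBD.gaf layout, that column 2 holds the PTN identifier of the ancestral node. All function names here are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

IbdKey = Tuple[str, str]  # (PTN id, GO term)


def parse_gaf_rows(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield tab-split columns for each annotation line, skipping '!' comments."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 7:  # need at least up to the with/from column
            yield cols


def ibd_keys(lines: Iterable[str]) -> Set[IbdKey]:
    """Collect (PTN, term) pairs for IBD rows.

    Assumption: the PTN id sits in column 2 of IBD.gaf, optionally
    prefixed with 'PANTHER:'.
    """
    keys = set()
    for cols in parse_gaf_rows(lines):
        if cols[6] == "IBD":
            ptn = cols[1].split(":", 1)[-1]
            keys.add((ptn, cols[4]))
    return keys


def count_ibas(lines: Iterable[str], ibds: Set[IbdKey]) -> int:
    """Count IBA rows whose with/from PTN and GO term match one of the given IBDs."""
    count = 0
    for cols in parse_gaf_rows(lines):
        if cols[6] != "IBA":
            continue
        term = cols[4]
        ptns = [ref.split(":", 1)[1] for ref in cols[7].split("|")
                if ref.startswith("PANTHER:PTN")]
        if any((ptn, term) in ibds for ptn in ptns):
            count += 1
    return count


def compare_releases(before_ibd: Iterable[str], after_ibd: Iterable[str],
                     before_iba: Iterable[str], after_iba: Iterable[str]) -> Dict[str, int]:
    """Compute added/obsoleted IBD and IBA counts between two releases."""
    before, after = ibd_keys(before_ibd), ibd_keys(after_ibd)
    added, obsoleted = after - before, before - after
    added_ibas = count_ibas(after_iba, added)
    obsoleted_ibas = count_ibas(before_iba, obsoleted)
    return {
        "added_ibds": len(added),
        "obsoleted_ibds": len(obsoleted),
        "added_ibas": added_ibas,
        "obsoleted_ibas": obsoleted_ibas,
        "net_iba_change": added_ibas - obsoleted_ibas,
    }
```

Each argument can be any iterable of GAF lines, for example an open file handle per release, or several IBA GAFs chained together with `itertools.chain`.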

When running the script on "before" release 2020-01-31 and "after" release 2020-03-26 I get these numbers:

Added IBDs: 4,062
Obsoleted IBDs: 1,224
Added IBAs: 319,250
Obsoleted IBAs: 71,491
Net IBA change: 247,759

A third report displaying the % change by individual IBA GAF (e.g. paint_mgi, paint_human) as well as overall % change in IBA count will be added.
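
As a rough idea of what that third report could compute, here is a minimal sketch. It assumes the per-GAF IBA counts for each release have already been gathered into dicts keyed by GAF name; the function names and the choice to report `None` for a GAF with no IBAs in "before" are illustrative, not settled.

```python
from typing import Dict, Optional, Tuple


def percent_change(before: int, after: int) -> Optional[float]:
    """Percent change from before to after; None when before is 0 (undefined)."""
    if before == 0:
        return None
    return 100.0 * (after - before) / before


def iba_percent_changes(
    before_counts: Dict[str, int],
    after_counts: Dict[str, int],
) -> Tuple[Dict[str, Optional[float]], Optional[float]]:
    """Return (% change per IBA GAF, overall % change in IBA count)."""
    per_gaf = {}
    # Include GAFs present in only one release; a missing GAF counts as 0 IBAs.
    for name in sorted(set(before_counts) | set(after_counts)):
        per_gaf[name] = percent_change(before_counts.get(name, 0),
                                       after_counts.get(name, 0))
    overall = percent_change(sum(before_counts.values()),
                             sum(after_counts.values()))
    return per_gaf, overall
```

A large swing in a single GAF (or a GAF that appears or disappears entirely) would be the kind of thing worth flagging before release.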

These reports will help us quickly QA each update and catch potential data issues before they make it into the GO release data.

pgaudet commented 4 years ago

Thanks @dustine32, this is great!

Will the stats be here? https://drive.google.com/drive/folders/1MrtIQVmtdfd6gJhVcEfofXrU0IIPnOW7

And I guess this will be a new file in the next release?

dustine32 commented 4 years ago

@pgaudet Right, it'll go in that folder, prefixed with the run date, e.g. 2020-03-26-[report_name]. Since these are sort of global update stats we can probably just call the new report 2020-03-26_update_stats? What do you think?

pgaudet commented 4 years ago

Sounds good.

dustine32 commented 4 years ago

Add "Net IBD change" count