reimandlab / ActiveDriverDB

ActiveDriverDB
GNU Lesser General Public License v2.1

Add PTM sites import pipeline #140

Closed krassowski closed 5 years ago

krassowski commented 6 years ago

This changeset adds a built-in PTM pipeline that draws sites from multiple databases, maps them to alternative isoforms (requiring an exact match of the sequence span of ±7 amino acids around the site) and verifies data consistency.
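The isoform mapping step can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the 1-based positions and the handling of sites close to the sequence ends (the window is simply truncated) are all assumptions made for the example.

```python
def site_window(sequence, position, flank=7):
    """Return the residues within `flank` of a 1-based site position."""
    index = position - 1
    start = max(index - flank, 0)
    return sequence[start:index + flank + 1]


def map_site_to_isoform(source_seq, position, isoform_seq, flank=7):
    """Return 1-based positions in `isoform_seq` whose window matches exactly."""
    index = position - 1
    window = site_window(source_seq, position, flank)
    # Offset of the modified residue inside the (possibly truncated) window.
    offset = index - max(index - flank, 0)
    matches = []
    start = isoform_seq.find(window)
    while start != -1:
        matches.append(start + offset + 1)
        # Continue searching so that repeated windows are all reported.
        start = isoform_seq.find(window, start + 1)
    return matches


# Hypothetical sequences: the isoform lacks the first three residues
# of the source sequence and starts with two different ones instead.
canonical = "MKTAYIAKQRQISFVKSHFSRQ"
isoform = "GGAYIAKQRQISFVKSHFSRQ"
print(map_site_to_isoform(canonical, 13, isoform))
```

A single changed residue anywhere in the window means the site is not mapped, which is what "exact match" implies here; a site may also map to more than one position if the window is repeated in the isoform.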

There are predefined site importers (subclasses of the SiteImporter class) for the four supported databases: PhosphoSitePlus, HPRD, Phospho.ELM and UniProt. The files (or database dumps) for those databases have to be provided by the user.

All site importers are defined as classes in Python modules in the imports.sites package and all come with tests (test_imports.ptm_sites).

To import sites from a given database, use the following command:

./manage.py load protein_related -i ImporterName

The names of the provided importers are: HPRDImporter, OthersUniprotImporter, GlycosylationUniprotImporter, PhosphoSitePlusImporter, PhosphoELMImporter.
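For readers unfamiliar with this kind of design, selecting an importer class by its name could look roughly like the sketch below. Everything in it is hypothetical and chosen only for illustration (`ToySiteImporter`, `ToyTabImporter`, `IMPORTERS`, `run_importer` and the tab-separated input format); it does not reproduce the SiteImporter interface or the manage.py internals of this PR.

```python
from abc import ABC, abstractmethod

IMPORTERS = {}  # hypothetical registry: class name -> importer class


class ToySiteImporter(ABC):
    """Illustrative base class; NOT the project's SiteImporter."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every subclass registers itself under its own class name.
        IMPORTERS[cls.__name__] = cls

    @abstractmethod
    def load_sites(self, lines):
        """Yield (protein, position, residue, ptm_type) tuples."""


class ToyTabImporter(ToySiteImporter):
    """Parses tab-separated lines: protein, position, residue, ptm_type."""

    def load_sites(self, lines):
        for line in lines:
            protein, position, residue, ptm_type = line.rstrip("\n").split("\t")
            yield protein, int(position), residue, ptm_type


def run_importer(name, lines):
    """Look up an importer by class name, similar in spirit to `-i ImporterName`."""
    try:
        importer_cls = IMPORTERS[name]
    except KeyError:
        raise ValueError(f"Unknown importer: {name}") from None
    return list(importer_cls().load_sites(lines))


print(run_importer("ToyTabImporter", ["P04637\t15\tS\tphosphorylation\n"]))
```

Registering subclasses by name keeps the command-line option and the class definitions in sync without a separately maintained list, which is one plausible reason to pass class names on the command line.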

As UniProt divides PTM sites into four categories:

there are two importers for sites retrieved from UniProt (for the second and the last category).

For backward compatibility, it is still possible to import sites from a file generated with the reimandlab/PTMvar scripts using an additional importer, PTMVarImporter. This importer is not exposed, as it should be considered deprecated.

There are several major maintenance-related changes in this pull request, including:

The last group of commits in this PR adds generation of Venn diagrams, which enable a quick assessment of the "added value" of each database (how many entities are unique to one database or shared among several).
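The numbers behind such a diagram are the sizes of the exclusive regions, i.e. the entities found in exactly one given combination of databases. A minimal sketch of that computation, using only the standard library, is shown below; the function name `venn_regions`, the database labels and the example identifiers are made up for illustration and are not taken from this PR, which may compute and draw the diagrams differently.

```python
from itertools import combinations


def venn_regions(sites_by_database):
    """Map each combination of databases to the items found in exactly those.

    `sites_by_database` maps a database name to a set of hashable items.
    """
    names = sorted(sites_by_database)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Items present in every database of this combination...
            inside = set.intersection(*(sites_by_database[n] for n in combo))
            # ...minus items that also occur in any other database.
            others = [sites_by_database[n] for n in names if n not in combo]
            exclusive = inside - set().union(*others)
            if exclusive:
                regions[combo] = exclusive
    return regions


# Hypothetical site identifiers for three databases labelled A, B and C.
example = {
    "A": {"s1", "s2", "s3"},
    "B": {"s2", "s3", "s4"},
    "C": {"s3", "s5"},
}
for combo, items in venn_regions(example).items():
    print(" & ".join(combo), len(items))
```

The single-database regions correspond to the unique contribution of each database, and the multi-database regions to the shared entities.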