Resolving overlaps between domain mappings

jsaquing commented 3 years ago

The input files we're currently using to build the domain database often have multiple overlapping domains on a single protein isoform. We'd like to require that each residue in a protein be mapped to at most one domain, so we need to do some preprocessing on these files before we give them to Biosurfer.

Current low-effort idea:

1. Given all domain names, discard any name that has another name as a prefix. (Ex: discard zf-C2H2_met in favor of zf-C2H2.)
2. Build a graph of all remaining domain mappings, where each edge represents an overlap between two mappings on the same isoform.
3. For each connected component of the graph, keep the mapping with the best (lowest) e-value and discard the rest.
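
A rough sketch of the idea above, in case it helps the discussion. Everything here is an assumption for illustration: the `Mapping` fields, the function names, and the 1-based inclusive coordinates are hypothetical and not taken from Biosurfer or the input files. Instead of building the graph explicitly, it sweeps each isoform's mappings in start order; for intervals on a line, each maximal run of chained overlaps is exactly one connected component of the overlap graph.

```python
from collections import defaultdict
from typing import NamedTuple


class Mapping(NamedTuple):
    """Hypothetical record for one domain mapping (not Biosurfer's schema)."""
    isoform: str   # isoform identifier
    domain: str    # domain name, e.g. "zf-C2H2"
    start: int     # first residue, assumed 1-based inclusive
    end: int       # last residue, inclusive
    evalue: float  # lower is better


def drop_prefixed_names(names):
    """Step 1: discard names that have another name as a proper prefix."""
    names = set(names)
    return {n for n in names
            if not any(n != m and n.startswith(m) for m in names)}


def resolve_overlaps(mappings):
    """Steps 2-3: keep the lowest e-value mapping per overlap component."""
    mappings = list(mappings)
    kept_names = drop_prefixed_names(m.domain for m in mappings)

    by_isoform = defaultdict(list)
    for m in mappings:
        if m.domain in kept_names:
            by_isoform[m.isoform].append(m)

    resolved = []
    for group in by_isoform.values():
        group.sort(key=lambda m: (m.start, m.end))
        component, reach = [], -1  # reach = furthest end seen in this component
        for m in group:
            if component and m.start > reach:
                # Gap before this mapping: the current component is complete.
                resolved.append(min(component, key=lambda x: x.evalue))
                component, reach = [], -1
            component.append(m)
            reach = max(reach, m.end)
        if component:
            resolved.append(min(component, key=lambda x: x.evalue))
    return resolved


example = [
    Mapping("ENST1", "zf-C2H2", 10, 30, 1e-5),
    Mapping("ENST1", "zf-C2H2_met", 12, 32, 1e-9),  # dropped in step 1
    Mapping("ENST1", "DomA", 25, 50, 1e-3),         # overlaps zf-C2H2
    Mapping("ENST1", "DomB", 60, 80, 1e-2),         # no overlap, kept
    Mapping("ENST2", "DomA", 25, 50, 1e-3),         # other isoform, kept
]
result = resolve_overlaps(example)
```

One thing the sketch makes visible: a component can chain through mappings that don't directly overlap each other (A overlaps B, B overlaps C, A and C are disjoint), and step 3 as written still keeps only one of them.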

jsaquing commented 3 years ago

This may not be as much of an issue if we're able to get SUPERFAMILY mappings as per Dr. Korkin's suggestion.

jsaquing commented 3 years ago

Ensembl provides Pfam mappings for GRCh38 that do not appear to overlap.

sheynkman-lab / biosurfer

Resolving overlaps between domain mappings #86