The input files we're currently using to build the domain database often have multiple overlapping domains on a single protein isoform. We'd like to require that each residue in a protein be mapped to a unique domain (if any), so we need to do some preprocessing on these files before we give them to Biosurfer.
Current low-effort idea:
Given all domain names, discard any name that has another name as a proper prefix. (Ex: discard zf-C2H2_met in favor of zf-C2H2.)
Build a graph of all domain mappings, where each edge represents an overlap between two mappings on the same isoform.
For each connected component of the graph, keep the mapping with the best (lowest) e-value and discard the rest.
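
The three steps above could be sketched roughly as follows. This is only an illustration, not Biosurfer code: the `Mapping` record, its field names, the inclusive start/end coordinates, and the helper names `drop_prefixed_names` and `resolve_overlaps` are all assumptions made for the example. It also assumes that mappings whose names are discarded in the first step are dropped outright. Instead of building an explicit graph, it sweeps over the mappings sorted by start position; for intervals on a single isoform this should yield the same groups as the connected components of the overlap graph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record for one domain hit on one isoform."""
    isoform: str
    domain: str
    start: int      # first residue covered (assumed inclusive)
    end: int        # last residue covered (assumed inclusive)
    evalue: float   # lower is better


def drop_prefixed_names(names):
    """Keep only names that do not have another name as a proper prefix."""
    names = set(names)
    return {
        n for n in names
        if not any(n != other and n.startswith(other) for other in names)
    }


def resolve_overlaps(mappings):
    """Return one mapping (lowest e-value) per group of overlapping mappings."""
    mappings = list(mappings)
    kept_names = drop_prefixed_names(m.domain for m in mappings)

    # Overlaps only matter between mappings on the same isoform.
    by_isoform = defaultdict(list)
    for m in mappings:
        if m.domain in kept_names:
            by_isoform[m.isoform].append(m)

    def best(component):
        return min(component, key=lambda m: m.evalue)

    result = []
    for group in by_isoform.values():
        group.sort(key=lambda m: m.start)
        component, reach = [], None  # reach = furthest end seen in component
        for m in group:
            if component and m.start <= reach:
                # Overlaps something already in the current component.
                component.append(m)
                reach = max(reach, m.end)
            else:
                # Gap found: close the previous component, start a new one.
                if component:
                    result.append(best(component))
                component, reach = [m], m.end
        if component:
            result.append(best(component))
    return result


hits = [
    Mapping("iso1", "zf-C2H2", 10, 30, 1e-5),
    Mapping("iso1", "zf-C2H2_met", 12, 28, 1e-9),  # dropped by the name filter
    Mapping("iso1", "KRAB", 25, 60, 1e-8),
    Mapping("iso1", "SCAN", 55, 90, 1e-3),
    Mapping("iso1", "SCAN", 100, 120, 1e-2),
    Mapping("iso2", "zf-C2H2", 10, 30, 1e-4),
]
for m in resolve_overlaps(hits):
    print(m.isoform, m.domain, m.start, m.end, m.evalue)
```

One consequence worth noting: because components are transitive, a chain of overlaps can eliminate mappings that never overlap the winner directly. In the example, the zf-C2H2 hit at 10-30 and the SCAN hit at 55-90 do not overlap each other, but both are discarded because each overlaps the KRAB hit.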