iVis-at-Bilkent / newt

A web application to visualize and edit pathway models
http://newteditor.org
GNU Lesser General Public License v3.0

Remove redundant processes from query results #722

Closed: ugurdogrusoz closed this 2 days ago

ugurdogrusoz commented 4 days ago

Since Pathway Commons is a database with contents from several prominent pathway databases, there are often redundancies in the data. Hence, we would like to give the user a chance not to see these. Let's add a new option Remove Redundant Processes in query dialogs, which is true by default. When true, we would find and delete redundant processes from the query result before laying it out and presenting it to the user. Examples from the 1-neighborhood of ANK1, four instances of the same process:

[screenshot: four instances of the same process]

Here are the rules for deciding whether or not two processes are redundant:

This should be done before removal of the disconnected nodes if enabled.
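The intended ordering can be sketched as follows (all function names here are hypothetical, not the actual Newt code; stub steps simply record their call order): redundancy removal runs first, so nodes orphaned by deleting a redundant process are then cleaned up by the existing disconnected-node option, and layout only sees the surviving graph.

```javascript
// Stub steps that record the order in which they run (for illustration only).
const callOrder = [];
const removeRedundantProcesses = () => callOrder.push("removeRedundantProcesses");
const removeDisconnectedNodes = () => callOrder.push("removeDisconnectedNodes");
const applyLayout = () => callOrder.push("applyLayout");

// Hypothetical post-processing pipeline for a query result:
// redundant processes go first, disconnected nodes second, layout last.
function postProcessQueryResult(options) {
  if (options.removeRedundantProcesses) removeRedundantProcesses();
  if (options.removeDisconnectedNodes) removeDisconnectedNodes();
  applyLayout();
}

postProcessQueryResult({ removeRedundantProcesses: true, removeDisconnectedNodes: true });
// callOrder: ["removeRedundantProcesses", "removeDisconnectedNodes", "applyLayout"]
```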

umut-er commented 3 days ago

This is not fully complete, but here is the current version working on the same query:

[image: current result on the same query]

It removes a significant number of processes with very basic structure, which can be seen on the left side of this picture (the neighborhood query is the only query type that seems to generate a huge number of those). As a result, the structure of the remaining graph looks a little odd. Do we perform layouts on these queries? Maybe we should do that after the deletions.

PS. The changes are now deployed on the internal server. Please check them. Here are the things that are not done yet:

  1. Paths by URI query currently has no access to this functionality.
  2. This might delete more nodes than it should because it doesn't do very detailed equality checking (it only compares labels). The biggest problem is probably that the current algorithm doesn't check the parents.

umut-er commented 3 days ago

So, here is what I decided to do. For every process, I calculate a string that uniquely determines its neighborhood. Starting with an empty hashmap shared across all processes, here is what I do for every process on the map:

  1. Arrange all the nodes that connect to the process in lexicographical order (this is the most expensive operation in this list).
  2. Start with descriptorString = "". Iterate over the array constructed in step 1, and for each node append its label and the class of the edge connecting it to the process to the descriptor string. At the end, append the class of the process.
  3. Check if this string exists in the hashmap. If it doesn't, add it. If it does, add this process to the deletion list.
  4. After all processes are handled, remove all the items in the deletion list.

Let me give an example. Here is a sample process:

[image: example process]

Arranging the nodes in lexicographical order we get `["AB", "BC", "CD"]`. Iterating over this, we first append `ABconsumption`, then `BCinhibition`, then `CDproduction` to the `descriptorString`. Appending the process class at the end, we get `descriptorString = "ABconsumptionBCinhibitionCDproductionprocess"` as the final string, which we check for in the hashmap.

Let me know if this makes sense / is a good approach. There are also a couple of things that I will address. First of all, this obviously discards parent information; I will add that to the descriptorString as well. There is also an edge case. Consider this:

[image: edge case with two edges between the same node and the process]

In the code, the sort algorithm doesn't distinguish between the edges if the nodes are the same. So, both `ABconsumptionABinhibitionCDproductionprocess` and `ABinhibitionABconsumptionCDproductionprocess` are valid descriptors for this neighborhood. I will fix this by also including the edge class as a tie-breaker in the sort algorithm.
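The planned tie-breaker could look like the following standalone sketch (using plain objects rather than the Cytoscape collections in the actual code; `compareNeighbors` and the object shape are illustrative): when node labels are equal, the edge class decides the order, so the edge case above always produces a single canonical descriptor.

```javascript
// Sort neighbors by node label first; break ties between identical labels
// using the edge class, so the ordering is fully deterministic.
function compareNeighbors(first, second) {
  const labelOrder = first.nodeLabel.localeCompare(second.nodeLabel);
  if (labelOrder !== 0) return labelOrder;
  return first.edgeClass.localeCompare(second.edgeClass);
}

const neighbors = [
  { nodeLabel: "AB", edgeClass: "inhibition" },
  { nodeLabel: "CD", edgeClass: "production" },
  { nodeLabel: "AB", edgeClass: "consumption" },
];
neighbors.sort(compareNeighbors);

// The two "AB" entries now sort deterministically (consumption before
// inhibition), so only one descriptor is possible for this neighborhood.
const descriptor =
  neighbors.map((n) => n.nodeLabel + n.edgeClass).join("") + "process";
// descriptor === "ABconsumptionABinhibitionCDproductionprocess"
```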

PS. The code for this is on the unstable branch but not yet on the internal server. Please rebuild if you want to try these changes on the internal server. Also, I am including the code for the described algorithm below:

appUtilities.removeDuplicateProcessesAfterQuery = function() {
  var cy = appUtilities.getActiveCy();
  var chiseInstance = appUtilities.getActiveChiseInstance();

  // Collect all process-like nodes. Each comma-separated selector is prefixed
  // with 'node' so that non-node elements with matching class data are excluded.
  var processes = cy.filter('node[class="process"], node[class="omitted process"], node[class="uncertain process"], node[class="association"], node[class="dissociation"]');
  let processMap = new Map();
  var deletion = cy.collection();
  processes.forEach( (process) => {
    let collectionArray = [];
    let neighborhoodDescriptorString = "";
    var edges = process.connectedEdges();
    // Pair each neighbor node with the edge connecting it to the process.
    edges.forEach( (edge) => {
      var node = edge.connectedNodes().difference(process);
      collectionArray.push(node.union(edge));
    })
    // Sort neighbors lexicographically by node label (falling back to class).
    // Note: ties between identical labels are not yet broken by edge class.
    collectionArray.sort( (first, second) => {
      var firstLabel = first.filter("node").data("label") || first.filter("node").data("class");
      var secondLabel = second.filter("node").data("label") || second.filter("node").data("class");
      return firstLabel.localeCompare(secondLabel);
    });

    // Build the descriptor: node label + edge class per neighbor, then the process class.
    collectionArray.forEach( (item) => {
      neighborhoodDescriptorString += (item.filter("node").data("label") || item.filter("node").data("class"));
      neighborhoodDescriptorString += (item.filter("edge").data("label") || item.filter("edge").data("class"));
    })
    neighborhoodDescriptorString += process.data("class");
    // The first process with a given descriptor is kept; later ones are deleted.
    if(processMap.has(neighborhoodDescriptorString)){
      deletion.merge(process);
    }
    else{
      processMap.set(neighborhoodDescriptorString, true);
    }
  });

  chiseInstance.deleteElesSimple(deletion);
}
ugurdogrusoz commented 3 days ago

This is along the lines I was thinking. I think coding each process can be done more efficiently by simple mappings (e.g. 1 for consumption edges, 2 for production edges, etc., or 1 for macromolecules, 2 for simple chemicals, etc.). A process can be defined by its type, followed by its substrates (in lexicographic order), followed by its products, followed by its effectors.
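The suggestion above could be sketched like this (a minimal illustration, not the shipped code; the code tables, object shape, and `encodeProcess` are all assumptions): integer codes stand in for edge and node classes, and the key is built as type | substrates | products | effectors, with each group sorted internally.

```javascript
// Illustrative mappings (assumed, not exhaustive): small integer codes
// for edge classes and node classes instead of full class strings.
const NODE_CODE = { macromolecule: 1, "simple chemical": 2, complex: 3 };

// Encode a process as type|substrates|products|effectors, each group sorted.
// neighbors: [{ nodeClass, nodeLabel, edgeClass }] (hypothetical shape).
function encodeProcess(processClass, neighbors) {
  const group = (edgeClass) =>
    neighbors
      .filter((n) => n.edgeClass === edgeClass)
      .map((n) => NODE_CODE[n.nodeClass] + ":" + n.nodeLabel)
      .sort()
      .join(",");
  return [
    processClass,
    group("consumption"),   // substrates
    group("production"),    // products
    group("inhibition"),    // effectors (inhibitors)
    group("stimulation"),   // effectors (stimulators)
  ].join("|");
}

// Two processes with the same neighborhood, listed in different order,
// produce identical keys and are therefore recognized as redundant.
const a = encodeProcess("process", [
  { nodeClass: "macromolecule", nodeLabel: "AB", edgeClass: "consumption" },
  { nodeClass: "macromolecule", nodeLabel: "CD", edgeClass: "production" },
]);
const b = encodeProcess("process", [
  { nodeClass: "macromolecule", nodeLabel: "CD", edgeClass: "production" },
  { nodeClass: "macromolecule", nodeLabel: "AB", edgeClass: "consumption" },
]);
// a === b, and a === "process|1:AB|1:CD||"
```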

umut-er commented 2 days ago

This should be ready for review. It is on the internal server.

I have done some tests, removing the time limit on queries (30 seconds) to get the 1-neighborhood of ANK1, ANK2 and ANK3, which returned in around 60 seconds (note that this is only the API response time and doesn't include the time to render the graph). It has ~650 processes. The redundancy removal took around 650 milliseconds with the current algorithm, which should be adequate. Note that our software can't even properly edit a graph of this size (it slows down significantly).

ugurdogrusoz commented 2 days ago

Well done @umut-er !