Closed: ugurdogrusoz closed this issue 2 days ago
This is not fully complete, but here is the current version working on the same query:
It removes a significant number of processes that have a very basic structure, which can be seen on the left side of this picture (the only query type that seems to generate a huge number of those is the neighborhood query). As a result, the structure of the remaining graph looks a little odd. Do we perform layouts on these queries? Maybe we should run one after the deletions.
PS. The changes are now deployed on the internal server. Please check them. Here are the things that are not done yet:
So, here is what I decided to do: for every process on the map, I calculate a string that uniquely determines the neighborhood of that process. Here is how:
Let me do an example. Here is an example process:
Arranging the nodes in lexicographical order we get ["AB", "BC", "CD"]. Iterating over this, we first append ABconsumption, then BCinhibition, then CDproduction to the descriptorString. We end up with descriptorString = ABconsumptionBCinhibitionCDproductionprocess as the final string. We then check for this string in the hashmap.
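A minimal standalone sketch of this descriptor computation for the example above, using plain objects for the neighborhood rather than actual Cytoscape.js collections:

```javascript
// Hypothetical sketch: each entry pairs a neighbor node's label with the
// class of the edge connecting it to the process.
const pairs = [
  { nodeLabel: "CD", edgeClass: "production" },
  { nodeLabel: "AB", edgeClass: "consumption" },
  { nodeLabel: "BC", edgeClass: "inhibition" },
];

// Sort lexicographically by the neighbor node's label.
pairs.sort((a, b) => a.nodeLabel.localeCompare(b.nodeLabel));

// Append "<nodeLabel><edgeClass>" for each neighbor, then the process class.
let descriptor = "";
for (const p of pairs) {
  descriptor += p.nodeLabel + p.edgeClass;
}
descriptor += "process";

console.log(descriptor); // "ABconsumptionBCinhibitionCDproductionprocess"
```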
Let me know if this makes sense / is a good approach. There are also a couple of things that I still need to address. First of all, this obviously discards parent information; I will add that to the descriptorString as well. There is also an edge case. Consider this:
In the code, the sort algorithm doesn't distinguish between edges if the nodes are the same. So both ABconsumptionABinhibitionCDproductionprocess and ABinhibitionABconsumptionCDproductionprocess are valid descriptors for this neighborhood. I will fix this by also including the edge class as a tie-breaker in the sort algorithm.
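The described tie-breaker could look something like the following sketch (plain objects again, not the actual code): when two neighbors share a label, the edge class decides the order, so the descriptor becomes deterministic.

```javascript
// Hypothetical comparator: label first, edge class as a tie-breaker.
function compareNeighbors(a, b) {
  const byLabel = a.nodeLabel.localeCompare(b.nodeLabel);
  if (byLabel !== 0) return byLabel;
  return a.edgeClass.localeCompare(b.edgeClass);
}

const pairs = [
  { nodeLabel: "AB", edgeClass: "inhibition" },
  { nodeLabel: "AB", edgeClass: "consumption" },
  { nodeLabel: "CD", edgeClass: "production" },
];
pairs.sort(compareNeighbors);

// "consumption" sorts before "inhibition", so either input order of the
// two AB edges now yields the same descriptor.
const descriptor =
  pairs.map((p) => p.nodeLabel + p.edgeClass).join("") + "process";
console.log(descriptor); // "ABconsumptionABinhibitionCDproductionprocess"
```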
PS. The code for this is on the unstable branch but not yet on the internal server. Please rebuild if you want to try these changes on the internal server. Also, I am including the code for the described algorithm below:
appUtilities.removeDuplicateProcessesAfterQuery = function() {
    var cy = appUtilities.getActiveCy();
    var chiseInstance = appUtilities.getActiveChiseInstance();

    // Restrict every comma-separated selector to nodes; otherwise the
    // class-only selectors would also be applied to non-node elements.
    var processes = cy.filter('node[class="process"], node[class="omitted process"], node[class="uncertain process"], node[class="association"], node[class="dissociation"]');

    let processMap = new Map();
    var deletion = cy.collection();

    processes.forEach( (process) => {
        let collectionArray = [];
        let neighborhoodDescriptorString = "";

        // Pair each connected edge with the neighbor node on its other end.
        var edges = process.connectedEdges();
        edges.forEach( (edge) => {
            var node = edge.connectedNodes().difference(process);
            collectionArray.push(node.union(edge));
        });

        // Sort neighbors lexicographically by label, falling back to class.
        // Known limitation: no tie-breaker yet when two neighbors share a label.
        collectionArray.sort( (first, second) => {
            var firstLabel = first.filter("node").data("label") || first.filter("node").data("class");
            var secondLabel = second.filter("node").data("label") || second.filter("node").data("class");
            return firstLabel.localeCompare(secondLabel);
        });

        // Build the descriptor: "<nodeLabel><edgeClass>" per neighbor,
        // then the class of the process itself.
        collectionArray.forEach( (item) => {
            neighborhoodDescriptorString += (item.filter("node").data("label") || item.filter("node").data("class"));
            neighborhoodDescriptorString += (item.filter("edge").data("label") || item.filter("edge").data("class"));
        });
        neighborhoodDescriptorString += process.data("class");

        // The first process with a given descriptor is kept; later ones
        // are marked for deletion.
        if (processMap.has(neighborhoodDescriptorString)) {
            deletion.merge(process);
        }
        else {
            processMap.set(neighborhoodDescriptorString, true);
        }
    });

    chiseInstance.deleteElesSimple(deletion);
};
This is along the lines of what I was thinking. Coding each process can, I think, be done more efficiently with simple mappings (e.g. 1 for consumption edges, 2 for production edges, etc., or 1 for macromolecules, 2 for simple chemicals, etc.). A process can be defined by its type, followed by its substrates (in lexicographic order), followed by its products, followed by its effectors.
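A hypothetical sketch of that encoding scheme: a process is keyed by its type, then its sorted substrates, then its products, then its effectors. The function name and separators here are illustrative, not from the actual code.

```javascript
// Encode a process as "type|substrates|products|effectors", with each
// group sorted lexicographically so input order does not matter.
function encodeProcess(type, substrates, products, effectors) {
  var sorted = function(arr) { return arr.slice().sort().join(","); };
  return [type, sorted(substrates), sorted(products), sorted(effectors)].join("|");
}

// Two copies of the same process encode identically regardless of order.
var a = encodeProcess("process", ["BC", "AB"], ["CD"], []);
var b = encodeProcess("process", ["AB", "BC"], ["CD"], []);
console.log(a); // "process|AB,BC|CD|"
console.log(a === b); // true
```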
This should be ready for review. It is on the internal server.
I have done some tests. After removing the 30-second time limit on queries, the 1-neighborhood of ANK1 ANK2 ANK3 returned in around 60 seconds (note that this is only the API response time and doesn't include rendering the graph). It has ~650 processes. The redundancy removal took roughly 650 milliseconds with the current algorithm, which should be adequate. Note that our software can't even properly edit a graph of this size (it slows down significantly).
Well done @umut-er!
Since Pathway Commons is a database with contents from several prominent pathway databases, there are often redundancies in the data. Hence, we would like to give the user a chance to not see these. Let's add a new option, Remove Redundant Processes, in query dialogs, which is true by default. When true, we would find and delete redundant processes from the query result before laying it out and presenting it to the user. Examples from the 1-neighborhood of ANK1, four instances of the same process:

Here are the rules for deciding whether or not two processes are redundant:
This should be done before the removal of disconnected nodes, if that option is enabled.