rationale for the max connected component and effects on embedding and prediction tasks

realmarcin commented 4 years ago

@pnrobinson @vidarmehr we are curious about the rationale for selecting the largest connected component of the graph as input for the random walk. The BioNEV paper does not seem to do that. I believe that picking the largest component dropped about 5k genes, so quite a few. Presumably this drops the less connected portions of the graph, but those are also harder for link prediction because they are sparser. There may also be an effect on the negative sampling of edges, because the edges in the smaller components will have been dropped along with them.

Along these lines, we are thinking of metrics to apply for the various steps in the pipeline so we can better track the data flow and effects.
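As one possible starting point for such metrics, here is a minimal sketch (plain Python, no graph library) that reports how many components an edge list has and how many nodes and edges are lost when only the largest component is kept. The function names `connected_components` and `component_report` and the toy edge list are assumptions for illustration, not code from the N2V repository. Nodes without any edge do not appear in an edge list, so they are not counted here.

```python
from collections import Counter, deque


def connected_components(edges):
    """Return a list of node sets, one per connected component, largest first."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)

    components.sort(key=len, reverse=True)
    return components


def component_report(edges):
    """Summarize what is lost by keeping only the largest component."""
    components = connected_components(edges)
    largest = components[0] if components else set()
    nodes_total = sum(len(c) for c in components)
    edges_kept = sum(1 for u, v in edges if u in largest and v in largest)
    return {
        "n_components": len(components),
        "size_histogram": dict(Counter(len(c) for c in components)),
        "nodes_total": nodes_total,
        "nodes_dropped": nodes_total - len(largest),
        "edges_total": len(edges),
        "edges_dropped": len(edges) - edges_kept,
    }


# Toy graph: one triangle plus two isolated pairs.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e"), ("f", "g")]
report = component_report(toy_edges)
```

Running the same report before and after each pipeline step (filtering, train/test split, negative sampling) would show where nodes and edges disappear.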

vidarmehr commented 4 years ago

@realmarcin Hi Marcin, here is a paragraph from the node2vec paper:

In link prediction, we are given a network with a certain fraction of edges removed, and we would like to predict these missing edges. We generate the labeled dataset of edges as follows: To obtain positive examples, we remove 50% of edges chosen randomly from the network while ensuring that the residual network obtained after the edge removals is connected, and to generate negative examples, we randomly sample an equal number of node pairs from the network which have no edge connecting them.

So, that is why I chose the largest component of the graph. When I extracted the protein-protein interactions, there were almost 200 components, the smallest of which had size 2, i.e. a subgraph with 2 nodes.
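The quoted procedure could be sketched roughly as follows. The node2vec paper does not say how connectivity is enforced, so the approach here is an assumption: first grow a random spanning forest with union-find (those edges must stay), then remove test edges only from the remaining ones. This keeps every component of the input connected, but it is not the same as removing edges uniformly at random. The names `split_edges_keep_connected` and `sample_negative_edges` are hypothetical, and the graph is assumed to be undirected with sortable node labels.

```python
import random


def split_edges_keep_connected(edges, test_fraction=0.5, seed=0):
    """Split edges into (train, test) without disconnecting any component."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    train, removable = [], []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            train.append((u, v))      # spanning-forest edge: must be kept
        else:
            removable.append((u, v))  # closes a cycle: safe to remove

    # The target may be unreachable if the graph has too few cycle edges.
    n_test = min(int(test_fraction * len(shuffled)), len(removable))
    test = removable[:n_test]
    train.extend(removable[n_test:])
    return train, test


def sample_negative_edges(nodes, edges, n_samples, seed=0):
    """Sample node pairs that have no edge between them."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    existing = {frozenset(e) for e in edges}
    n_real = sum(1 for e in existing if len(e) == 2)  # ignore self-loops
    if n_samples > len(nodes) * (len(nodes) - 1) // 2 - n_real:
        raise ValueError("not enough non-adjacent node pairs")

    negatives = set()
    while len(negatives) < n_samples:
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in existing:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


# Toy connected graph: a 6-node ring plus four chords (10 edges).
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
             (0, 2), (0, 3), (1, 4), (2, 5)]
train_edges, test_edges = split_edges_keep_connected(toy_edges)
negative_edges = sample_negative_edges(range(6), toy_edges, n_samples=3)
```

Note that negatives sampled from a graph with several components can pair nodes from different components, which is one way the choice of input graph affects the negative set.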

pnrobinson commented 4 years ago

It would be interesting to see experimentally whether we can do link prediction on the entire graph and then rank the predictions according to whether they come from the larger or smaller components of the graph. I would say we should try both!
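One way to set up that comparison is sketched below: scored predictions are grouped by whether both endpoints lie in the largest component, both lie in smaller components, or the pair spans two groups, and each group is ranked by score. The function name `group_predictions_by_component`, the group labels, and the scores in the example are all made up for illustration; they are not results from this project.

```python
from collections import defaultdict


def group_predictions_by_component(predictions, components):
    """Group (u, v, score) predictions by component membership, ranked by score.

    `components` is a list of node sets, one per connected component.
    """
    largest = max(components, key=len)
    groups = defaultdict(list)
    for u, v, score in predictions:
        in_largest = (u in largest, v in largest)
        if all(in_largest):
            key = "largest"   # both endpoints in the largest component
        elif not any(in_largest):
            key = "smaller"   # both endpoints in smaller components
        else:
            key = "cross"     # one endpoint in the largest component only
        groups[key].append((u, v, score))

    # Rank each group from highest to lowest score.
    return {
        key: sorted(rows, key=lambda row: row[2], reverse=True)
        for key, rows in groups.items()
    }


# Hypothetical components and prediction scores, for illustration only.
toy_components = [{"a", "b", "c"}, {"d", "e"}]
toy_predictions = [
    ("a", "c", 0.90),
    ("d", "e", 0.40),
    ("a", "b", 0.95),
    ("c", "d", 0.20),
]
ranked = group_predictions_by_component(toy_predictions, toy_components)
```

Evaluation metrics could then be computed per group to see whether the smaller, sparser components are in fact harder to predict.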

Peter Robinson, Professor and Donald A. Roux Chair, Genomics and Computational Biology, The Jackson Laboratory for Genomic Medicine | https://robinsongroup.github.io/

From: marcin joachimiak | Sent: Thursday, February 27, 2020 4:03 PM | To: monarch-initiative/N2V | Subject: rationale for the max connected component and effects on embedding and prediction tasks (#91)


vidarmehr commented 4 years ago

@pnrobinson Yes. I agree, Peter! I will run link prediction on the entire graph, too.

vidarmehr commented 4 years ago

I have also created training and test files for the entire graph, so I am closing this issue.

monarch-initiative / embiggen

rationale for the max connected component and effects on embedding and prediction tasks #91