Hi,
I think there are two problems with the execution: 1) For some of our experiments we have hard-coded the region into the Lambda client, which you can see here. Change this to the region you are using; otherwise it will try to launch lambdas in the wrong region (a brief sketch of the idea follows below).
2) It also seems that there is some problem with the graph server finding your graph files, as you can see from the output. Make sure that the directory structure is exactly as described in section 3 of the wiki, as the path is currently hard-coded. Also make sure that permissions on the directory are set correctly. Hopefully a future update can make the path less strict.
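For illustration, pinning a Lambda client to an explicit region looks like the sketch below. This is not Dorylus's actual code (which uses the AWS C++ SDK); it is a minimal Python/boto3 example of the same idea, and the function name gcn and region are assumptions taken from this thread:

```python
# Illustrative sketch only (boto3, not Dorylus's C++ SDK client):
# if the client's region is hard-coded, invocations go to that region's
# Lambda endpoint regardless of where your function actually lives.
import boto3

# Replace "us-east-1" with the region where your "gcn" function is deployed.
client = boto3.client("lambda", region_name="us-east-1")
resp = client.invoke(FunctionName="gcn", InvocationType="Event", Payload=b"{}")
print(resp["StatusCode"])  # 202 means the async invocation was accepted
```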
Hope this helps, and good luck!
Thank you very much for your reply! But some problems still exist.
I followed your suggestions and re-ran run/run-onnode, and it no longer reports "Can not open file..." errors.
However, it now stops shortly after logging "Sync Epoch 1 starts...", reporting "Execution fails (at least on this node), check the log file".
The log file does not record any errors. The output file is empty (although the file is created on the weight server).
Lambda still does not have logs in CloudWatch.
Running GRAPH servers with: [ MARK # 35 ]...
./build/graphserver --datasetdir /home/ubuntu/filepool/simplegraph/parts_2/ --featuresfile /home/ubuntu/filepool/simplegraph/features.bsnap --labelsfile /home/ubuntu/filepool/simplegraph/labels.bsnap --dshmachinesfile /home/ubuntu/dshmachines --layerfile /home/ubuntu/simplegraph.config --pripfile /home/ubuntu/myprip --dataserverport 55431 --weightserverport 65433 --wserveripfile /home/ubuntu/wserverip --undirected 0 --tmpdir=/home/ubuntu/tmpfiles --cthreads 8 --dthreads 2 --dataport 5000 --ctrlport 7000 --nodeport 6000 --numlambdas 1 --numEpoch 10 --validationFrequency 1 --MODE 2 --pipeline 0 --staleness 4294967295 --gnn GCN --preprocess 0 --timeout_ratio 5
[ Node 404 ] Engine starts initialization...
[ Node 404 ] Parsed configuration: dThreads = 2, cThreads = 8, datasetDir = /home/ubuntu/filepool/simplegraph/parts_2/, featuresFile = /home/ubuntu/filepool/simplegraph/features.bsnap, dshMachinesFile = /home/ubuntu/dshmachines, myPrIpFile = /home/ubuntu/myprip, undirected = false, data port set -> 5000, control port set -> 7000, node port set -> 6000
[ Node 404 ] NodeManager starts initialization...
[ Node 1 ] Private IP: 172.31.73.108
[ Node 0 ] NodeManager initialization complete.
[ Node 0 ] CommManager starts initialization...
[ Node 1 ] NodeManager initialization complete.
[ Node 0 ] CommManager initialization complete.
[ Node 1 ] CommManager initialization complete.
[ Node 0 ] <GM>: 6 global vertices, 5 global edges,
3 local vertices, 2 local in edges, 3 local out edges
0 out ghost vertices, 1 in ghost vertices
[ Node 0 ] Loading feature cache from /home/ubuntu/filepool/simplegraph/parts_2/feats4.0.bin...
[ Node 1 ] <GM>: 6 global vertices, 5 global edges,
3 local vertices, 3 local in edges, 2 local out edges
1 out ghost vertices, 0 in ghost vertices
[ Node 1 ] Loading feature cache from /home/ubuntu/filepool/simplegraph/parts_2/feats4.1.bin...
[ Node 1 ] Engine initialization complete.
[ Node 0 ] Engine initialization complete.
[ Node 0 ] Number of epochs: 10, validation frequency: 1
[ Node 0 ] Sync Epoch 1 starts...
[ Node 1 ] Sync Epoch 1 starts...
scp: /home/ubuntu/tmpfiles/output_*: No such file or directory
scp: /home/ubuntu/tmpfiles/output_*: No such file or directory
Check the output results in "~/outfiles/" folder.
Check the running log under "~/logfiles/" folder.
Execution fails (at least on this node), check the log file!
Killing existing 'graphserver' processes...
graphserver: no process found
graphserver: no process found
Killing existing 'weightserver' processes...
Running WEIGHT servers with: [ MARK # 17 ]...
./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/simplegraph.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02
[ WS 0 ] Initializing nodes...
[ WS 0 ] All weight servers connected.
Binding weight server to tcp://*:65433...
[ WS 0 ] Layer 0 - Weights: (4, 8)
[ WS 0 ] Layer 1 - Weights: (8, 10)
[ WS 0 ] All nodes up to date.
[ INFO ] Number of lambdas set to 10.
Check the output results in "~/outfiles/" folder.
By the way, if I use the ./run/run-onnode graph simplegraph cpu command, the output log is just the same; it returns the same error.
Hi, thanks for the detailed information.
So first, a question about running the CPU version. Just making sure: when you built Dorylus, did you make sure to build with the CPU option? It compiles the specific ResComm that you specify at compilation time and unfortunately cannot use any others unless you recompile with different options first. If you have done this, then my other thought is that the graph might be too small. We use that graph as an example for data formatting, but we never actually ran on a graph of that size, so it could be causing weird memory issues.
I recommend trying the reddit-small dataset, as it is widely used and we have already used it with our system. You can find the dataset here.
If this doesn't work, let us know. We can try recreating some of the issues you are experiencing to better assist you.
There are just a few other things I want to point out in case they are problems:
Lambda logs are written to the CloudWatch log group /aws/lambda/<function-name>, so check there for your function's output.
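If you want to verify programmatically whether the function has ever produced logs in a given region, here is a minimal sketch with boto3 (the function name gcn and the region are assumptions from this thread):

```python
# Minimal check: has the "gcn" Lambda written any logs in this region?
# The log group only appears after an invocation actually starts running.
import boto3

logs = boto3.client("logs", region_name="us-east-1")  # adjust to your region
try:
    streams = logs.describe_log_streams(
        logGroupName="/aws/lambda/gcn",
        orderBy="LastEventTime",
        descending=True,
        limit=5,
    )["logStreams"]
    for s in streams:
        print(s["logStreamName"], s.get("lastEventTimestamp"))
except logs.exceptions.ResourceNotFoundException:
    print("Log group not found: the function was probably never invoked here.")
```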
Thank you very much for your reply! I downloaded the reddit dataset here and found that it consists of several files. But these files' formats do not conform to the required format described in the Dorylus wiki section 3, Prepare Input Dataset. And I cannot separate the reddit-large and reddit-small datasets in these files. Could you tell me how I can generate the required graph, feature, and label files from the dataset? Thank you.
With CPU enabled, the execution returns succeed, but the output file is empty. batch Acc and Loss in the log are -nan, so there must be some problems in the execution. Maybe this can be solved later by using the reddit-small dataset?
Part of the logs:
[ Node 0 ] Sync Epoch 5 starts...
[ Node 1 ] Time for epoch 4: 10.88ms
[ Node 1 ] Sync Epoch 5 starts...
[ Node 1 ] batch Acc: -nan, Loss: -nan
[ Node 0 ] batch Acc: -nan, Loss: -nan
[ Node 1 ] Time for epoch 5: 10.85ms
[ Node 1 ] Sync Epoch 6 starts...
[ Node 0 ] Time for epoch 5: 10.90ms
[ Node 0 ] Sync Epoch 6 starts...
[ Node 0 ] batch Acc: -nan, Loss: -nan
[ Node 1 ] batch Acc: -nan, Loss: -nan
The gcn lambda is not even invoked, according to its log. I manually added a printLog(xxx) at the head of the function LambdaComm::invokeLambda in commmanager/lambda_comm.cpp, but it prints nothing.
Hi,
Glad to hear that you have set up the CPU version successfully. Regarding your questions,
0.1. You can refer to the DGL example (https://docs.dgl.ai/en/0.6.x/_modules/dgl/data/reddit.html) for the GNN background and for how to prepare the dataset. I would suggest starting with their preprocessed dataset, which shares a similar (though still different) format with ours.
0.2. Unfortunately, the original reddit dataset has a different format from Dorylus's. You have to write some scripts to convert the data into our format. Basically, the graph file is simply the edge list of the graph, where each line represents a directed edge. The feature file consists of the features of the vertices, where each line is the feature vector of one vertex. The label file specifies which class each vertex belongs to, where each line is the class of the corresponding vertex. Our scripts handle the remaining tasks and convert these text files into binary files (a sketch follows below).
0.3. The dataset here is for reddit-small in the paper only. Reddit-large was constructed by ourselves from the raw reddit dataset.
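As a hedged sketch of the conversion described in 0.2 (the output file names and DGL version are assumptions; Dorylus's own preprocessing scripts then turn these text files into binary files):

```python
# Convert DGL's RedditDataset into the plain-text layout described above:
# an edge-list graph file, a per-vertex feature file, and a per-vertex
# label file. File names here are illustrative.
from dgl.data import RedditDataset

g = RedditDataset()[0]

# Graph file: one directed edge "src dst" per line.
src, dst = g.edges()
with open("reddit.graph", "w") as f:
    for s, d in zip(src.tolist(), dst.tolist()):
        f.write(f"{s} {d}\n")

# Feature file: one whitespace-separated feature vector per vertex.
with open("features", "w") as f:
    for row in g.ndata["feat"].numpy():
        f.write(" ".join(str(x) for x in row) + "\n")

# Label file: one class index per vertex.
with open("labels", "w") as f:
    for y in g.ndata["label"].numpy():
        f.write(f"{int(y)}\n")
```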
1.1. It's OK if your output file is empty. You should be able to find all essential information in the logs printed in the terminal.
1.2. The accuracy and loss are not reliable due to the synthetic dataset you are using. The simple graph in the doc is merely an example to show the format of the dataset files, not a real-world graph dataset. According to your log, you have run the CPU version successfully. So yes, reddit-small will give you reasonable accuracy numbers.
Remember to re-send the source (./gnnman/send-source --force) and re-build Dorylus when you switch between different backends. Given that you had run the Dorylus lambda version and invoked lambdas before, I think this can help you invoke your lambdas at least.
Nits:
You ran ./run/run-onnode graph simplegraph --l 5 --e 5. We typically passed the arguments with = in our experiments, like ./run/run-onnode graph simplegraph --l=5 --e=5. I am not sure if this leads to any problem; you can give it a try.
Thank you very much for your help!
I used DGL's RedditDataset (i.e., the reddit-small dataset) as you recommended, and it can now run successfully in both CPU mode and Lambda mode (according to the master graph server's log).
However, the outputs seem a little weird. Could you help me check whether the execution and outputs are correct?
Points that seem strange to me:
1. The Time per stage values are all 0.000 ms.
2. The Acc values in the logs remain unchanged after epoch 2. Does it converge that fast? Or does it cache the previous feature results?
Firstly, I prepared the reddit dataset and ran Dorylus in CPU mode. Here are the relevant logs:
[ Node 0 ] Preprocessing... Output to /filepool/reddit/parts_2/graph.0.bin
[ Node 1 ] CommManager initialization complete.
[ Node 1 ] Preprocessing... Output to /filepool/reddit/parts_2/graph.1.bin
[ Node 1 ] Finish preprocessing!
[ Node 1 ] <GM>: 232965 global vertices, 114615892 global edges,
113084 local vertices, 46862283 local in edges, 46862283 local out edges
66256 out ghost vertices, 66256 in ghost vertices
[ Node 1 ] No feature cache, loading raw data...
[ Node 1 ] Engine initialization complete.
[ Node 0 ] Finish preprocessing!
[ Node 0 ] <GM>: 232965 global vertices, 114615892 global edges,
119881 local vertices, 67753609 local in edges, 67753609 local out edges
76601 out ghost vertices, 76601 in ghost vertices
[ Node 0 ] No feature cache, loading raw data...
[ Node 0 ] Engine initialization complete.
[ Node 0 ] Number of epochs: 10, validation frequency: 1
[ Node 0 ] Sync Epoch 1 starts...
[ Node 1 ] Sync Epoch 1 starts...
[ Node 1 ] batch Acc: 0.096304, Loss: -nan
[ Node 0 ] batch Acc: 0.002169, Loss: nan
[ Node 0 ] Time for epoch 1: 9040.82ms
[ Node 0 ] Sync Epoch 2 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: nan
[ Node 0 ] Time for epoch 2: 9126.88ms
[ Node 0 ] Sync Epoch 3 starts...
[ Node 1 ] Time for epoch 2: 9126.90ms
[ Node 1 ] Sync Epoch 3 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 3: 8823.63ms
[ Node 0 ] Sync Epoch 4 starts...
[ Node 1 ] Time for epoch 3: 8823.63ms
[ Node 1 ] Sync Epoch 4 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 4: 8680.90ms
[ Node 0 ] Sync Epoch 5 starts...
[ Node 1 ] Time for epoch 4: 8680.90ms
[ Node 1 ] Sync Epoch 5 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 5: 8451.58ms
[ Node 0 ] Sync Epoch 6 starts...
[ Node 1 ] Time for epoch 5: 8451.56ms
[ Node 1 ] Sync Epoch 6 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 6: 8459.82ms
[ Node 0 ] Sync Epoch 7 starts...
[ Node 1 ] Time for epoch 6: 8459.83ms
[ Node 1 ] Sync Epoch 7 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 7: 8452.07ms
[ Node 0 ] Sync Epoch 8 starts...
[ Node 1 ] Time for epoch 7: 8452.08ms
[ Node 1 ] Sync Epoch 8 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 8: 8504.51ms
[ Node 0 ] Sync Epoch 9 starts...
[ Node 1 ] Time for epoch 8: 8504.50ms
[ Node 1 ] Sync Epoch 9 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 0 ] Time for epoch 9: 8764.25ms
[ Node 0 ] Sync Epoch 10 starts...
[ Node 1 ] Time for epoch 9: 8764.25ms
[ Node 1 ] Sync Epoch 10 starts...
[ Node 1 ] batch Acc: 0.096480, Loss: -nan
[ Node 0 ] batch Acc: 0.000584, Loss: -nan
[ Node 1 ] Time for epoch 10: 7847.84ms
[ Node 0 ] Time for epoch 10: 8753.58ms
[ Node 0 ] <EM>: Run start time: Tue Aug 10 12:22:58 2021
[ Node 0 ] <EM>: Run end time: Tue Aug 10 12:24:26 2021
[ Node 0 ] <EM>: Backend CPU
[ Node 0 ] <EM>: Dataset: /filepool/reddit/parts_2/ (602, 128, 41)
[ Node 0 ] <EM>: staleness: 4294967295
[ Node 0 ] <EM>: 10 sync epochs and 0 async epochs
[ Node 0 ] <EM>: Using 1 lambdas
[ Node 0 ] <EM>: Initialization takes 134452.368 ms
[ Node 0 ] Relaunched Lambda Cnt: 0
[ Node 0 ] <EM>: Average sync epoch time 8705.804 ms
[ Node 1 ] <EM>: Average sync epoch time 8615.241 ms
[ Node 1 ] <EM>: Average async epoch time 0.000 ms
<EM>: Run start time: Tue Aug 10 12:22:58 2021
<EM>: Run end time: Tue Aug 10 12:24:26 2021
<EM>: Using 1 lambdas
<EM>: Initialization takes 134452.368 ms
<EM>: Forward: Time per stage:
<EM> Aggregation 0 0.000 ms
<EM> ApplyVertex 0 0.000 ms
<EM> Scatter 0 0.000 ms
<EM> ApplyEdge 0 0.000 ms
<EM> Aggregation 1 0.000 ms
<EM> ApplyVertex 1 0.000 ms
<EM> Scatter 1 0.000 ms
<EM> ApplyEdge 1 0.000 ms
<EM>: Total forward-prop time 0.000 ms
<EM>: Backward: Time per stage:
<EM> Aggregation 2 0.000 ms
<EM> ApplyVertex 2 0.000 ms
<EM> Scatter 2 0.000 ms
<EM> ApplyEdge 2 0.000 ms
<EM> Aggregation 3 0.000 ms
<EM> ApplyVertex 3 0.000 ms
<EM> Scatter 3 0.000 ms
<EM> ApplyEdge 3 0.000 ms
<EM>: Total backward-prop time 0.000 ms
<EM>: Average epoch time 8705.804 ms
<EM>: Final accuracy 0.000
Relaunched Lambda Cnt: 0
After running the CPU version of Dorylus, I recompiled the graph servers with the Lambda backend, restarted the weight server, and ran ./run/run-onnode graph reddit --l=5 --e=5 to run the Lambda version of Dorylus:
...
[ Node 1 ] [ FUNC ERROR ] Unhandled, RequestId: 530d9c9b-4527-4bc6-805d-d3ac25e08c72 Error: Runtime exited with error: signal: killed
[ Node 0 ] curr 19631, timed 19235, chunk 5:B:0:3/3: vtx 1
[ Node 0 ] Relaunch lambda 5:B:0:3/3: vtx 1 for layer 0...
[ Node 1 ] curr 20335, timed 20045, chunk 5:B:0:3/8: vtx 1
[ Node 1 ] Relaunch lambda 5:B:0:3/8: vtx 1 for layer 0...
[ Node 1 ] curr 20349, timed 20045, chunk 5:B:0:4/9: vtx 1
[ Node 1 ] Relaunch lambda 5:B:0:4/9: vtx 1 for layer 0...
[ Node 1 ] Time for epoch 5: 40582.43ms
[ Node 0 ] Time for epoch 5: 40789.55ms
[ Node 0 ] <EM>: Run start time: Tue Aug 10 12:38:16 2021
[ Node 0 ] <EM>: Run end time: Tue Aug 10 12:42:37 2021
[ Node 0 ] <EM>: Backend LAMBDA:gcn
[ Node 0 ] <EM>: Dataset: /filepool/reddit/parts_2/ (602, 128, 41)
[ Node 0 ] <EM>: staleness: 4294967295
[ Node 0 ] <EM>: 5 sync epochs and 0 async epochs
[ Node 0 ] <EM>: Using 5 lambdas
[ Node 0 ] <EM>: Initialization takes 2939.644 ms
[ Node 0 ] Relaunched Lambda Cnt: 8
[ Node 0 ] <EM>: Average sync epoch time 52276.335 ms
[ Node 1 ] <EM>: Average sync epoch time 52234.906 ms
[ Node 0 ] <EM>: Average async epoch time 0.000 ms
[ Node 1 ] <EM>: Average async epoch time 0.000 ms
./build/weightserver /home/ubuntu/dshmachines /home/ubuntu/myprip /home/ubuntu/gserverip 5000 65433 55431 /home/ubuntu/reddit.config /home/ubuntu/tmpfiles 1 1 0 GCN 0.01 0.02
[ WS 0 ] Initializing nodes...
[ WS 0 ] All weight servers connected.
[ WS 0 ] Layer 0 - Weights: (602, 128)
[ WS 0 ] Layer 1 - Weights: (128, 41)
[ WS 0 ] All nodes up to date.
Binding weight server to tcp://*:65433...
[ INFO ] Number of lambdas set to 10.
[ WS 0 ] Epoch 1, acc: 0.0530, loss: -nan
[ WS 0 ] w Local Layer 1 Current version: 1, active weight version cnt: 1 0:10 0:0 -> (1:0, )
[ WS 0 ] w Local Layer 0 Current version: 1, active weight version cnt: 1 0:10 0:0 -> (1:0, )
[ WS 0 ] Epoch 2, acc: 0.0548, loss: -nan
[ WS 0 ] w Local Layer 1 Current version: 2, active weight version cnt: 1 0:10 0:0 -> (2:0, )
[ WS 0 ] w Local Layer 0 Current version: 2, active weight version cnt: 1 0:10 0:0 -> (2:0, )
[ WS 0 ] Epoch 3, acc: 0.0548, loss: -nan
[ WS 0 ] w Local Layer 1 Current version: 3, active weight version cnt: 1 0:10 0:0 -> (3:0, )
[ WS 0 ] w Local Layer 0 Current version: 3, active weight version cnt: 1 0:10 0:0 -> (3:0, )
[ WS 0 ] Epoch 4, acc: 0.0548, loss: -nan
[ WS 0 ] w Local Layer 1 Current version: 4, active weight version cnt: 1 0:10 0:0 -> (4:0, )
[ WS 0 ] w Local Layer 0 Current version: 4, active weight version cnt: 1 0:10 0:0 -> (4:0, )
[ WS 0 ] Epoch 5, acc: 0.0548, loss: -nan
[ WS 0 ] w Local Layer 1 Current version: 5, active weight version cnt: 1 0:10 0:0 -> (5:0, )
[SHUTDOWN] Deleting workers
[SHUTDOWN] Server shutting down...
We are terminating the weight server
[SHUTDOWN] Closing ZMQ
Assertion failed: pfd.revents & POLLIN (src/signaler.cpp:261)
Check the output results in "~/outfiles/" folder.
<EM>: Run start time: Tue Aug 10 12:38:16 2021
<EM>: Run end time: Tue Aug 10 12:42:37 2021
<EM>: Using 5 lambdas
<EM>: Initialization takes 2939.644 ms
<EM>: Forward: Time per stage:
<EM> Aggregation 0 0.000 ms
<EM> ApplyVertex 0 0.000 ms
<EM> Scatter 0 0.000 ms
<EM> ApplyEdge 0 0.000 ms
<EM> Aggregation 1 0.000 ms
<EM> ApplyVertex 1 0.000 ms
<EM> Scatter 1 0.000 ms
<EM> ApplyEdge 1 0.000 ms
<EM>: Total forward-prop time 0.000 ms
<EM>: Backward: Time per stage:
<EM> Aggregation 2 0.000 ms
<EM> ApplyVertex 2 0.000 ms
<EM> Scatter 2 0.000 ms
<EM> ApplyEdge 2 0.000 ms
<EM> Aggregation 3 0.000 ms
<EM> ApplyVertex 3 0.000 ms
<EM> Scatter 3 0.000 ms
<EM> ApplyEdge 3 0.000 ms
<EM>: Total backward-prop time 0.000 ms
<EM>: Average epoch time 52276.335 ms
<EM>: Final accuracy 0.000
Relaunched Lambda Cnt: 8
Great, the system is now working. But you are right, it is not learning according to the posted output.
Questions 2, 3, & 5: sorry about the confusion. The output files generated are useless after the code was refactored for the artifact. You should only pay attention to the logs printed out in the terminal.
Questions 1 & 4: no, the loss shouldn't be nan, and the accuracy should increase per epoch if it learns correctly. As Figure 5 in the paper shows, the accuracy should reach 0.90 quickly (within 20 epochs), while the loss decreases correspondingly.
My few suggestions: check the scripts under benchmark/ to set the staleness and learning rate correctly.
Thanks for your help! I tried two more datasets from DGL: KarateClubDataset and cora.
KarateClub can only run successfully on the CPU version; Loss values are still nan if I use it on the lambda version.
But cora now runs on both the CPU and Lambda versions, and the Acc and Loss values are not nan. It is awesome!
Thank you for your help again! I can start my journey with Dorylus now.
I can provide some observations that I found while debugging (although you might have found them already):
1. The nan numbers come from the softmax(z) function. For example, in the CPU version it is in src/graph-server/commmanager/CPU_comm.cpp; some vecSrc values are nan in the 6th line of the function.
2. The nan vecSrc values themselves come from the dot operation, i.e., Matrix z = feats.dot(weight);, in the 3rd line of CPUComm::vtxNNForwardGCN. I have no idea why this dot operation generates nan values.
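(A side note not from this thread: a classic way softmax(z) itself produces nan is overflow, since exp of a large input becomes inf and inf/inf yields nan. The usual guard is to subtract the row max before exponentiating. A minimal NumPy sketch:)

```python
# nan from softmax via overflow: exp(1000) -> inf, and inf/inf -> nan.
# The stable variant shifts each row by its max before exponentiating.
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows to inf for entries >~ 709 (float64)
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # row max becomes 0
    return e / e.sum(axis=1, keepdims=True)

z = np.array([[1000.0, 0.0], [1.0, 2.0]])
print(softmax_naive(z))   # first row contains nan (inf / inf)
print(softmax_stable(z))  # finite everywhere: [[1. 0.], [0.269 0.731]]
```

This does not explain nan values that already appear in the dot() output, but it is the first thing to rule out when softmax emits nan.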
As there is now a working dataset, I don't need to dig into why the reddit and KarateClub datasets cause nan "Loss" and "Acc" values anymore. But I'm worried that this problem will trouble me later.
Could you tell me why Dorylus does not open-source a ready-to-use dataset? Maybe it's because this job is easy for most GNN developers? If needed, I'm glad to offer my scripts for generating a working dataset, i.e., cora.
Dear maintainers, I'm running Dorylus on lambda right now. My graph server master is stuck at Epoch 1, and then it returns:
"[ Node 0 ] [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds".
My lambda prints no log, so I have no idea what's going on. Could you help me solve the problem? Thank you very much! More details are presented below.
My Cluster
graphserver: 2 EC2 c5n.2xlarge instances, us-east-1f
weightserver: 1 EC2 t2.medium instance, us-east-1c
lambda: named "gcn", memory 192 MB, timeout 10 min.
The lambda is configured to be in the same VPC as the graph servers and the weight server (but the GSs and WS are not in the same subnet, as shown above).
My Input Dataset and Configs
Actually, I don't have much knowledge about machine learning, so I use the example simple graph presented in the tutorial as the input dataset:
small.graph
feature
labels
layerconfig (renamed to simplegraph.config later)
Other configs: all other configs (e.g., gserverport, nodeport) use their default values.
Commands and Full Logs
Command:
./run/run-onnode graph simplegraph --l 5 --e 5
./run/run-onnode weight simplegraph
Full Logs
No log is presented in CloudWatch.