uclasystem / dorylus

Dorylus: Affordable, Scalable, and Accurate GNN Training

Execution is stuck at Epoch 1 #3

Closed CodingYuanLiu closed 2 years ago

CodingYuanLiu commented 3 years ago

Dear maintainers, I'm running dorylus on Lambda right now. My graph server master gets stuck at Epoch 1 and then returns: "[ Node 0 ] [ FUNC ERROR ] Unhandled, 2021-07-30T03:50:31.864Z 7caf9b09-14b0-4cc9-9ca9-edd041fea72a Task timed out after 600.02 seconds". My lambda prints no logs, so I have no idea what's going on. Could you help me solve the problem? Thank you very much!

More details are presented below.

My Cluster

My Input Dataset and Configs

Actually I don't have much machine-learning background, so I used the example simple graph presented in the tutorial as the input dataset.

Commands and Full Logs

johnnt849 commented 3 years ago

Hi,

I think there are two problems with the execution:

  1. For some of our experiments we have hard-coded the region into the Lambda client, which you can see here. Change this to the region you are using, otherwise it will try to launch lambdas in the wrong region.

  2. It also seems that there is some problem with the graph server finding your graph files, as you can see from the output. Make sure that the directory structure is exactly as described in section 3 of the wiki, as the path is currently hard-coded, and make sure that the permissions on the directory are set correctly. Hopefully in a future update we can make the path less strict.

Hope this helps, and good luck!

CodingYuanLiu commented 3 years ago

Thank you very much for your reply! But some problems still remain.

  1. I had already changed the hard-coded lambda region, so that is probably not the cause.
  2. I changed the hard-coded graph file path in run/run-onnode, and it no longer reports "Can not open file..." errors. However, it now stops quickly after logging "Sync Epoch 1 starts...", reporting "Execution fails (at least on this node), check the log file". The log file does not record any errors, the output file is empty (although the file is created on the weight server), and the Lambda still has no logs in CloudWatch.

Full logs

By the way, if I use the ./run/run-onnode graph simplegraph cpu command, the output log is just the same and it returns the same error.

johnnt849 commented 3 years ago

Fixing stuck at epoch 1

Hi, thanks for the detailed information.

First, a question about running the CPU version. Just making sure: when you built Dorylus, did you build with the CPU option? The build compiles only the specific ResComm that you specify at compile time and unfortunately cannot use any others unless you recompile with different options first. If you have done this, then my other thought is that the graph might be too small. We use that graph as an example of the data format but never actually ran on a graph of that size, so it could be causing weird memory issues.

I recommend trying the reddit-small dataset, as it is widely used and we have already used it with our system. You can find the dataset here.

If this doesn't work, let us know. We can try recreating some of the issues you are experiencing to better assist you.

Other things

There are just a few other things I want to point out in case they are problems:

  1. Make sure the instances you use have a security group that allows them to communicate with all other instances in the VPC. By default all ports are blocked.
  2. To get CloudWatch logs from Lambda you might have to attach a role (ARN) to the lambdas that has permission to write to CloudWatch, which will then generate a CloudWatch log group under /aws/lambda/<function-name>. A minimal sketch of doing this follows.
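For reference, a minimal boto3 sketch of attaching such a role; the role and function names below are placeholders, not names Dorylus creates for you:

    import boto3

    # Placeholder names -- substitute the role and function you actually created.
    ROLE_NAME = "dorylus-lambda-role"
    FUNCTION_NAME = "gcn"

    iam = boto3.client("iam")
    # AWSLambdaBasicExecutionRole grants permission to create log groups/streams
    # and write log events to CloudWatch.
    iam.attach_role_policy(
        RoleName=ROLE_NAME,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    )

    # Point the Lambda at that role; its output then appears under
    # /aws/lambda/<function-name> in CloudWatch.
    role_arn = iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"]
    boto3.client("lambda").update_function_configuration(
        FunctionName=FUNCTION_NAME, Role=role_arn
    )
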
CodingYuanLiu commented 3 years ago

Thank you very much for your reply! I downloaded the reddit dataset here and found that it consists of several files, but their formats do not conform to the format described in the dorylus wiki section "3. Prepare Input Dataset", and I cannot tell the reddit-large and reddit-small datasets apart in these files. Could you tell me how to generate the required graph, feature, and label files from the dataset? Thank you.

More information

  1. After recompiling the graphserver with CPU enabled, the execution finishes successfully, but the output file is empty. The batch Acc and Loss values in the log are -nan, so there must be some problem in the execution. Maybe this can be solved later by using the reddit-small dataset? Part of the logs:
    [ Node   0 ]  Sync Epoch 5 starts...
    [ Node   1 ]  Time for epoch 4: 10.88ms
    [ Node   1 ]  Sync Epoch 5 starts...
    [ Node   1 ]  batch Acc: -nan, Loss: -nan
    [ Node   0 ]  batch Acc: -nan, Loss: -nan
    [ Node   1 ]  Time for epoch 5: 10.85ms
    [ Node   1 ]  Sync Epoch 6 starts...
    [ Node   0 ]  Time for epoch 5: 10.90ms
    [ Node   0 ]  Sync Epoch 6 starts...
    [ Node   0 ]  batch Acc: -nan, Loss: -nan
    [ Node   1 ]  batch Acc: -nan, Loss: -nan
  2. I have set the security groups of the EC2 instances to open all the ports, so this is not the problem.
  3. I re-ran the lambda-version graphserver today and its error message remains the same. However, this time the gcn lambda is not even invoked, according to its log. I manually added a printLog(xxx) call at the head of LambdaComm::invokeLambda in commmanager/lambda_comm.cpp, but it prints nothing.

ivanium commented 3 years ago

Hi,

Glad to hear that you have set up the CPU version successfully. Regarding your questions,

  0.1. You can refer to the DGL example (https://docs.dgl.ai/en/0.6.x/_modules/dgl/data/reddit.html) for the GNN background and for how to prepare the dataset. I would suggest starting with their preprocessed dataset, which shares a similar (though still different) format to ours.
  0.2. Unfortunately, the original reddit dataset has a different format from dorylus'. You have to write some scripts to convert the data into our format. Basically, the graph file is simply the edge list of the graph, where each line represents a directed edge. The feature file consists of the features of the vertices, where each line is the feature vector of one vertex. The label file specifies which class each vertex belongs to, where each line gives the class of the corresponding vertex. Our scripts should handle the remaining tasks and convert these text files into binary files (a minimal conversion sketch follows this list).
  0.3. The dataset here is for reddit-small in the paper only. Reddit-large was constructed by ourselves from the raw reddit dataset.
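For illustration, a minimal conversion sketch using DGL's preprocessed RedditDataset and the plain-text layout described above; the output file names and exact delimiters here are assumptions, so adjust them to whatever the dorylus scripts expect:

    from dgl.data import RedditDataset

    # Load DGL's preprocessed reddit dataset (reddit-small in the paper).
    g = RedditDataset()[0]
    feats = g.ndata["feat"].numpy()
    labels = g.ndata["label"].numpy()
    src, dst = (t.numpy() for t in g.edges())

    # Graph file: one directed edge "src dst" per line.
    with open("reddit.graph", "w") as f:
        for s, d in zip(src, dst):
            f.write(f"{s} {d}\n")

    # Feature file: one vertex per line, space-separated feature values.
    with open("reddit.features", "w") as f:
        for row in feats:
            f.write(" ".join(str(x) for x in row) + "\n")

    # Label file: one class id per line, line i giving vertex i's class.
    with open("reddit.labels", "w") as f:
        for y in labels:
            f.write(f"{int(y)}\n")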

  1.1. It's ok if your output file is empty. You should be able to find all the essential information in the logs printed in the terminal.
  1.2. The accuracy and loss are not reliable due to the synthetic dataset you are using. The simple graph in the doc is merely an example to show the format of the dataset files, not a real-world graph dataset. According to your log, you have run the CPU version successfully, so yes, reddit-small will give you reasonable accuracy numbers.

  1. There must be something wrong if your added debug messages didn't show up. I would suggest deleting everything on all servers (e.g., by running ./gnnman/send-source --force) and re-building dorylus whenever you switch between backends. Given that you had run the dorylus lambda version and invoked lambdas before, I think this should at least get your lambdas invoked again.

Nits:

  1. I noticed in your first post that you ran the dorylus graph server with ./run/run-onnode graph simplegraph --l 5 --e 5. In our experiments we typically passed the arguments with =, like ./run/run-onnode graph simplegraph --l=5 --e=5. I am not sure whether this causes any problem, but you can give it a try.
  2. With regard to the CloudWatch issue, I cannot think of any specific reason for now. Have you tried running a simple hello-world lambda function and letting it print something to CloudWatch, to check that the setup is good?
  3. For your layerconfig, I would suggest setting the last layer dimension to the number of label classes, for example 4 for your simplegraph dataset.

CodingYuanLiu commented 3 years ago

Thank you very much for your help!

I used DGL's RedditDataset (i.e., the reddit-small dataset) as you recommended, and it now runs successfully in both CPU mode and Lambda mode (according to the master graphserver's log).

However, the outputs seem a little weird. Could you help me check whether the execution and outputs are correct?

Points that seem strange to me:

  1. Loss values in the log are all "-nan"
  2. Output files in the weightserver are empty
  3. Output files in the graphservers show the time per stage as 0.000ms for every stage.
  4. The Acc values in the logs remain unchanged after epoch 2. Does it converge that fast, or is it caching the previous feature results?
  5. By the way, what are the expected outputs of Dorylus when the input is the reddit-small dataset?

Details

CPU version Dorylus

First, I prepared the reddit dataset and ran dorylus in CPU mode. Here are the relevant logs:

Lambda version Dorylus

After running the CPU version of Dorylus, I recompiled the graphservers with the Lambda backend, restarted the weightserver, and ran ./run/run-onnode graph reddit --l=5 --e=5 to run the Lambda version.

ivanium commented 3 years ago

Great, the system is now working. But you are right, it is not learning according to the posted output.

Questions 2, 3, & 5: sorry about the confusion. The generated output files became useless after the code was refactored for the artifact. You should only pay attention to the logs printed in the terminal.

Questions 1 & 4: no, the loss shouldn't be nan, and the accuracy should increase each epoch if training is working correctly. As Figure 5 in the paper shows, the accuracy should reach 0.90 quickly (within 20 epochs), while the loss decreases correspondingly.

My few suggestions:

  1. Use more lambdas for reddit-small (e.g., 80 lambdas per graph server). 5 lambdas might be too few to hold the data chunk and they kept timing out according to your logs.
  2. Double check that the dataset is correct, e.g., that the feature file, label file, and graph file have their vertex IDs matched, that the features are loaded correctly, etc. (a quick sanity-check sketch follows this list).
  3. DGL also provides other small datasets like cora and pubmed, which might be good for debugging and checking the configuration.
  4. Refer to the scripts in benchmark/ to set the staleness and learning rate correctly.
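For point 2, here is a quick sanity-check sketch over the converted text files; the file names are placeholders for whatever you generated:

    import numpy as np

    # Placeholder paths -- point these at your converted text files.
    GRAPH, FEATS, LABELS = "reddit.graph", "reddit.features", "reddit.labels"

    feats = np.loadtxt(FEATS)               # one feature row per vertex
    labels = np.loadtxt(LABELS, dtype=int)  # one class id per vertex
    edges = np.loadtxt(GRAPH, dtype=int)    # one "src dst" pair per line

    assert len(feats) == len(labels), "feature/label vertex counts differ"
    assert edges.max() < len(feats), "edge list references a non-existent vertex"
    assert not np.isnan(feats).any(), "feature matrix contains nan values"
    print(f"{len(feats)} vertices, {len(edges)} edges, "
          f"{labels.max() + 1} classes, feature dim {feats.shape[1]}")
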
CodingYuanLiu commented 3 years ago

Thanks for your help! I tried two more datasets from DGL: KarateClubDataset and cora. KarateClub only runs successfully on the CPU version; loss values are still nan when I use it on the lambda version. But cora now runs on both the CPU and Lambda versions, and the Acc and Loss values are not nan. It is awesome! Thank you for your help again! I can start my journey with Dorylus now.

Other information

Here are some observations from my debugging (although you might have found them already): the nan values come from the softmax(z) function; in the CPU version it is in src/graph-server/commmanager/CPU_comm.cpp. Some vecSrc values are already nan by the 6th line of that function. The nan vecSrc values themselves come from the dot operation, i.e., Matrix z = feats.dot(weight);, in the 3rd line of CPUComm::vtxNNForwardGCN. I have no idea why this dot operation generates nan values. Since I now have a working dataset, I don't need to dig into why the reddit and karateclub datasets cause nan "Loss" and "Acc" values anymore, but I'm worried this problem will trouble me later.
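As a small illustration (plain numpy, not Dorylus code) of how a single bad feature value poisons both the dot product and the softmax:

    import numpy as np

    def softmax(z):
        # Row-wise softmax over the class dimension.
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))    # 4 vertices, 8 features
    weight = rng.normal(size=(8, 3))   # 3 output classes

    feats[2, 5] = np.nan               # one bad entry in the feature matrix
    z = feats @ weight                 # nan spreads across all of row 2 of z
    print(softmax(z))                  # row 2 of the softmax output is all nan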

Dataset open source

Could you tell me why Dorylus does not provide a ready-to-use dataset? Maybe it's because this task is easy for most GNN developers? If needed, I'm glad to contribute my scripts for generating a usable dataset, e.g., cora.