gunrock / gunrock

Programmable CUDA/C++ GPU Graph Analytics
https://gunrock.github.io/gunrock/
Apache License 2.0

request for Volta numbers on 9 datasets #751

Open jowens opened 4 years ago

jowens commented 4 years ago

I searched our results repo for these 9 datasets across 4 Gunrock primitives on V100. SSSP is directed; the other three are undirected.

CC: @neoblizz @AjayBrahmakshatriya @crozhon

GitHub doesn't allow .html attachments, so I'm renaming my HTML table as .txt and attaching it below.

ugf_table.txt

Pasting the source code here for posterity.

```python
# Note: "df" holds the full results table loaded from our results repo;
# "save" is our plotting helper and "alt" is Altair.
ugf = df.copy()
ugf_datasets = [
    "soc-orkut",
    "soc-twitter-2010",
    "soc-LiveJournal1",
    "soc-sinaweibo",
    "hollywood-2009",
    "indochina-2004",
    "road_usa",
    "road_central",
    "roadNet-CA",
]
ugf = ugf[
    (ugf["gpuinfo_name"] == "Tesla V100")
    & (ugf["dataset"].isin(ugf_datasets))
    & (
        ((ugf["undirected"] == True) & ugf["primitive"].isin(["dobfs", "pr", "bc"]))
        | ((ugf["undirected"] == False) & ugf["primitive"].isin(["sssp"]))
    )
]

# Keep only the fastest run per (dataset, primitive, undirected) group.
ugf_fastest = ugf.groupby(["dataset", "primitive", "undirected"])[
    "avg_process_time"
].transform("min")
ugf = ugf[ugf["avg_process_time"] == ugf_fastest]

# Write out an HTML table ('tablehtml') of the fastest runs.
save(
    chart=alt.Chart(),
    df=ugf,
    plotname="ugf",
    outputdir="../plots",
    formats=["tablehtml"],
    sortby=[
        "primitive",
        "dataset",
        "engine",
        "gunrock_version",
        "advance_mode",
        "undirected",
        "mark_pred",
        "idempotence",
    ],
    columns=[
        "primitive",
        "dataset",
        "avg_process_time",
        "avg_mteps",
        "engine",
        "num_vertices",
        "num_edges",
        "gunrock_version",
        "advance_mode",
        "undirected",
        "mark_pred",
        "idempotence",
        "gpuinfo_name",
        "gpuinfo_name_full",
        "time",
        "details",
    ],
    mdtext="",
)
```
jowens commented 4 years ago

In general I think our numbers are a little better than the ones reported to us for BC, PR, and SSSP. Some of the BFS runs reported to us are faster than what we had, so we're very interested in the settings for those.

AjayBrahmakshatriya commented 4 years ago

Thanks a lot for reporting this discrepancy and also providing the exact commands to reproduce these results. I will run the commands and generate a similar table. I also have the scripts we used in the first place to produce the numbers we reported in the paper. I can find the parameters that produce better numbers on BFS.

jowens commented 4 years ago

@AjayBrahmakshatriya any good settings for us?

jowens commented 4 years ago

@AjayBrahmakshatriya Checkin' in.

AjayBrahmakshatriya commented 4 years ago

Hi @jowens and @neoblizz, I reran the experiments on our machine with the flags you provided. Each of the flag configurations produced a lot of JSON files (because some of the flags had multiple options), so I chose the fastest average process time out of those. Here are the results for the SSSP, BFS, and BC experiments (all times in ms):

BC

| Graph | Gunrock-reported | Our new | Our old |
| --- | --- | --- | --- |
| hollywood-2009 | 14.445 | 13.678 | 29.030 |
| indochina-2004 | 29.980 | 30.397 | 47.970 |
| roadNet-CA | 95.989 | 134.038 | 86.440 |
| road_central | 582.998 | 1012.211 | 632.970 |
| road_usa | 1001.980 | 1593.763 | 987.930 |
| soc-LiveJournal1 | 36.388 | 36.056 | 88.040 |
| soc-orkut | 73.996 | 71.183 | 213.930 |
| soc-sinaweibo | 557.461 | 548.531 | 1095.600 |
| soc-twitter-2010 | 228.118 | 227.485 | 505.310 |

BFS

| Graph | Gunrock-reported | Our new | Our old |
| --- | --- | --- | --- |
| hollywood-2009 | 1.008 | 1.880 | 1.670 |
| indochina-2004 | 13.623 | 13.266 | 12.570 |
| roadNet-CA | 51.087 | 81.530 | 68.350 |
| road_central | 301.835 | 521.023 | 434.010 |
| road_usa | 515.051 | 766.221 | 775.160 |
| soc-LiveJournal1 | 3.021 | 3.814 | 3.630 |
| soc-orkut | 3.986 | 4.305 | 1.660 |
| soc-sinaweibo | 10.741 | 10.805 | 94.770 |
| soc-twitter-2010 | 15.432 | 16.221 | 13.520 |

SSSP (weights 0-1000)

| Graph | Gunrock-reported | Our new | Our old |
| --- | --- | --- | --- |
| hollywood-2009 | 6.136 | 7.292 | 5.870 |
| indochina-2004 | 9.080 | 16.909 | 13.630 |
| roadNet-CA | 48.623 | 80.462 | 66.790 |
| road_central | 336.540 | 590.307 | 429.240 |
| road_usa | 499.529 | 908.687 | 788.230 |
| soc-LiveJournal1 | 15.351 | 18.103 | 14.550 |
| soc-orkut | 7.878 | 32.933 | 243.050 |
| soc-sinaweibo | 271.258 | 267.738 | 235.750 |
| soc-twitter-2010 | 115.023 | 106.935 | 97.520 |

Unfortunately, for PR I couldn't reproduce the experiments, because all of them segfault before printing any timings or producing the JSON files. I am currently on commit 7c197d6a498806fcfffd1f9304c663379a77f5e4 (HEAD, tag: v1.1). @neoblizz, do you know what might be causing this?

Anyway, for the rest of the experiments, some of the numbers are better than our earlier ones, and we are happy to change those in the paper. However, for some of the road graphs there is still a discrepancy between what you reported and what the new runs produced. Our first guess was that this is because of a different weight distribution, but the discrepancy is also there for BFS and BC, which are unweighted.

Any clue why this might be happening?

neoblizz commented 4 years ago

> Unfortunately, for PR I couldn't reproduce the experiments, because all of them segfault before printing any timings or producing the JSON files. I am currently on commit 7c197d6a498806fcfffd1f9304c663379a77f5e4 (HEAD, tag: v1.1). @neoblizz, do you know what might be causing this?

I am investigating a bug in PR with @crozhon right now; what was the exact error that you encountered here?

> However, for some of the road graphs there is still a discrepancy between what you reported and what the new runs produced. Our first guess was that this is because of a different weight distribution, but the discrepancy is also there for BFS and BC, which are unweighted.

Were random source nodes used for these runs? Maybe that is the cause. How large was the difference? It would help if I had the command lines for these runs. Also, some changes have been made since some of those runs, which may have caused minor differences as well.

jowens commented 4 years ago

Can you clarify "our new" versus "our old"? Is "our new" the result of rerunning with the command line we gave you, whereas "our old" is what you originally ran on your own?

Certainly we are interested in the command lines where "our old" is faster than our run.

jowens commented 4 years ago

(FWIW, Muhammad redesigned the command lines so that you can specify multiple runs with one command line; you'll see something like --param=true,false, and it will do separate runs with --param=true and --param=false. This is helpful because we then only have to load the graph once, which cuts our time down substantially on large multi-parameter runs. If you look at the parameters in the JSONs linked in the table, you can cut the command line down to only the fastest set of parameters, so you won't have to dig through multiple command lines.)
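
To illustrate the expansion, here is a rough Python sketch of the idea; this is not Gunrock's actual parser, and the flag names are only examples:

```python
from itertools import product

# Illustrative sketch only (not Gunrock's actual parser): expand
# comma-separated flag values into one parameter set per combination,
# so the graph only has to be loaded once for the whole sweep.
flags = {"--idempotence": "true,false", "--mark-pred": "true,false"}

combos = [
    dict(zip(flags.keys(), values))
    for values in product(*(v.split(",") for v in flags.values()))
]

for combo in combos:
    # Each combo is one run, e.g. {'--idempotence': 'true', '--mark-pred': 'false'}
    print(combo)
```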

AjayBrahmakshatriya commented 4 years ago

@neoblizz there wasn't an error message; the execution crashed with a segmentation fault after the PageRank run.

@jowens oh yes, I forgot to clarify the columns in the table. The first column has the numbers you shared with me a week ago. The third column has the numbers we had in our paper (with the parameter tuning we had done). The second column has the numbers from the recent experiments. For the cases where the third column is the fastest, I think it is because of the better-tuned do_a and do_b parameters. The one with a significant speedup seems to be soc-orkut on BFS. I have run the parameter sweep script again for this graph, and I will find the optimal parameters and get back to you.

jowens commented 4 years ago

Thank you for being so conscientious about this.

AjayBrahmakshatriya commented 4 years ago

Actually, the sweep just showed a reasonably close number, with the command line `./bfs market soc-orkut.mtx --direction-optimized --do-a=0.012 --do-b=0.012 --src=0 --num-runs=10 --quick --device=5`. The average execution time is 1.7415 ms (compared to 3.986 ms from your experiments and 1.66 ms in our paper). Our input graphs are already symmetrized in this case, so I haven't added the --undirected flag.

AjayBrahmakshatriya commented 4 years ago

> (FWIW, Muhammad redesigned the command lines so that you can specify multiple runs with one command line; you'll see something like --param=true,false, and it will do separate runs with --param=true and --param=false. This is helpful because we then only have to load the graph once, which cuts our time down substantially on large multi-parameter runs. If you look at the parameters in the JSONs linked in the table, you can cut the command line down to only the fastest set of parameters, so you won't have to dig through multiple command lines.)

Oh yes! I realized that from the new command lines you had sent. I modified them to choose the best parameters, as you had shown in the table, but they still weren't matching the numbers you had sent. So, just to be sure, I ran the commands as-is (except for minor changes like not exploring 64-bit vertex and edge IDs) and scanned all the generated JSONs to find the best one. The numbers in the table show the best.
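
For reference, a rough sketch of the kind of JSON scan I mean (not the exact script; it assumes each output file stores the timing under an "avg-process-time" key, which may vary between Gunrock versions):

```python
import glob
import json

# Scan every JSON emitted by a multi-parameter run and report the one
# with the lowest average process time. The "avg-process-time" key is
# an assumption and may differ between Gunrock versions.
best_time, best_file = float("inf"), None
for path in glob.glob("*.json"):
    with open(path) as f:
        run = json.load(f)
    t = run.get("avg-process-time")
    if t is not None and t < best_time:
        best_time, best_file = t, path

print(f"fastest run: {best_file} ({best_time:.3f} ms)")
```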

neoblizz commented 4 years ago

> @neoblizz there wasn't an error message; the execution crashed with a segmentation fault after the PageRank run.

Since the commit you are working on is from Sept. 2019, I believe the segfault bug has been solved and should be fixed on master (HEAD). There's a separate issue that I am working on, but using the latest commit should work for PageRank; the official release is just pending.

AjayBrahmakshatriya commented 4 years ago

> > @neoblizz there wasn't an error message; the execution crashed with a segmentation fault after the PageRank run.
>
> Since the commit you are working on is from Sept. 2019, I believe the segfault bug has been solved and should be fixed on master (HEAD). There's a separate issue that I am working on, but using the latest commit should work for PageRank; the official release is just pending.

Sounds good! If you think the performance of v1.1 and master (HEAD) should be close, I will use the latest commit for the PR experiments. Thanks for looking into this; I will let you know if it works.

jowens commented 4 years ago

> But they still weren't matching the numbers you had sent.

I promise we are suuuuuper honest about our numbers. As are you. Thank you.

AjayBrahmakshatriya commented 4 years ago

> > But they still weren't matching the numbers you had sent.
>
> I promise we are suuuuuper honest about our numbers. As are you. Thank you.

Of course! 😄

And since the issue is with just the road graphs, my guess is that it could be a difference in the input files. I will download the datasets again (I am using the same ones that you have in your repository), skip any preprocessing, and rerun the experiments.
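
To rule out input-file differences quickly, something like the sketch below could compare the MatrixMarket size headers of the old and freshly downloaded copies (file names are hypothetical):

```python
# Compare two MatrixMarket files by their size headers: the first
# non-comment line holds "num_rows num_cols num_entries", so differing
# headers mean the graphs cannot be the same. File names are hypothetical.
def mtx_header(path):
    with open(path) as f:
        for line in f:
            if not line.startswith("%"):
                return tuple(int(x) for x in line.split())

print(mtx_header("road_usa.mtx") == mtx_header("road_usa_fresh.mtx"))
```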

AjayBrahmakshatriya commented 4 years ago

> > > @neoblizz there wasn't an error message; the execution crashed with a segmentation fault after the PageRank run.
> >
> > Since the commit you are working on is from Sept. 2019, I believe the segfault bug has been solved and should be fixed on master (HEAD). There's a separate issue that I am working on, but using the latest commit should work for PageRank; the official release is just pending.
>
> Sounds good! If you think the performance of v1.1 and master (HEAD) should be close, I will use the latest commit for the PR experiments. Thanks for looking into this; I will let you know if it works.

@neoblizz I pulled the latest master branch and tried recompiling Gunrock, but I am getting a lot of CMake errors along the lines of:

```
CMake Error at ~/scratch/cmake-3.17.2-Linux-x86_64/share/cmake-3.17/Modules/FindCUDA.cmake:1837 (add_library):
  Cannot find source file:

    ......./gunrockv1.1/externals/moderngpu/src/context.hxx

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
  .hpp .hxx .in .txx
Call Stack (most recent call first):
  gunrock/CMakeLists.txt:47 (CUDA_ADD_LIBRARY)
```

There is one such error for each binary too, including pr. I am using CMake 3.17.2 (which is what I was using previously for 7c197d6 as well, and it compiled fine).

Is this a transient issue with the latest commit? Is there a stable commit from before that I can use?

neoblizz commented 4 years ago

> @neoblizz I pulled the latest master branch and tried recompiling Gunrock, but I am getting a lot of CMake errors along the lines of:
>
> ```
> CMake Error at ~/scratch/cmake-3.17.2-Linux-x86_64/share/cmake-3.17/Modules/FindCUDA.cmake:1837 (add_library):
>   Cannot find source file:
>
>     ......./gunrockv1.1/externals/moderngpu/src/context.hxx
>
>   Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
>   .hpp .hxx .in .txx
> Call Stack (most recent call first):
>   gunrock/CMakeLists.txt:47 (CUDA_ADD_LIBRARY)
> ```
>
> There is one such error for each binary too, including pr. I am using CMake 3.17.2 (which is what I was using previously for 7c197d6 as well, and it compiled fine).
>
> Is this a transient issue with the latest commit? Is there a stable commit from before that I can use?

Seems like you didn't fetch the submodules again; we now use moderngpu 2.0 instead of 1.0:

`git submodule update --init`

AjayBrahmakshatriya commented 4 years ago

> > @neoblizz I pulled the latest master branch and tried recompiling Gunrock, but I am getting a lot of CMake errors along the lines of:
> >
> > ```
> > CMake Error at ~/scratch/cmake-3.17.2-Linux-x86_64/share/cmake-3.17/Modules/FindCUDA.cmake:1837 (add_library):
> >   Cannot find source file:
> >
> >     ......./gunrockv1.1/externals/moderngpu/src/context.hxx
> >
> >   Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
> >   .hpp .hxx .in .txx
> > Call Stack (most recent call first):
> >   gunrock/CMakeLists.txt:47 (CUDA_ADD_LIBRARY)
> > ```
> >
> > There is one such error for each binary too, including pr. I am using CMake 3.17.2 (which is what I was using previously for 7c197d6 as well, and it compiled fine). Is this a transient issue with the latest commit? Is there a stable commit from before that I can use?
>
> Seems like you didn't fetch the submodules again; we now use moderngpu 2.0 instead of 1.0:
>
> `git submodule update --init`

I just pulled the latest version of Gunrock again (with --recursive), and that seems to have fixed the problem. I am running the PageRank experiments right now. Please let me know if you want the flags for any other experiments (that we had originally used in the paper).

crozhon commented 4 years ago

I just fixed some issues with using multiple options at once (--pull=false,true) and an issue with OpenMP; both fixes are only on dev. If you hit another issue, try checking that branch out.

jowens commented 4 years ago

Simply for our future use, it would be nice to have the settings for everything you tested, yes. But the primitive-specific settings are specifically what we'll use straightaway.

jowens commented 4 years ago

@AjayBrahmakshatriya can you send us the settings for any of your runs that were faster than ours?

AjayBrahmakshatriya commented 4 years ago

I just checked the logs from our experiments, and the only experiments where our numbers were faster than your reported numbers are BFS on the following two graphs. I have listed the command line arguments and the reported average times below. For the rest, your reported numbers were faster, and with the flags you mentioned we were able to match them.

I just reran these experiments to make sure there wasn't a mix-up in the flags. These experiments were run on a DGX-1 with a V100 (single GPU).

| Graph | Command line | Gunrock reported (ms) | Our experiments (ms) |
| --- | --- | --- | --- |
| soc-orkut | `./bfs market soc-orkut.mtx --direction-optimized --do-a=0.012 --do-b=0.012 --src=0 --num-runs=10 --quick --device=5` | 3.986 | 1.680 |
| soc-twitter-2010 | `./bfs market soc-twitter-2010.mtx --direction-optimized --do-a=0.003 --do-b=0.003 --src=0 --num-runs=10 --quick --device=5` | 15.432 | 13.65 |

jowens commented 4 years ago

Thank you, @AjayBrahmakshatriya!