nv-morpheus / Morpheus

Morpheus SDK
Apache License 2.0

[BUG]: Nondeterministic results from gnn_fraud_detection_pipeline example #1676

Closed dagardner-nv closed 1 month ago

dagardner-nv commented 2 months ago

Version

24.06 & 23.11

Which installation method(s) does this occur on?

Source

Describe the bug.

Running this pipeline from the run.py script yields different results with each run.

Minimum reproducible example

cd ${MORPHEUS_ROOT}/examples/gnn_fraud_detection_pipeline
python run.py --output_file output1.csv
python run.py --output_file output2.csv
${MORPHEUS_ROOT}/scripts/compare_data_files.py --index_col=index output1.csv output2.csv
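
For readers who want to reproduce the comparison without the Morpheus scripts, the check can be approximated with the Python standard library. This is a rough sketch, not the logic of compare_data_files.py: the function names, the absolute tolerance, and the assumption that both files share an `index` column are all illustrative.

```python
import csv
import math


def load_rows(path, index_col="index"):
    """Read a CSV file into a dict keyed by the value of the index column."""
    with open(path, newline="") as fh:
        return {row[index_col]: row for row in csv.DictReader(fh)}


def values_match(a, b, abs_tol=1e-6):
    """Compare two cells numerically when both parse as floats, else exactly."""
    try:
        return math.isclose(float(a), float(b), rel_tol=0.0, abs_tol=abs_tol)
    except (TypeError, ValueError):
        return a == b


def count_mismatched_rows(path_a, path_b, index_col="index", abs_tol=1e-6):
    """Return (mismatched, total) over the index values present in both files."""
    rows_a = load_rows(path_a, index_col)
    rows_b = load_rows(path_b, index_col)
    shared = rows_a.keys() & rows_b.keys()
    mismatched = 0
    for key in shared:
        row_a, row_b = rows_a[key], rows_b[key]
        # A row counts as mismatched if any column of the first file differs.
        if any(not values_match(row_a[col], row_b.get(col, ""), abs_tol)
               for col in row_a):
            mismatched += 1
    return mismatched, len(shared)
```

Calling `count_mismatched_rows("output1.csv", "output2.csv")` on the two files produced above would return a pair comparable to the `Diff 191/265` figure in the log, although the exact count depends on the tolerance chosen here.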

Relevant log output


Results do not match. Diff 191/265 (72.075472 %). First 10 mismatched rows:
              1000  1001  client_node  merchant_node    1004  1005  1006  ...  ind_emb_57  ind_emb_58  ind_emb_59  ind_emb_60  ind_emb_61  ind_emb_62  ind_emb_63
index                                                                     ...                                                                                    
753   res    64.87     1        80776          91780  100482     1     1  ...    0.802462   -0.290038    0.427980    0.198512    0.051006   -0.343573    0.464877
      val    64.87     1        80776          91780  100482     1     1  ...    0.331588   -0.743460    0.057602    0.929850   -1.243031   -0.505327    0.655568
757   res  1039.87     0        86378          91782  100499     1     1  ...   -0.224234   -0.638873    3.208107   -0.581114    0.397186    0.054665   -0.771643
      val  1039.87     0        86378          91782  100499     1     1  ...   -0.180428   -0.802617    2.976311   -0.472270    0.143358    0.099732   -0.657378
758   res   130.00     1        60551          92009  100486     1     1  ...    1.158630    0.328182   -1.933454    1.046066   -1.425990   -0.595764    3.954228
      val   130.00     1        60551          92009  100486     1     1  ...    1.257934    0.483261   -1.617801    1.167611   -1.563025   -0.663233    3.842766
759   res   429.91     0        53182          91831  100510     1     1  ...    0.037058    3.374917    5.680276   -1.872390    2.608538    0.077500   -1.728548
      val   429.91     0        53182          91831  100510     1     1  ...    0.091996    3.370662    5.754779   -1.836666    2.449290    0.051542   -1.720287
760   res    19.50     1        87501          91775  100519     1     1  ...   -3.088483    3.069178    0.803048    1.461829    1.134519    1.522516    3.125674
      val    19.50     1        87501          91775  100519     1     1  ...   -3.210214    2.942406    1.007788    0.901810    1.105968    1.198515    2.616581
761   res  9100.00     0        64035          94642  100757     1     1  ...   -0.372750   -1.881383   -1.552402    0.933195   -2.579973   -0.910605    1.146505
      val  9100.00     0        64035          94642  100757     1     1  ...   -0.399015   -1.882237   -1.528848    0.836805   -2.585224   -0.798560    1.101774
762   res    30.42     0        57394          91775  100607     1     1  ...    0.019567    0.589125    2.286478    0.793889    0.114065    0.116889   -2.347970
      val    30.42     0        57394          91775  100607     1     1  ...   -0.102164    0.462353    2.491218    0.233870    0.085513   -0.207112   -2.857063
764   res    25.74     1        74296          91784  100484     1     1  ...   -1.295459    1.133638    2.393340    0.342958    0.540464    2.828200   -1.021909
      val    25.74     1        74296          91784  100484     1     1  ...   -1.636326    1.124704    2.607330   -0.247052    0.490438    2.303633   -1.600628
765   res   191.62     1        70354          96216  100483     1     1  ...   -1.228315    0.555677    4.470200   -0.938110    1.238917   -0.028751   -2.252760
      val   191.62     1        70354          96216  100483     1     1  ...   -1.447451    0.673515    4.479450   -0.968101    1.217442   -0.229318   -2.322386
766   res   116.87     1        77701          91790  100482     1     1  ...    1.203686   -0.882477    0.483784   -0.034068    0.431342   -0.423109    1.098541
      val   116.87     1        77701          91790  100482     1     1  ...    1.494646   -0.591068    0.493658   -0.357471    0.327692   -0.152491    0.899048

[20 rows x 180 columns]


dagardner-nv commented 2 months ago

Running with a single thread and limiting the output to just the index and prediction fields still produces mismatched results:

Results do not match. Diff 64/265 (24.150943 %). First 10 mismatched rows:
           prediction
index                
753   res    0.448414
      val    0.141227
758   res    0.291023
      val    0.219923
762   res    0.010932
      val    0.001276
764   res    0.003048
      val    0.001248
766   res    0.001374
      val    0.004702
767   res    0.000720
      val    0.001853
771   res    0.001419
      val    0.002863
772   res    0.845840
      val    0.910736
777   res    0.001184
      val    0.036856
781   res    0.120291
      val    0.072192
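
To put the size of these mismatches in perspective, the ten listed pairs can be compared directly. The numbers below are copied from the two reports in this issue; the variable names are illustrative only.

```python
# (index, res, val) triples copied from the prediction-only mismatch report.
pairs = [
    (753, 0.448414, 0.141227),
    (758, 0.291023, 0.219923),
    (762, 0.010932, 0.001276),
    (764, 0.003048, 0.001248),
    (766, 0.001374, 0.004702),
    (767, 0.000720, 0.001853),
    (771, 0.001419, 0.002863),
    (772, 0.845840, 0.910736),
    (777, 0.001184, 0.036856),
    (781, 0.120291, 0.072192),
]

# Absolute gap between the two runs for each listed row.
abs_diffs = {idx: abs(res - val) for idx, res, val in pairs}
worst_index = max(abs_diffs, key=abs_diffs.get)

# Share of mismatched rows in each comparison (both report 265 rows in total).
full_output_pct = 100 * 191 / 265
prediction_only_pct = 100 * 64 / 265

print(f"largest gap: index {worst_index}, "
      f"|res - val| = {abs_diffs[worst_index]:.6f}")
print(f"mismatched rows: {full_output_pct:.6f} % (all columns), "
      f"{prediction_only_pct:.6f} % (prediction only)")
```

Every listed gap exceeds 1e-3 and the largest, at index 753, is roughly 0.307. Differences of that size are well above floating-point rounding noise.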