nasaharvest / dora

Domain-agnostic Outlier Ranking Algorithms (DORA) - SMD cross-divisional use case demonstration of AI/ML
MIT License

Add PAE ranking method #13

Open · hannah-rae opened this issue 3 years ago

wkiri commented 3 years ago

@bdubayah Thank you for this! I ran it on the planetary rover Navcam images. We don't have the right CUDA drivers on our machine, so I think it is falling back to CPU mode (which is what we want). Do I need to provide any additional arguments to make this happen? I got a lot of info/warning messages on my console as follows. However, I did get selection results. Can I trust them? If these errors are harmless (just indicating fallback to CPU mode), could they be caught/suppressed (or moved to the log file) and replaced with a single message indicating "GPU support unavailable, falling back to CPU mode"?

2021-09-13 11:12:22.591780: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:12:22.591841: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Loading data_to_fit
Loading data_to_score
Feature extraction: 100%|█████████████████████████████████| 1/1 [00:01<00:00,  1.16s/it]
Feature extraction: 100%|█████████████████████████████████| 1/1 [00:00<00:00, 15.73it/s]
Outlier detection:  50%|█████████████████                 | 2/4 [00:04<00:03,  1.88s/it]2021-09-13 11:13:01.516518: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-09-13 11:13:01.621985: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:0a:00.0 name: Tesla M60 computeCapability: 5.2
coreClock: 1.1775GHz coreCount: 16 deviceMemorySize: 7.94GiB deviceMemoryBandwidth: 149.31GiB/s
2021-09-13 11:13:01.622467: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.623958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.625606: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.629197: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-09-13 11:13:01.631525: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-09-13 11:13:01.632709: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.633911: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.635503: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/wkiri/.local/lib:/usr/local/lib:/usr/lib:/usr/lib64:/usr/local/caffe/lib
2021-09-13 11:13:01.635766: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-09-13 11:13:01.636569: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-13 11:13:01.639303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-13 11:13:01.639371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]     

2021-09-13 11:13:02.467535: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-13 11:13:02.468648: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3196415000 Hz

2021-09-13 11:13:23.519885: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.

Outlier detection: 100%|██████████████████████████████████| 4/4 [00:33<00:00,  8.28s/it]
bdubayah commented 3 years ago

@wkiri The scores should be fine; if it weren't training, it would give all NaNs for the scores. I thought I had disabled logging/warnings as much as possible, but I'll go back in and see whether they can be locked down further and try to add a more descriptive message. Thanks for catching this!
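For reference, one way to quiet the C++-level messages and print a single fallback notice might look like this (a minimal sketch, not the current DORA code; the helper name is hypothetical, and the environment variable has to be set before TensorFlow is imported):

```python
import os

# Filter C++-level INFO and WARNING messages ("2" keeps only errors and above).
# This must be set before TensorFlow is imported anywhere in the process.
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")

import tensorflow as tf

# Quiet the Python-side TensorFlow logger as well.
tf.get_logger().setLevel("ERROR")


def log_device_mode(logger):
    """Emit a single message describing whether a usable GPU was found."""
    if tf.config.list_physical_devices("GPU"):
        logger.info("GPU detected; PAE training will run on the GPU.")
    else:
        logger.info("GPU support unavailable, falling back to CPU mode")
```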

wkiri commented 3 years ago

Hm, I looked more deeply into the results files, and indeed, I see "nan" for all of the scores. So I suspect it is not working correctly in CPU mode. Could you look into this? Here is the file: https://github.com/nasaharvest/dora/blob/master/exp/planetary_rover/results/pae-latent_dim%3D5/selections-pae.csv

Here is how to reproduce the experiment I did (but I recommend (1) running on a machine without GPUs and (2) commenting out all algorithms in the config file except PAE, to reduce the time you spend waiting for results):

$ python3 dora_exp_pipeline/dora_exp.py -o exp/planetary_rover/results -l planetary-last10sols.log exp/planetary_rover/planetary-last10sols.config
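
(A quick way to check whether a run hit the all-NaN failure mode, sketched with pandas and assuming the score columns in selections-pae.csv parse as numeric; the actual column layout may differ:)

```python
import pandas as pd

# Hypothetical sanity check on the selections file produced by the run above.
df = pd.read_csv("exp/planetary_rover/results/pae-latent_dim=5/selections-pae.csv")
numeric = df.select_dtypes(include="number")
if not numeric.empty and numeric.isna().all().all():
    print("All scores are NaN -- the PAE most likely failed to train.")
else:
    print("Scores look populated.")
```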
wkiri commented 3 years ago

In the short term, if you can generate updated PAE results while running on a GPU machine for this data set, that would work too! :) (You could just check in an updated selections-pae.csv file).

bdubayah commented 3 years ago

I think the NaNs are actually coming from the flow training failing because the pixel values aren't being scaled to [0, 1] (I've been training on a CPU most of the time and haven't had issues). I talked about adding this to the PAE in the meeting yesterday, but after giving it some thought today, I think the best approach is to add a pixel normalization parameter to the image data loader. Also, I assume the experiment is being run on a larger set of images than just the ones in the sample_data dir? Are these available anywhere so I can make sure it runs?
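
Roughly what such a loader option could look like (just a sketch; the function and parameter names are placeholders, not the actual DORA API):

```python
import numpy as np


def flatten_pixels(image, normalize=False):
    """Flatten an image into a 1-D float32 vector.

    `normalize` stands in for the proposed option: when True, 8-bit pixel
    values are rescaled from [0, 255] to [0, 1] so the PAE's flow training
    sees bounded inputs instead of raw intensities.
    """
    pixels = np.asarray(image, dtype=np.float32).ravel()
    if normalize:
        pixels /= 255.0
    return pixels
```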

wkiri commented 3 years ago

@bdubayah Yes, the normalization issue might be the culprit!

The files for this experiment are on the JPL servers. See config file here: https://github.com/nasaharvest/dora/tree/master/exp/planetary_rover/

If you don't have JPL access, I can zip up the files and send them to you later today.

bdubayah commented 3 years ago

@wkiri I don't think I have JPL access, so it would be great if you could send them over! For what it's worth, the model converges for me even on the very small sample dataset (once I added normalization), but it would be nice to confirm on the bigger dataset too. I'll be able to push out the fix a bit later today.

wkiri commented 3 years ago

@bdubayah Great, I just sent you an email with the (larger set of) image files.

bdubayah commented 3 years ago

Hi @wkiri, I added the changes to the PAE and re-ran the experiment (see the most recent commit). My only concern is that I added an option to the flattened pixel values extractor to normalize pixels to [0, 1], and the MDRs for the algorithms decreased a little (https://github.com/nasaharvest/dora/blob/5cf124cea699d2ffc1d7c3d6156a25e667e7beb5/exp/planetary_rover/results/comparison_plot_combined.png). I'm not sure whether this is expected. I can move the normalization into the PAE itself; I was just thinking there might be some data types for which the user would not want values normalized to [0, 1].

wkiri commented 3 years ago

@bdubayah Thanks! It is not surprising that the numeric scores would change for some algorithms (especially those that report reconstruction error, like PCA or DEMUD), but I am surprised that the order of selections has changed quite a bit. The MDRs have not only decreased; there is also much less performance separation between the algorithms. The order has even changed for "random", which suggests to me that the differences may be due to the Python environment/packages rather than the normalization. This may be related to issue #44.
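
(One way to rule out environment effects for the random baseline would be to seed it explicitly; a sketch, not the existing DORA code:)

```python
import numpy as np


def random_selection_order(n_items, seed=1234):
    """Seeded random ranking so the 'random' baseline is reproducible
    across machines and package versions; the seed value is arbitrary."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_items)
```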

I ran with just the normalization enabled and got the same results for all non-PAE algorithms as without normalization. The PAE algorithm's performance improved significantly (and the scores are no longer NaNs).

I think you can proceed with the PR/merge of this fix. If anyone does not want pixel normalization (which doesn't affect most algorithms anyway), we can discuss reverting that global change if needed.

bdubayah commented 3 years ago

Todo at this point:

hannah-rae commented 3 years ago

@bdubayah are you still working on the above tasks or is this ready to be closed?

bdubayah commented 3 years ago

@hannah-rae Still working on them!