hannah-rae opened this issue 3 years ago
@wkiri The scores should be fine; if the model weren't training at all, it would give NaNs for all of the scores. I thought I had disabled logging/warnings as much as possible, but I'll go back in and see if they can be locked down further, and try to add a more descriptive message. Thanks for finding this!
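For illustration only (this assumes the PAE is TensorFlow-based, which may not match the actual implementation), one way that lockdown could look is a small helper that routes framework chatter to the log file and prints a single fallback message:

```python
import logging
import os

# Sketch only -- not DORA's actual code. Quiets framework chatter and
# prints one CPU-fallback message; assumes a TensorFlow-based PAE.
def configure_device_logging(log_file="dora.log"):
    # Suppress TensorFlow's C++ INFO/WARNING output (must be set before import).
    os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")
    import tensorflow as tf

    # Send Python-level messages to the log file instead of the console.
    logging.basicConfig(filename=log_file, level=logging.INFO)
    tf.get_logger().setLevel(logging.ERROR)

    if not tf.config.list_physical_devices("GPU"):
        print("GPU support unavailable, falling back to CPU mode")


configure_device_logging()
```

If the PAE is PyTorch-based instead, the GPU check would be `torch.cuda.is_available()`, but the idea is the same.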
Hm, I looked more deeply into the results files, and indeed, I see "nan" for all of the scores. So I suspect it is not working correctly in CPU mode. Could you look into this? Here is the file: https://github.com/nasaharvest/dora/blob/master/exp/planetary_rover/results/pae-latent_dim%3D5/selections-pae.csv
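As a quick sanity check (the column names here are guesses, not the actual selections-pae.csv schema), something like this confirms whether every score is NaN:

```python
import pandas as pd

# Hypothetical check: adjust the path and column name to match the real file.
df = pd.read_csv("exp/planetary_rover/results/pae-latent_dim=5/selections-pae.csv")
score_cols = [c for c in df.columns if "score" in c.lower()]
if score_cols and df[score_cols[0]].isna().all():
    print("All PAE scores are NaN -- the model likely failed to train.")
```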
Here is how to reproduce the experiment I did. To reduce the time you spend waiting for results, I recommend (1) running on a machine without GPUs and (2) commenting out all algorithms in the config file except PAE:
$ python3 dora_exp_pipeline/dora_exp.py -o exp/planetary_rover/results -l planetary-last10sols.log exp/planetary_rover/planetary-last10sols.config
In the short term, if you can generate updated PAE results for this data set while running on a GPU machine, that would work too! :) (You could just check in an updated selections-pae.csv file.)
I think the NaNs are actually coming from the flow training failing because the pixel values aren't being scaled to [0, 1] (I've been training on a CPU most of the time and haven't had issues). I talked about adding this to the PAE in the meeting yesterday, but after giving it some thought today I think the best approach is to add a pixel normalization parameter to the image data loader. Also, I assume the experiment is being run on a larger set of images than just the ones in the sample_data dir? Are those images available anywhere so I can make sure it runs?
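A minimal sketch of that normalization step (the helper name is made up and assumes nothing about DORA's actual loader API) might be:

```python
import numpy as np

# Sketch only: scale raw pixel values to [0, 1] before feeding them to the
# PAE/flow. Unscaled 0-255 inputs can make the flow training diverge,
# which then surfaces as NaN scores downstream.
def normalize_pixels(batch: np.ndarray) -> np.ndarray:
    batch = batch.astype(np.float32)
    lo, hi = batch.min(), batch.max()
    if hi == lo:  # constant-valued input; avoid a divide-by-zero
        return np.zeros_like(batch)
    return (batch - lo) / (hi - lo)  # min-max scaling to [0, 1]
```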
@bdubayah Yes, the normalization issue might be the culprit!
The files for this experiment are on the JPL servers. See config file here: https://github.com/nasaharvest/dora/tree/master/exp/planetary_rover/
If you don't have JPL access, I can zip up the files and send them to you later today.
@wkiri I don't think I have JPL access, so it would be great if you could send them over! For what it's worth, the model converges for me even on the very small sample dataset (once I added normalization), but it would be nice to confirm on the bigger dataset too. I'll be able to push out the fix a bit later today.
@bdubayah Great, I just sent you an email with the (larger set of) image files.
Hi @wkiri, I added the changes to the PAE and re-ran the experiment (see the most recent commit). My only concern is that I added an option to the flattened pixel values extractor to normalize pixels to [0, 1], and the MDRs for the algorithms decreased a little (https://github.com/nasaharvest/dora/blob/5cf124cea699d2ffc1d7c3d6156a25e667e7beb5/exp/planetary_rover/results/comparison_plot_combined.png). I'm not sure whether this is expected. I can move the normalization into the PAE instead; I was just thinking there might be some data types where the user would not want values normalized to [0, 1].
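For illustration, a hypothetical version of that extractor option could look like the following; the function name and signature are assumptions, not the actual dora_exp_pipeline code:

```python
import numpy as np

# Hypothetical signature -- the real extractor may differ.
def flattened_pixel_values(image: np.ndarray, normalize_pixels: bool = False) -> np.ndarray:
    """Flatten an image into a 1-D feature vector, optionally scaled to [0, 1]."""
    features = image.astype(np.float32).ravel()
    if normalize_pixels:
        features = features / 255.0  # assumes 8-bit input; use min-max scaling otherwise
    return features
```

Defaulting the flag to off would leave other data types untouched unless the user opts in.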
@bdubayah Thanks! It is not surprising that the numeric scores would change for some algorithms (especially those that report reconstruction error, like PCA or DEMUD), but I am surprised that the order of selections has changed quite a bit. The MDRs have not only decreased; there is also much less performance separation between algorithms. The order has even changed for "random", which suggests to me that the differences may be due to the Python environment/packages rather than the normalization. This may be related to issue #44.
I ran with just the normalization change and I get the same results for all non-PAE algorithms as without normalization. The PAE algorithm's performance improved significantly (and the scores are no longer NaNs).
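For reference, one quick way to compare the selection order between two runs (file paths and column layout assumed here) is:

```python
import pandas as pd

# Hypothetical paths and layout: assumes the first column of each selections
# file identifies the selected item, in selection order.
old = pd.read_csv("results_no_norm/selections-random.csv")
new = pd.read_csv("results_norm/selections-random.csv")
print("selection order identical:",
      old.iloc[:, 0].tolist() == new.iloc[:, 0].tolist())
```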
I think you can proceed to PR/merge this fix. If anyone does not want pixel normalization (which doesn't affect most algorithms anyway), we could discuss/revert that global change if needed.
Todo at this point:
@bdubayah are you still working on the above tasks or is this ready to be closed?
@hannah-rae Still working on them!
@bdubayah Thank you for this! I ran it on the planetary rover Navcam images. We don't have the right CUDA drivers on our machine, so I think it is falling back to CPU mode (which is what we want). Do I need to provide any additional arguments to make that happen? I got a lot of info/warning messages on my console; however, I did get selection results. Can I trust them? If these errors are harmless (just indicating fallback to CPU mode), could they be caught/suppressed (or moved to the log file) and replaced with a single message indicating "GPU support unavailable, falling back to CPU mode"?