Unify JPL and UMD virtual computing environments

nasaharvest / dora

Domain-agnostic Outlier Ranking Algorithms (DORA) - SMD cross-divisional use case demonstration of AI/ML

MIT License


Unify JPL and UMD virtual computing environments #44

Open wkiri opened 3 years ago

wkiri commented 3 years ago

We are seeing slightly different behavior in DORA runs between the JPL and UMD virtual environments, which probably means that different versions of some Python package(s) are in use. This can likely be resolved by updating setup.py to specify those package versions as well.

@hannah-rae noted these issues:

When running the planetary_rover PNG sample case, the PNG images seem to be read in a different order.
When analyzing the planetary_rover PNG images, the selections are roughly the same but the scores are different.
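One way to check the package-version hypothesis is to print the installed versions in each environment and diff the two outputs. The following is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+). The package list is an assumption for illustration and should be replaced with whatever DORA's setup.py actually depends on.

```python
from importlib import metadata

# Hypothetical list of packages to compare; substitute DORA's real dependencies.
PACKAGES = ["numpy", "scipy", "scikit-learn", "Pillow"]


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print in pip's "name==version" style so the two outputs are easy to diff.
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}")
```

Running this in both the JPL and UMD environments and comparing the output would show which versions differ and might need to be pinned.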

wkiri commented 3 years ago

@hannah-rae Was this resolved?

hannah-rae commented 3 years ago

No, not yet, but it is on my to do list.

bdubayah commented 3 years ago

I think the issue might be partly related to how os.listdir or glob.glob orders directory contents on different machines (used when the image loader loads images). When listing the files in the fmnist or planetary test directory, I get different orderings on the UMD cluster vs. my local machine. This means a labels.csv file written on one machine can be totally wrong on another. One fix could be sorting directory contents once they're loaded, and making sure labels correspond to that order. Or, labels files could use the filename/sample id rather than its index.

jakehlee commented 3 years ago

@wkiri and I ran into this when running experiments for our DEMUD paper. Every glob.glob() or os.listdir() call should be wrapped in sorted(), because the lists they return are in an arbitrary order determined by the individual filesystem.

https://docs.python.org/3/library/os.html#os.listdir

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order...
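As a minimal sketch of the sorted() wrapping described above (the helper names `list_entries` and `list_images` are hypothetical, not DORA's actual loader functions):

```python
import glob
import os


def list_entries(directory):
    """Return directory entry names in a deterministic, name-sorted order."""
    return sorted(os.listdir(directory))


def list_images(directory, pattern="*.png"):
    """Return matching file paths in a deterministic, name-sorted order."""
    return sorted(glob.glob(os.path.join(directory, pattern)))
```

Note that sorted() orders strings lexicographically, so "img10.png" comes before "img2.png" unless the file names are zero-padded. The order is still the same on every machine, which is what matters here.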

wkiri commented 3 years ago

My preference would be for labels.csv to use an identifier for each item (as noted by @bdubayah) instead of relying on ordering. In addition to increasing robustness across machines, it would mean we can more easily change the experiment to include/exclude items without having to regenerate every line in this file. This makes sense for individually named items like the images in an image data set. It's less clear how it would work for some of our other data set types. Ideas welcome :)
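For image data sets, an identifier-keyed labels file could be read along these lines. This is a sketch only: it assumes a two-column labels.csv (identifier, label) with no header row, which may not match the current file layout, and `load_labels` is a hypothetical helper.

```python
import csv
import io


def load_labels(csv_file):
    """Read (identifier, label) rows into a dict keyed by identifier."""
    labels = {}
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        sample_id, label = row[0], row[1]
        labels[sample_id] = label
    return labels


# Assumed file contents; row order no longer matters because lookups are by id.
example = io.StringIO("b.png,1\na.png,0\n")
labels = load_labels(example)
print(labels["a.png"])
```

With a real file, the same function would be called as `load_labels(open("labels.csv", newline=""))`.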

bdubayah commented 3 years ago

What does everyone think of this approach? I changed a few lines in the data loader so that each sample has a string id (for tabular data, I just converted the sample indices to strings), and then in the results organization I used the data id rather than the data index to make the comparison plot (so you could run a modified experiment with an exhaustive labels file). We would still need to change the combined plot script, but I wanted to get thoughts first.
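The idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual change: `make_ids` and `labels_for_ranking` are hypothetical names, and the real data loader and results code may be organized differently.

```python
def make_ids(num_samples):
    """Assign string ids to tabular rows by converting their indices."""
    return [str(i) for i in range(num_samples)]


def labels_for_ranking(ranked_ids, labels):
    """Look up the label for each ranked id; ids with no label map to None."""
    return [labels.get(sample_id) for sample_id in ranked_ids]


# An exhaustive labels mapping covering every sample in the data set.
all_labels = dict(zip(make_ids(4), ["inlier", "outlier", "inlier", "outlier"]))

# A modified experiment that ranks only a subset of the samples.
ranked = ["3", "1"]
print(labels_for_ranking(ranked, all_labels))
```

Because the lookup is by id, the same exhaustive labels mapping serves both the full experiment and the reduced one.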