nasaharvest / dora

Domain-agnostic Outlier Ranking Algorithms (DORA) - SMD cross-divisional use case demonstration of AI/ML
MIT License

Unify JPL and UMD virtual computing environments #44

Open wkiri opened 3 years ago

wkiri commented 3 years ago

We are seeing slightly different behavior in DORA runs between the JPL and UMD virtual environments, which probably means that different versions of one or more Python packages are in use. This can likely be resolved by updating setup.py to specify those package versions as well.
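One way to find which packages differ is to dump the installed versions in each environment and compare them. Below is a minimal sketch using only the standard library; the helper names `package_versions` and `diff_versions` are hypothetical and not part of DORA, and the list of packages to check would have to come from setup.py.

```python
from importlib import metadata


def package_versions(names):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(env_a, env_b):
    """Return {name: (version_a, version_b)} for every package whose version differs."""
    names = sorted(set(env_a) | set(env_b))
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in names
        if env_a.get(name) != env_b.get(name)
    }
```

Running `package_versions` on each machine and passing the two resulting dicts to `diff_versions` would list the packages worth pinning.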

@hannah-rae noted these issues:

wkiri commented 3 years ago

@hannah-rae Was this resolved?

hannah-rae commented 3 years ago

No, not yet, but it is on my to do list.

bdubayah commented 3 years ago

I think the issue might be partly related to how os.listdir or glob.glob lists directories on different machines (used when the image loader loads images). When listing the files in the fmnist or planetary test directory, I get different orderings on the UMD cluster vs my local machine. This means a labels.csv file that is correct on one machine can be totally wrong on another. One fix could be sorting directory contents once they're loaded and making sure labels correspond to that order. Or, labels files could use the filename/sample id rather than its index.

jakehlee commented 3 years ago

@wkiri and I ran into this when running experiments for our DEMUD paper. Every glob.glob() or os.listdir() call should be wrapped in sorted(), because the lists they return are in an arbitrary order determined by the individual filesystem.

https://docs.python.org/3/library/os.html#os.listdir

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order...
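As a rough illustration of the sorted() wrapping, here is a small sketch. The function name `list_files_sorted` and its parameters are hypothetical, not the actual DORA loader code.

```python
import glob
import os


def list_files_sorted(data_dir, pattern="*"):
    """Return paths in data_dir matching pattern, in the same order on every machine."""
    # glob.glob() makes no ordering guarantee, so sort before any index-based use.
    return sorted(glob.glob(os.path.join(data_dir, pattern)))
```

Note that sorted() orders strings lexicographically, so "10.png" comes before "2.png". That is still consistent across machines, but it may not match a numeric ordering assumed by an existing labels file.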

wkiri commented 3 years ago

My preference would be for labels.csv to use an identifier for each item (as noted by @bdubayah) instead of relying on ordering. In addition to increasing robustness across machines, it would mean we can more easily change the experiment to include/exclude items without having to regenerate every line in this file. This makes sense for individually named items like the images in an image data set. It's less clear how it would work for some of our other data set types. Ideas welcome :)

bdubayah commented 3 years ago

What does everyone think of this approach? I changed a few lines in the data loader so that each sample would have a string id (just converted the sample indexes to a string for tabular data), and then in the results organization used data id rather than data index to make the comparison plot (so you could run a modified experiment with an exhaustive labels file). We would still need to change the combined plot script, but I wanted to get thoughts first.
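To make the id-based lookup concrete, here is a minimal sketch. It assumes a labels file with one `sample_id,label` row per line and no header, which may not match the actual labels.csv layout; `read_labels` and `labels_for` are hypothetical names, not functions from the branch.

```python
import csv


def read_labels(fileobj):
    """Read `sample_id,label` rows into a dict keyed by string id."""
    # Skip blank or incomplete rows rather than failing on them.
    return {row[0]: row[1] for row in csv.reader(fileobj) if len(row) >= 2}


def labels_for(sample_ids, labels):
    """Look up each sample's label by id; samples without a label map to None."""
    return [labels.get(sample_id) for sample_id in sample_ids]
```

Because the lookup is by id, the result does not depend on the order in which samples were loaded, and the labels file can cover more items than a given experiment uses.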

wkiri commented 2 years ago

@bdubayah Yes, this looks great!

I think the update in dora_results_organization.py to read string names instead of integers from the labels file should also occur in combined_plot_script.py. It looks like some additional changes are needed to the latter script too. I will work on this. In the meantime, is this branch ready to merge? (issue44-unify-envs)

bdubayah commented 2 years ago

Yes, it's good to go (aside from the combined plot issues you mentioned). I think the labels files for the experiments will need to be updated as well.

wkiri commented 2 years ago

That's right. I'm updating the planetary experiment label files, but it's a good point that this will trigger updates needed for the other use cases too.

bdubayah commented 2 years ago

👍 Should I merge this to master or did you want to include the combined plot script in the same PR?

wkiri commented 2 years ago

@bdubayah Let me commit the updates to that script. It's worth alerting the team that this merge may break compatibility with experiments until folks update their label files, too.

wkiri commented 2 years ago

@bdubayah Ok, it should be ready if you want to take a look.

Note that I also changed the y axis to start from 0, since it's possible for an algorithm to not select at least one novel item in the beginning.

bdubayah commented 2 years ago

Looks good to me!

wkiri commented 2 years ago

@bdubayah Feel free to PR when ready!