nasaharvest / dora

Domain-agnostic Outlier Ranking Algorithms (DORA) - SMD cross-divisional use case demonstration of AI/ML
MIT License

Implement DORA experiment pipeline/framework #5

Closed by stevenlujpl 3 years ago

stevenlujpl commented 3 years ago
wkiri commented 3 years ago

@stevenlujpl Thanks for adding the copyright language to the source code! In addition to the all-caps paragraph, each file needs to have this part before it to indicate who the copyright holder is:

Copyright (c) 2021 California Institute of Technology (“Caltech”). U.S. Government sponsorship acknowledged.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

• Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
• Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
• Neither the name of Caltech nor its operating division, the Jet Propulsion Laboratory, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

wkiri commented 3 years ago

@stevenlujpl I am looking into updating the DEMUD code to remove the cosmic_demud dependency. Forgive this basic question, but how do I run the code in its new configuration? Inside the src/ directory and the cif-venv virtual environment, I get:

$ python simulator.py 
Traceback (most recent call last):
  File "simulator.py", line 14, in <module>
    from src.sim_config import SimulatorConfig
ImportError: No module named src.sim_config

The src/ directory is not configured as a Python package, so python -m src (from the enclosing directory) does not work. You must have some other way of running it :)
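The failure above can be reproduced in isolation. The following is a minimal sketch, not the DORA code: it builds a throwaway directory whose file names mirror the traceback (the file contents are stand-ins), then launches the script two ways. Adding an `__init__.py` and running with `python -m` from the enclosing directory is one common remedy and is an assumption here, not necessarily how the authors run the code. It assumes Python 3.7 or later.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_demo_repo(root):
    """Create a stand-in src/ package whose script uses an absolute import."""
    src = Path(root) / "src"
    src.mkdir()
    (src / "__init__.py").write_text("")  # marks src/ as a package (assumed fix)
    (src / "sim_config.py").write_text("class SimulatorConfig:\n    pass\n")
    (src / "simulator.py").write_text(
        "from src.sim_config import SimulatorConfig\n"
        "print('loaded', SimulatorConfig.__name__)\n"
    )
    return src


def run_python(args, cwd):
    """Run a child interpreter and return (exit code, stdout, stderr)."""
    # Drop PYTHONPATH so only the launch directory decides what is importable.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    proc = subprocess.run(
        [sys.executable, *args], cwd=str(cwd), env=env,
        capture_output=True, text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = build_demo_repo(tmp)
        # Run from inside src/: sys.path starts at src/, so "src" is not importable.
        inside = run_python(["simulator.py"], cwd=src)
        # Run as a module from the enclosing directory: "src" is importable.
        outside = run_python(["-m", "src.simulator"], cwd=tmp)
    return inside, outside
```

When the script is started from inside src/, the interpreter puts src/ itself at the front of the import path, so the name `src` cannot be resolved; starting it as `src.simulator` from one level up puts the enclosing directory there instead.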

stevenlujpl commented 3 years ago

@wkiri I haven't checked in all my code yet. At this point, I don't think we can run DEMUD with the DORA framework yet. However, each outlier detection algorithm should have its own command line interface to run. Can you use that for now?

wkiri commented 3 years ago

@stevenlujpl No, I get the same kind of error:

$ python demud_ranking.py 
Traceback (most recent call last):
  File "demud_ranking.py", line 16, in <module>
    from src.ranking import Ranking
ImportError: No module named src.ranking

wkiri commented 3 years ago

@stevenlujpl I also tried Python 3, but the CIF virtualenv doesn't have the packages needed to support it. Or maybe you are planning a DORA virtualenv that would have the Python 3 packages. Feel free to point me to how you are currently running it and I will use that method.

wkiri commented 3 years ago

As another suggestion, I recommend making the out_dir argument required instead of optional, for all scripts. Currently it has a default of the current directory ("."), which will always give this error:

$ python demud_ranking.py
Traceback (most recent call last):
  File "demud_ranking.py", line 176, in <module>
    main()
  File "demud_ranking.py", line 172, in main
    start(**vars(args))
  File "demud_ranking.py", line 139, in start
    **demud_params)
  File "/home/wkiri/Research/DORA/git/src/ranking.py", line 87, in run
    enable_explanation=False)
  File "/home/wkiri/Research/DORA/git/src/util.py", line 121, in save_results
    os.mkdir(out_dir)
OSError: [Errno 17] File exists: '.'

Making out_dir required would avoid this runtime error when a script is run with its default arguments.
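A minimal sketch of that suggestion, assuming the scripts use argparse and Python 3. The flag spelling (`-o/--out_dir`) and both function names are hypothetical. The second function shows an alternative that is not part of the suggestion above: tolerating an output directory that already exists.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser in which out_dir has no default and must be given."""
    parser = argparse.ArgumentParser(description="Hypothetical ranking script")
    parser.add_argument(
        "-o", "--out_dir", required=True,
        help="directory in which to save results",
    )
    return parser


def ensure_out_dir(out_dir):
    """Create out_dir if needed; unlike os.mkdir, do not fail if it exists."""
    os.makedirs(out_dir, exist_ok=True)
    return out_dir
```

With `required=True`, running the script with no arguments stops at argument parsing with a usage message instead of reaching `os.mkdir`. With `exist_ok=True`, passing an existing directory such as "." no longer raises.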

hannah-rae commented 3 years ago

The CIF implementation for DORA shuffled the indices and returned the shuffled indices with a score of 0.0 for all samples. Since all DORA algorithms currently return only scores, I returned the indices as the scores so that sorting by score reproduces the random order.

This can be updated to return shuffled indices and 0.0 scores if the algorithms are updated to return both sel_ind and scores. See discussion on Slack:

How about just returning sel_ind and scores associated with those selections as in the CIF code? The CIF algs (except DEMUD) were I believe sorting by score internally (because we were ranking), but this step could be skipped and they could return [0,1,2,3...] as sel_ind with [score_0,score_1,....] as scores and then the Results Org could decide whether to do the sorting. That way the cost of sorting only happens if the user wants it.
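The interface proposed in that discussion could be sketched as follows. This is an illustration only: the function names are hypothetical, and treating a higher score as "more of an outlier" is an assumption. Algorithms return `(sel_ind, scores)`, and a separate results step sorts only on request.

```python
import random


def random_baseline(n_samples, seed=0):
    """Hypothetical random ranking: shuffled indices, every score 0.0."""
    sel_ind = list(range(n_samples))
    random.Random(seed).shuffle(sel_ind)
    scores = [0.0] * n_samples
    return sel_ind, scores


def unsorted_scores(raw_scores):
    """Hypothetical scoring algorithm: indices in input order, one score each."""
    sel_ind = list(range(len(raw_scores)))
    return sel_ind, list(raw_scores)


def organize_results(sel_ind, scores, sort_by_score=False):
    """Pair indices with scores; sort (highest score first) only if asked."""
    pairs = list(zip(sel_ind, scores))
    if sort_by_score:
        # sorted() is stable, so tied scores keep the order they arrived in.
        pairs = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return pairs
```

Because the sort is stable, the random baseline keeps its shuffled order even when sorting is requested, so it would no longer need to pass indices off as scores.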

urebbapr commented 3 years ago

Update: we now have a working loader for the astronomy use case, called CatalogLoader. However, it assumes the input is .h5, not .csv, so I didn't check the box @stevenlujpl had marked for .csv data.

hannah-rae commented 3 years ago

Thanks @urebbapr! Is there any reason that this wouldn't support other feature vector datasets as well, not just astronomy catalogs? If not, do you think it makes sense to just have a FeatureVectorLoader that is used for the catalog but also supports other datasets?

urebbapr commented 3 years ago

You're absolutely right @hannah-rae. I can rename it to FeatureVectorLoader.
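One way such a general loader might look, as a rough sketch rather than the actual DORA class: only the name FeatureVectorLoader comes from the discussion above. The method names, the `dataset` key inside the .h5 file, and the assumption that .csv files hold one numeric feature vector per row with no header are all hypothetical. The .h5 branch relies on the third-party h5py package, imported only when needed.

```python
import csv
import os


class FeatureVectorLoader:
    """Hypothetical loader that picks a reader from the file extension."""

    def __init__(self, dataset="features"):
        # Name of the dataset inside an .h5 file (an assumption, not DORA's).
        self.dataset = dataset

    def load(self, path):
        """Return the file's contents as a list of feature vectors."""
        ext = os.path.splitext(path)[1].lower()
        if ext == ".csv":
            return self._load_csv(path)
        if ext in (".h5", ".hdf5"):
            return self._load_h5(path)
        raise ValueError("unsupported feature vector format: %s" % ext)

    def _load_csv(self, path):
        # Assumes numeric values only, one sample per row, no header row.
        with open(path, newline="") as f:
            return [[float(v) for v in row] for row in csv.reader(f) if row]

    def _load_h5(self, path):
        import h5py  # third-party; imported lazily so .csv use does not need it

        with h5py.File(path, "r") as f:
            # Read the whole dataset into memory and convert to nested lists.
            return f[self.dataset][()].tolist()
```

Dispatching on the extension keeps the astronomy catalog case and other feature vector datasets behind a single `load` call.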