phi-grib / flame

Modeling framework for eTRANSAFE project
GNU General Public License v3.0

Predict #88

Closed bet-gregori closed 5 years ago

bet-gregori commented 5 years ago

When predicting with the command line tool, flame looks for the activity field in the SDfile. However, since that is precisely what I'm trying to predict, my query compounds do not have an activity value field.
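For reference, here is a quick way to see which data fields a query SDF actually carries. This is just a sketch using RDKit (which flame builds on); the file name 'query.sdf' and the 'Activity' tag are only placeholders:

from rdkit import Chem

# Sketch only: list the property names each record in a query SDF carries,
# and whether the default 'Activity' tag flame expects is among them.
supplier = Chem.SDMolSupplier('query.sdf')   # placeholder file name
for i, mol in enumerate(supplier):
    if mol is None:                          # RDKit yields None for unreadable records
        continue
    props = list(mol.GetPropNames())
    print(i, props, 'Activity' in props)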

kpinto-gil commented 5 years ago

(flame) etoxws-v2:~/soft/flame_ws/mols # flame -c predict -e BSEP_upf -f minicaco_0_std.sdf
INFO - Starting prediction with model BSEP_upf version 0 for file minicaco_0_std.sdf
INFO - Running with input type: molecule
Traceback (most recent call last):
  File "/opt/anaconda2/envs/flame/bin/flame", line 11, in <module>
    load_entry_point('flame==0.1', 'console_scripts', 'flame')()
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/flame_scr.py", line 167, in main
    success, results = context.predict_cmd(model)
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/context.py", line 108, in predict_cmd
    success, results = predict.run(model['infile'])
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/predict.py", line 88, in run
    results = idata.run()
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/idata.py", line 1131, in run
    self._run_molecule()
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/idata.py", line 836, in _run_molecule
    success_inform = self.extractInformation(self.ifile)
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/idata.py", line 125, in extractInformation
    activity_num = utils.get_sdf_activity_value(mol, self.parameters)
  File "/opt/anaconda2/envs/flame/lib/python3.6/site-packages/flame/util/utils.py", line 322, in get_sdf_activity_value
    raise ValueError(f"SDFile_activity parameter '{parameters['SDFile_activity']}'"
ValueError: SDFile_activity parameter 'Activity' not found in input SDF.Change SDFile_activity param in parameter.yml to match the target prop in SDF

bet-gregori commented 5 years ago

I've tried adding a dummy 'activity' field to the query SD file, but in the output.tsv file the prediction (ymatrix column) is just the dummy activity value I entered in the query SD file.

kpinto-gil commented 5 years ago

I think there is a problem in:

utils.py: here, if you don't define the SDFile_activity parameter, the code raises an error and does not continue.

def get_sdf_activity_value(mol, parameters: dict) -> float:
    """ Checks if activity prop is the same in parameters and SDF input file

    Returns activity value as float if possible
    """
    if mol.HasProp(parameters['SDFile_activity']):
        # get sdf activity field value
        activity_str = mol.GetProp(parameters['SDFile_activity'])
        try:
            # cast val to float to be sure it is num
            activity_num = float(activity_str)
        except Exception as e:
            LOG.error('while casting activity to'
                      f' float an exception has ocurred: {e}')
            activity_num = None
    # defence when prop is not in parameter file
    else:  # SDF doesn't have param prop name
        raise ValueError(f"SDFile_activity parameter '{parameters['SDFile_activity']}'"
                         " not found in input SDF."
                         "Change SDFile_activity param in parameter.yml"
                         " to match the target prop in SDF")

    return activity_num
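A possible way around this (just a sketch, not the actual fix that ended up in flame) would be to return None when the property is missing, so prediction can continue without an activity field:

def get_sdf_activity_value_sketch(mol, parameters: dict):
    # Sketch only: permissive variant that returns None instead of raising
    # when the SDFile_activity property is absent or not numeric, so query
    # compounds without an activity field can still be predicted.
    prop_name = parameters.get('SDFile_activity')
    if not prop_name or not mol.HasProp(prop_name):
        return None
    try:
        return float(mol.GetProp(prop_name))
    except ValueError:
        return None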

in idata.py: here, if you don't define the SDFile_experimental parameter, the code crashes at:

if mol.HasProp(self.parameters['SDFile_experimental']):
    exp = mol.GetProp(self.parameters['SDFile_experimental'])
    LOG.debug('Found experimental results in SDF')
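A small guard would avoid that crash; this is only a sketch and assumes self.parameters behaves like a plain dict:

# Sketch only: look up the experimental field defensively so a missing
# SDFile_experimental key does not crash the run.
exp = None
exp_prop = self.parameters.get('SDFile_experimental')
if exp_prop and mol.HasProp(exp_prop):
    exp = mol.GetProp(exp_prop)
    LOG.debug('Found experimental results in SDF')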

BielStela commented 5 years ago

well, actually the prediction workflow is a bit of a mess (see the rough sketch after this list):

  1. [.....] goes to Predict()
  2. Predict calls Apply to "apply" the prediction computation
  3. Apply loads the model pickle inside a function called run_internal()
  4. Then it projects (?) the input data to results (this is where the learner is called)
  5. The learner (or estimator) is a custom class that has a custom base class
  6. where do the results of this go? what's going on?:
    estimator.project(X, self.results)
  7. then run_internal runs external_validation (??)
  8. external validation runs if self.results (????) has a ymatrix (???)
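Roughly, the call chain described above looks like this; every name here is a simplified guess, not the real flame API:

import pickle

# Rough sketch of the predict flow described in the list above.
def predict(model_path, X, results):
    # Predict delegates to Apply, which does the actual work in run_internal()
    return run_internal(model_path, X, results)

def run_internal(model_path, X, results):
    with open(model_path, 'rb') as f:
        estimator = pickle.load(f)      # Apply loads the pickled estimator
    estimator.project(X, results)       # 'project' fills results with the predictions
    if 'ymatrix' in results:            # labels were supplied with the input...
        external_validation(results)    # ...so the predictions get scored against them
    return results

def external_validation(results):
    # placeholder: compare results['values'] against results['ymatrix']
    pass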

Given this comment just below:

 # TODO: implement this for every prediction

flame runs this external_validation every s i n g l e time it has to do a predict?

what is external validation??

if it's testing the model with an unseen dataset (with labels), it should be placed in the learning module, not in the predict module.
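For what it's worth, "external validation" in QSAR usually just means scoring the model on labelled data it never saw during training; a generic sketch with scikit-learn (not flame's code) would be:

from sklearn.metrics import mean_squared_error, r2_score

# Generic sketch of external validation for a regression model:
# predict on held-out labelled data and report quality metrics.
def external_validation_sketch(estimator, X_ext, y_ext):
    y_pred = estimator.predict(X_ext)
    return {'R2': r2_score(y_ext, y_pred),
            'MSE': mean_squared_error(y_ext, y_pred)}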

BielStela commented 5 years ago

The predict workflow should be clarified, cleaned and documented

kpinto-gil commented 5 years ago

Output terminal, predict with external validation:

/home/kpinto/miniconda3/envs/kpi36/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.
  from numpy.core.umath_tests import inner1d
(88, 111)
(88, 111)
INFO - Prediction finished. flame predict : True

I realized that:

If external validation is performed:

manuelpastor commented 5 years ago

This was never a bug, but a series of misunderstandings about the program behaviour:

  1. Bet, the predictions appear in the command line output labelled as "values". When activities are also included, they are shown as well
  2. Somebody modified the code to produce an error when Y values are not found. This is not correct and has been amended. Now predictions run whether activity values are present or not; in the latter case, no external prediction is attempted (see the sketch after this list)
  3. Biel, the predict code seems very straightforward to me, but the documentation is being improved
  4. Kevin, I cannot reproduce your output. Please provide more details. Your suggestions about the TSV labelling must be reported elsewhere
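A rough sketch of the behaviour described in point 2 above; the flow is inferred from this thread, and compute_prediction / run_external_validation are placeholder names, not flame functions:

def predict_record(mol, parameters, results):
    # Sketch of the amended behaviour (assumed, not the actual commit):
    # always predict, and only attempt external validation when an
    # activity value was actually found in the input SDF.
    activity = None
    prop_name = parameters.get('SDFile_activity')
    if prop_name and mol.HasProp(prop_name):
        try:
            activity = float(mol.GetProp(prop_name))
        except ValueError:
            activity = None

    results['values'] = compute_prediction(mol)    # placeholder predictor call
    if activity is not None:
        results['ymatrix'] = activity              # labels available for this record
        run_external_validation(results)           # placeholder validation hook
    return results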