rdk / p2rank

P2Rank: Protein-ligand binding site prediction tool based on machine learning. Stand-alone command line program / Java library for predicting ligand binding pockets from protein structure.
https://rdk.github.io/p2rank/

Implement csv-feature #34

Closed: skodapetr closed this issue 4 years ago

skodapetr commented 4 years ago

Implementation of a CSV feature that is able to load atom/residue feature values from external CSV files.

The basic idea is to allow a user to load a custom feature from an external file. From this perspective, it is very similar to the conservation feature, but the aim here is to be more general and thus lower the entry barrier for the development of new features.

The feature is called csv_file_atom_feature and must be added to the extra features in order to be used. Next, the user must provide the directories where the CSV files are located using the csv_file_feature_directories option.

The whole command may then look like this:

traineval -t .\datasets\chen11.ds -e .\datasets\joined(mlig).ds -extra_features 'csv_file_atom_feature'  -csv_file_feature_directories ',.\custom-feature,' 

Multiple directories may be provided as a list. The enclosing commas in the argument are needed because the command-line parser ignores the first and last character of a list argument; by default it also treats the dot as a separator, which is obviously not optimal for paths.

In the directory, there must be a file for each PDB structure. For example, if there is a PDB file called 148lE.pdb then a file called 148lE.pdb.csv must be in the directory.

The file must be a valid CSV file with the following header: "chain","ins. code","seq. code","{feature-name}", where {feature-name} can be any valid column name (the value is not used). The chain, ins. code and seq. code columns are used to identify the residue.
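For illustration, a residue-level file following that header might look like this (the feature column name and the values are hypothetical):

```csv
"chain","ins. code","seq. code","my_feature"
"A","","12","0.75"
"A","","13","-0.31"
"B","A","57","1.20"
```

Here my_feature is an arbitrary column name; the last row identifies residue 57 with insertion code A on chain B.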

The reader also supports per-atom features via the following header: "pdb serial","{feature-name}", where pdb serial refers to the atom's serial number in the PDB file.

It is also possible to provide multiple value columns, with custom names, in a single file.
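For example, an atom-level file with two custom value columns might look like this (column names and values are hypothetical):

```csv
"pdb serial","charge","buriedness"
"1","0.12","0.80"
"2","-0.40","0.35"
"3","0.05","0.61"
```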

In case of an error, the feature throws an exception; the exception is logged but does not stop the execution, not even with -failFast 1. That may be an improvement to be made.

rdk commented 4 years ago

This is great. I have only a couple of cosmetic suggestions.

Also, in other places (e.g. in the output file predictions.csv), p2rank uses this format of unique residue identifier: "{chainid}_{seq_num}{ins_code}", for example "C_124A". You can use this format instead of the 3 columns "chain","ins. code","seq. code", but it's up to you.
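With that convention, the residue identification could be a single column instead of three. A hypothetical sketch (the "residue" column name is made up; only the identifier format comes from the comment above):

```csv
"residue","my_feature"
"C_124A","0.75"
"C_125","-0.31"
```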

Something should be done about the formatting of list-of-strings parameters on the command line - somehow unify the use of "." / "," as separators. Also: the csv feature as it is implemented now doesn't make it easy to compare different csv feature sets within one run (the way varying the extra_features parameter does). This is for further discussion, but I was thinking about something like using sub-features within extra_features and then passing them to the csv feature implementation. For example: "-extra_features '(csv.feature1,csv.feature2)'" would mean that you want only the "feature1" and "feature2" columns from your csv files.

Regarding respecting -failFast 1: that is always a good idea, although it is not implemented everywhere in p2rank. But it shouldn't be too complicated in this case; see for example cz/siret/prank/features/implementation/conservation/ConservationFeature.groovy:47

skodapetr commented 4 years ago

Ok, I'll change the names. The idea behind emphasizing "atom" features was that there are two interfaces for features, Atom and Residue. I plan to implement csv_file_residue_feature as well, but it seems that the Residue feature support is not complete. The names have been updated.

About the compact chain-seqNum-insCode format: there is already a student working with the current three-column format. That is why we may implement the compact one later, should there be demand from users.

As for the comparison of different features: I think it would be more beneficial if each feature could actually be given a name, because now, even when multiple features are provided, they appear as a single feature. That would mean something like generating instances of AtomFeatureCalculator on the fly. Again, I would leave this to be driven by user experience; as I've mentioned, we already have a user of this functionality.

As for the fail fast, you pointed to the location with this code:

            // honour -fail_fast: abort on error, otherwise just log a warning
            if (params.fail_fast) {
                throw new PrankException(msg)
            } else {
                log.warn msg
            }

The CsvFileFeature throws for every problem. I think that is good behaviour: if there is a problem, there should be a better solution than just ignoring it.

rdk commented 4 years ago

> I think that is good behaviour: if there is a problem, there should be a better solution than just ignoring it.

Not sure what you are trying to say. I think it is good behaviour to end with an error if you ask the program to fail fast. This is a necessity: say you are building a dataset with csv feature data; you want the program to tell you if your data is incomplete, inconsistent with the pdb files, or if there is a problem with the format and parsing.

skodapetr commented 4 years ago

> I think that is good behaviour: if there is a problem, there should be a better solution than just ignoring it.

> Not sure what you are trying to say. I think it is good behaviour to end with an error if you ask the program to fail fast. This is a necessity: say you are building a dataset with csv feature data; you want the program to tell you if your data is incomplete, inconsistent with the pdb files, or if there is a problem with the format and parsing.

I agree; that is why I opted to throw the exception without checking params.fail_fast at all.

rdk commented 4 years ago

Ah, OK, I hadn't noticed it in the code. That is certainly better than not throwing an exception at all. But there is a case for having the possibility to ignore a problem with one specific feature on a few specific proteins. Say you want to run predictions on the whole PDB and you don't want the program to fail on you each time there is an error.

I think P2Rank should make the best effort to produce the prediction when it is executed with the param -fail_fast 0, even if some features fail to load or compute. From experience: the model is robust enough to produce sensible predictions even then.

However, I think P2Rank should collect errors when it is executed with -fail_fast 0 and then somehow pass them to PrankWeb, so the user can see there were problems in making a particular prediction.

skodapetr commented 4 years ago

> Ah, OK, I hadn't noticed it in the code. That is certainly better than not throwing an exception at all. But there is a case for having the possibility to ignore a problem with one specific feature on a few specific proteins. Say you want to run predictions on the whole PDB and you don't want the program to fail on you each time there is an error.

Then I would expect this to be handled not in the feature computation "layer" but in the protein processing layer. The processing layer would capture the exception, log an error, and skip the computation for the given protein, as sketched below.
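A minimal Groovy sketch of that idea (class and method names are hypothetical; this is not the actual p2rank code):

```groovy
import groovy.util.logging.Slf4j

@Slf4j
class ProteinProcessor {

    boolean failFast  // corresponds to the -fail_fast param

    // Features may throw unconditionally; this layer decides whether an error
    // aborts the whole run or just skips the one affected protein.
    void processAll(List<String> proteinFiles) {
        proteinFiles.each { protein ->
            try {
                process(protein)  // may throw, e.g. from CsvFileFeature
            } catch (Exception e) {
                if (failFast) {
                    throw e       // -fail_fast 1: stop the whole run
                }
                log.error "Skipping ${protein}: ${e.message}"  // -fail_fast 0: log and continue
            }
        }
    }

    void process(String protein) {
        // feature calculation and prediction for one protein would go here
    }
}
```

The point is that the skip-or-abort decision lives in one place, while individual features can keep throwing unconditionally.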

> I think P2Rank should make the best effort to produce the prediction when it is executed with the param -fail_fast 0, even if some features fail to load or compute. From experience: the model is robust enough to produce sensible predictions even then.

This is a matter of personal preference; I would prefer the method to fail rather than to produce possibly misleading data. Consider a situation where there is a systematic error in the processing of a certain protein family. While this may not happen in p2rank itself, the csv feature data is external and thus outside of our control.

> However, I think P2Rank should collect errors when it is executed with -fail_fast 0 and then somehow pass them to PrankWeb, so the user can see there were problems in making a particular prediction.

Well, I'm not sure how this situation is handled right now. The easiest way is to use log messages; the best solution would be to store logs in a structured format and provide a log browser. But as far as I know, there is no demand for this functionality as of now.