polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/

Use the split in computing the md5sum of a `BenchmarkSpecification` #10

Closed · cwognum closed this 1 year ago

cwognum commented 1 year ago

Currently, the split is not used in computing the md5 checksum of a `BenchmarkSpecification`. See here.

The reason for this is that the split type is quite complex (see here). During data validation (through Pydantic's `@validator`), the splits are preprocessed to one of two formats:

- `List[int]`, where each integer indexes a datapoint and all targets are implied; or
- `List[Tuple[int, List[int]]]`, where each tuple pairs a datapoint index with the indices of the targets it applies to.

Besides the complex format, we also want to make sure that the hash is permutation-invariant: the same set of indices in a different order should produce the same checksum.
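To make the requirement concrete, here is a rough sketch of what folding the split into the checksum could look like. The function and argument names are illustrative only (not the actual Polaris code), and it assumes the simple `List[int]` split format:

import hashlib
import json

def checksum_with_split(benchmark_fields: dict, split: tuple) -> str:
    # Illustrative sketch only: hash the other benchmark fields first,
    # then fold in the (train, test) split. Sorting the index lists
    # before serialization makes the digest permutation-invariant.
    hasher = hashlib.md5()
    hasher.update(json.dumps(benchmark_fields, sort_keys=True).encode("utf-8"))
    train_indices, test_indices = split
    hasher.update(json.dumps(sorted(train_indices)).encode("utf-8"))
    hasher.update(json.dumps(sorted(test_indices)).encode("utf-8"))
    return hasher.hexdigest()

The multi-task `List[Tuple[int, List[int]]]` format would sort the same way, since Python compares tuples element-wise.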

hadim commented 1 year ago

I think ordering any list before serialization is likely the way to go. This is also what they do in qcelemental.

(To hash complex Python types such as lists and dicts, I recommend serializing them to JSON first.)
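For example, for dicts, `sort_keys=True` makes the serialized form independent of key insertion order:

import json

a = json.dumps({"train": [1, 2], "test": [3]}, sort_keys=True)
b = json.dumps({"test": [3], "train": [1, 2]}, sort_keys=True)
assert a == b  # key order no longer affects the serialization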


On `List[Tuple[int, List[int]]]`: shouldn't that be a simpler `List[int]` instead? I think we already discussed it, but I still don't understand why we need that second dimension for the targets, since that information is already available with `target_cols`.


Find below a quick prototype:

import json
import hashlib

# Sorting the list before serialization makes the hash
# permutation-invariant. The `default` callable handles values that
# json cannot serialize out of the box (e.g. numpy arrays) by
# flattening them to plain lists.
l1 = [5, 8, 6, 44, 2, 34]

m = hashlib.sha1()
s = json.dumps(sorted(l1), default=lambda x: x.ravel().tolist())
m.update(s.encode("utf-8"))
m.hexdigest()
# 4d14056d80b733928d468427dad1506ea15586c8

# The same elements in a different order yield the same digest.
l2 = [5, 8, 44, 6, 2, 34]

m = hashlib.sha1()
s = json.dumps(sorted(l2), default=lambda x: x.ravel().tolist())
m.update(s.encode("utf-8"))
m.hexdigest()
# 4d14056d80b733928d468427dad1506ea15586c8
cwognum commented 1 year ago

> On `List[Tuple[int, List[int]]]`: shouldn't that be a simpler `List[int]` instead? I think we already discussed it, but I still don't understand why we need that second dimension for the targets, since that information is already available with `target_cols`.

`target_cols` indicates the set of all possible targets. A multi-task split might put a single datapoint in both the train and test set, but with a different subset of the targets. It's worth noting that you do not have to specify a multi-task split as such: `List[int]` works too and is converted to the expected format by assuming you want to specify all targets.
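To illustrate the idea (this is a made-up sketch, not the actual conversion code):

from typing import List, Tuple, Union

def normalize_split(
    indices: List[Union[int, Tuple[int, List[int]]]],
    n_targets: int,
) -> List[Tuple[int, List[int]]]:
    # Hypothetical helper: a bare row index is taken to mean
    # "all targets for this row"; explicit (row, targets) tuples
    # are kept as-is.
    normalized = []
    for entry in indices:
        if isinstance(entry, int):
            normalized.append((entry, list(range(n_targets))))
        else:
            normalized.append(entry)
    return normalized

# The same datapoint (row 2) can end up in both train and test,
# but with disjoint target subsets:
train = normalize_split([0, 1, (2, [0])], n_targets=2)
test = normalize_split([(2, [1]), 3], n_targets=2)
# train == [(0, [0, 1]), (1, [0, 1]), (2, [0])]
# test  == [(2, [1]), (3, [0, 1])]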

I expect many ML practitioners to consider such a split data leakage and bad practice, but I can see it being a representative setting of real life, where you start with a sparse matrix and want to predict the missing values (e.g. drug-drug synergies).

hadim commented 1 year ago

Ok, got it. As you said, I don't think it's something we should encourage, to prevent data leakage, and I would expect most of our benchmarks not to do that. As for tasks that require pairs (drug-drug, for example), the inputs will be two columns, no?

For sparse target columns, maybe we should have mask columns instead?
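Something like this, for example (a generic pandas sketch of the idea, not an existing Polaris feature):

import pandas as pd

# Each sparse target column gets a companion boolean mask column
# indicating whether the value was actually measured.
df = pd.DataFrame({
    "smiles": ["CCO", "CCN", "CCC"],
    "target_a": [0.7, None, 1.3],
    "target_a_mask": [True, False, True],
})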

cwognum commented 1 year ago

> As for tasks that require pairs (drug-drug, for example), the inputs will be two columns, no?

Yeah, that's a good point! I like that better as it will simplify the API quite a bit as well.

> For sparse target columns, maybe we should have mask columns instead?

This is actually something I was about to open an issue for, as I encountered it yesterday. I was creating a single-task `BenchmarkSpecification` from a multi-task `Dataset`, which resulted in `NaN` values. I'll open a separate issue and xref it here.

cwognum commented 1 year ago

xref: #11

> For sparse target columns, maybe we should have mask columns instead?

But actually, that's not an issue. For a multi-task dataset, I think it makes more sense to always return a (possibly `NaN`) value for each of the endpoints and let the user handle the `NaN`s.
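For example, a user could mask out the missing endpoints before computing a metric (a generic numpy sketch, not a Polaris API):

import numpy as np

# Hypothetical multi-task targets for three datapoints; the second
# value of y_true is missing, so it comes back as NaN.
y_true = np.array([0.7, np.nan, 1.3])
y_pred = np.array([0.6, 0.2, 1.1])

# Drop the missing endpoints before computing the metric.
mask = ~np.isnan(y_true)
mse = np.mean((y_true[mask] - y_pred[mask]) ** 2)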