I think ordering any list before serialization is likely the way to go. This is also what they do in qcelemental (to hash complex Python types such as list, dict, etc., I recommend serializing them to JSON first).
On `List[Tuple[int, List[int]]]`: shouldn't that be a simpler `List[int]` instead? I think we already discussed it, but I still don't understand why we need that second dimension for the targets, since that information is already available with `target_cols`.
Find a quick prototype below:

```python
import hashlib
import json

# Sorting the list before serializing it to JSON makes the hash
# permutation-invariant. The `default` converts numpy arrays to lists.
l1 = [5, 8, 6, 44, 2, 34]
m = hashlib.sha1()
s = json.dumps(sorted(l1), default=lambda x: x.ravel().tolist())
m.update(s.encode("utf-8"))
m.hexdigest()
# 4d14056d80b733928d468427dad1506ea15586c8

# The same elements in a different order yield the same hash.
l2 = [5, 8, 44, 6, 2, 34]
m = hashlib.sha1()
s = json.dumps(sorted(l2), default=lambda x: x.ravel().tolist())
m.update(s.encode("utf-8"))
m.hexdigest()
# 4d14056d80b733928d468427dad1506ea15586c8
```
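The same idea should extend to the nested multi-task format by sorting recursively. A minimal sketch, assuming the `List[Tuple[int, List[int]]]` format described in this issue (the `canonicalize` helper is illustrative, not part of the codebase):

```python
import hashlib
import json

def canonicalize(split):
    # Sort the inner target lists first, then the (index, targets) pairs,
    # so that neither level of ordering affects the hash.
    return sorted((i, sorted(targets)) for i, targets in split)

split_a = [(0, [1, 2]), (3, [0])]
split_b = [(3, [0]), (0, [2, 1])]  # the same split, differently ordered

hash_a = hashlib.sha1(json.dumps(canonicalize(split_a)).encode("utf-8")).hexdigest()
hash_b = hashlib.sha1(json.dumps(canonicalize(split_b)).encode("utf-8")).hexdigest()
assert hash_a == hash_b
```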
> On `List[Tuple[int, List[int]]]`: shouldn't that be a simpler `List[int]` instead? I think we already discussed it, but I still don't understand why we need that second dimension for the targets, since that information is already available with `target_cols`.
`target_cols` indicates the set of all possible targets. A multi-task split might put a single datapoint in both the train and test set, but with a different subset of the targets. It's worth noting that you do not have to specify a multi-task split as such: a `List[int]` works too and is converted to the expected format by assuming you want to specify all targets.

I expect many ML practitioners to consider such a split data leakage and bad practice, but I can see it being a representative setting of real life, where you start with a sparse matrix and want to predict the missing values (e.g. drug-drug synergies).
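As a concrete illustration of such a split (the structure follows the `List[Tuple[int, List[int]]]` format; the values are made up):

```python
# Datapoint 7 appears in both the train and test set, but with disjoint
# subsets of the three targets, so no (datapoint, target) pair is shared.
train = [(0, [0, 1, 2]), (7, [0])]
test = [(3, [0, 1, 2]), (7, [1, 2])]
```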
Ok, got it. As you said, I don't think it's something we should encourage, to prevent data leakage, and I would expect most of our benchmarks not to do that. As for tasks that require pairs (drug-drug, for example), the inputs will be two columns, no?

For sparse target columns, maybe we should have mask columns instead?
> As for tasks that require pairs (drug-drug, for example), the inputs will be two columns, no?

Yeah, that's a good point! I like that better, as it will simplify the API quite a bit as well.
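To make the two-column idea concrete, a minimal sketch of what a pairwise dataset could look like (the column names are illustrative, not from the codebase):

```python
import pandas as pd

# A hypothetical drug-drug synergy table: the pair is represented by two
# input columns and the synergy score is a single, dense target column.
df = pd.DataFrame({
    "drug_1": ["aspirin", "ibuprofen"],
    "drug_2": ["caffeine", "paracetamol"],
    "synergy": [0.7, 0.2],
})
```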
> For sparse target columns, maybe we should have mask columns instead?

This is actually something I was about to open an issue for, as I encountered it yesterday. I was creating a single-task `BenchmarkSpecification` from a multi-task `Dataset`, which resulted in NaN values. I'll open a separate issue and will xref it here.
xref: #11
> For sparse target columns, maybe we should have mask columns instead?

But actually, that's not an issue. For a multi-task dataset, I think it makes more sense to always return a (possibly NaN) value for each of the endpoints and to let the user handle the NaNs.
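As a sketch of what handling those NaNs could look like on the user's side (the arrays are illustrative, not from the codebase):

```python
import numpy as np

# One row per datapoint, one column per endpoint; NaN marks targets that
# are not annotated for a given datapoint.
y_true = np.array([[0.5, np.nan], [np.nan, 1.2]])
y_pred = np.array([[0.4, 0.9], [0.3, 1.0]])

# Only score the endpoints that are actually annotated.
mask = ~np.isnan(y_true)
mse = np.mean((y_true[mask] - y_pred[mask]) ** 2)
print(mse)  # 0.025
```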
Currently, the split is not used in computing the `md5sum` checksum of a `BenchmarkSpecification`. See here. The reason for this is that the split type is quite complex (see here). During data validation (through Pydantic's `@validator`), the splits are preprocessed to one of two formats:

- For a `SingleTaskBenchmarkSpecification`, it is a `List[int]`.
- For a `MultiTaskBenchmarkSpecification`, it is a `List[Tuple[int, List[int]]]`.

Besides the complex format, we also want to make sure that the hash is permutation-invariant.
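For context, a minimal sketch of what that validator-based preprocessing could look like with Pydantic v1 (the model and field names are simplified and hypothetical, not the actual implementation):

```python
from typing import List, Tuple, Union

from pydantic import BaseModel, validator

class MultiTaskBenchmarkSpecification(BaseModel):
    # Simplified: only the number of targets and a single split are shown.
    n_targets: int
    split: Union[List[int], List[Tuple[int, List[int]]]]

    @validator("split")
    def preprocess_split(cls, v, values):
        # Promote a plain List[int] to the multi-task format by assuming
        # that every datapoint specifies all targets.
        if v and not isinstance(v[0], tuple):
            all_targets = list(range(values["n_targets"]))
            return [(i, all_targets) for i in v]
        return v

spec = MultiTaskBenchmarkSpecification(n_targets=2, split=[0, 3])
print(spec.split)  # [(0, [0, 1]), (3, [0, 1])]
```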