- Possibly group by `inchikey` for `known_smiles` instead of the `smiles` column?
- We may wish to specify dtypes explicitly instead of relying on `df = pd.DataFrame(data, columns=columns).convert_dtypes()` (have a tuple as the value in the `molecular_properties` dict, for example).
- Line 214 in `calculate_outcomes`: `"% unique": len(bin_df) / len(bin_df),` always evaluates to 1 and should be `"% unique": len(bin_df) / bin_df["size"].sum(),` (the sum is already a scalar, so wrapping it in `len()` would raise a `TypeError`).
The diff for calculate_outcomes.csv is now as we would expect. The last bullet point above is probably worth fixing before we merge; the remaining two can be made into issues.
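To make the `"% unique"` point concrete, here is a minimal sketch. It assumes `bin_df` has one row per unique molecule and that its `size` column holds how many times each one was sampled. The toy frame below is made up for illustration and is not taken from `calculate_outcomes`.

```python
import pandas as pd

# Hypothetical bin: one row per unique molecule, "size" = number of times sampled.
bin_df = pd.DataFrame(
    {"smiles": ["CCO", "CCN", "c1ccccc1"], "size": [5, 3, 2]}
)

n_unique = len(bin_df)          # number of unique molecules in the bin
n_total = bin_df["size"].sum()  # total number of sampled molecules in the bin

# Current expression: numerator and denominator are identical.
current = len(bin_df) / len(bin_df)

# Proposed expression: unique molecules over all sampled molecules.
proposed = n_unique / n_total

print(current, proposed)
```

With these toy counts the current expression gives 1.0 regardless of the data, whereas the proposed one gives 3 / 10.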
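On the dtype point, one possible reading of the tuple suggestion is that each value in `molecular_properties` becomes a `(calculator, dtype)` pair, so the dtypes can be passed to `astype` rather than inferred by `convert_dtypes()`. The sketch below is only an illustration of that reading: the dict contents, the toy calculators, and the sample SMILES are all hypothetical and do not reflect the real `molecular_properties`.

```python
import pandas as pd

# Hypothetical stand-in for molecular_properties: name -> (calculator, dtype).
# The calculators are toy string functions, not real descriptors.
molecular_properties = {
    "length": (len, "Int64"),
    "carbon_fraction": (lambda s: s.count("C") / len(s), "Float64"),
}

smiles = ["CCO", "CCN", "CCCC"]  # made-up inputs

columns = list(molecular_properties)
data = [
    [calculator(s) for calculator, _ in molecular_properties.values()]
    for s in smiles
]

# Collect the declared dtypes and apply them explicitly.
dtypes = {name: dtype for name, (_, dtype) in molecular_properties.items()}
df = pd.DataFrame(data, columns=columns).astype(dtypes)

print(df.dtypes)
```

Keeping the dtype next to the calculator means a new property cannot be added without also declaring its type, which is the main appeal over inference. Whether that is worth the extra verbosity is something we could settle in the issue.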