Closed YikunHan42 closed 1 month ago
Thank you, glad to see you're interested in this work! I should make this clearer: the processed, filtered, deduplicated version of the GS-LF dataset is actually data/datasets/NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv. I'll rename it and add docs soon.
Thanks for the clarification! I just had one more question: for the new dataset, starting from row 4567, it seems that the labels from 'bland' (column C) to 'soapy' (column AR) are still missing. Is that the intended outcome of the processing/filtering you mentioned?
Thanks again for your prompt help!
Yes, that sounds right. We are merging two datasets (goodscents, leffingwell) with many non-overlapping odorants and percepts. As a result, many odorant molecules have labels for only one source's percepts. We still want to train on those molecules, so at training time we apply a mask, for each odorant, over the percepts where no labelled data exists.
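For anyone wondering what such a mask can look like, here is a minimal NumPy sketch. It is not the repo's actual training code: the function name `masked_bce`, the NaN encoding of missing labels, and the choice of binary cross-entropy are all assumptions made for illustration.

```python
import numpy as np

def masked_bce(probs, labels):
    """Mean binary cross-entropy over labelled entries only (hypothetical helper).

    probs:  (n_molecules, n_percepts) predicted probabilities
    labels: same shape; 0.0 / 1.0 where a label exists, NaN where it is missing
            (NaN as the "missing" marker is an assumption)
    """
    mask = ~np.isnan(labels)                    # True where a label exists
    safe_labels = np.where(mask, labels, 0.0)   # placeholder so NaN never enters the math
    eps = 1e-7
    p = np.clip(probs, eps, 1 - eps)            # avoid log(0)
    per_entry = -(safe_labels * np.log(p) + (1 - safe_labels) * np.log(1 - p))
    n_labelled = mask.sum()
    if n_labelled == 0:
        return 0.0
    # Unlabelled entries are multiplied by 0 and contribute nothing to the loss.
    return float((per_entry * mask).sum() / n_labelled)

# Toy example: two molecules, four percepts, each molecule labelled by one source only.
labels = np.array([[1.0, 0.0, np.nan, np.nan],
                   [np.nan, np.nan, 1.0, 0.0]])
probs = np.full(labels.shape, 0.5)
loss = masked_bce(probs, labels)

# Changing a prediction at an unlabelled position leaves the loss unchanged.
probs_changed = probs.copy()
probs_changed[0, 2] = 0.99
loss_changed = masked_bce(probs_changed, labels)
```

Under these assumptions, rows with missing descriptors still contribute to training through the percepts they do have, which is why the empty cells are expected rather than a data error.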
Let me know if this answers your question! Closing this issue for now.
Thanks!
Thanks for the great work on this repo! I noticed something in the data/datasets/goodscents_leffingwell_all_percepts.csv file. The dataset contains 5862 molecules, as mentioned in the paper, but starting from row 4567, it seems like the odor descriptors are missing. Could you please clarify if this is expected or if there might be an issue with the data? Thank you in advance!