Goodscents-Leffingwell Dataset

microsoft / olfaction

Code for paper "Mapping the combinatorial coding between olfactory receptors and perception with deep learning"

MIT License


Goodscents-Leffingwell Dataset #4

Closed YikunHan42 closed 1 month ago

YikunHan42 commented 1 month ago

Thanks for the great work on this repo! I noticed something in the data/datasets/goodscents_leffingwell_all_percepts.csv file. The dataset contains 5862 molecules, as mentioned in the paper, but starting from row 4567, it seems like the odor descriptors are missing. Could you please clarify if this is expected or if there might be an issue with the data?

Thank you in advance!

seyonechithrananda commented 1 month ago

Thank you, glad to see you're interested in this work! I should make this clearer: the processed, filtered, deduplicated version of the GS-LF dataset is actually data/datasets/NaNs_GS_LF_isomeric_SMILES_dedup_odor_filtered.csv. I'll rename it and add docs soon so that this is easier to find.

YikunHan42 commented 1 month ago

Thanks for the clarification! I just had one more question: for the new dataset, starting from row 4567, it seems that the labels from 'bland' (column C) to 'soapy' (column AR) are still missing. Is that the intended outcome of the processing/filtering you mentioned?

Thanks again for your prompt help!
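For readers who want to reproduce this check, here is a minimal sketch of counting empty label cells per molecule. It runs on a small in-memory CSV rather than the repository file; the column names, the values, and the helper `missing_label_counts` are illustrative assumptions, not the real schema or code from the repo.

```python
import csv
import io

# Toy stand-in for the merged CSV. Column names and values are made up
# for illustration and do not reflect the real file's contents.
toy_csv = """SMILES,bland,soapy,woody
CCO,0,0,1
CCC,,,1
CCN,1,0,
"""

def missing_label_counts(handle, id_column="SMILES"):
    """Return {molecule id: number of empty or NaN label cells}."""
    counts = {}
    for row in csv.DictReader(handle):
        # Every column except the identifier is treated as a percept label.
        labels = [v for k, v in row.items() if k is not None and k != id_column]
        counts[row[id_column]] = sum(
            1 for v in labels if v is None or v.strip().lower() in ("", "nan")
        )
    return counts

counts = missing_label_counts(io.StringIO(toy_csv))
```

To run the same check on the actual file, one would pass an open file handle and the real identifier column name instead of the toy values.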

seyonechithrananda commented 1 month ago

Yes, that sounds right. We are merging two datasets (GoodScents and Leffingwell) with many non-overlapping odorants and percepts. After the merge, many odorant molecules carry percept labels from only one of the two sources, and we still want to train on them. So at training time we apply a mask, for each odorant, over the percepts where no labelled data exists.
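As a rough illustration of the masking idea described above (not the repository's actual training code), here is a minimal sketch of a binary cross-entropy loss that skips unlabelled entries. The function name `masked_bce` and the use of `None` as the missing-label marker are assumptions made for this example.

```python
import math

def masked_bce(probs, labels):
    """Mean binary cross-entropy over labelled entries only.

    probs:  rows of predicted probabilities, one value per percept
    labels: rows of 0/1 labels, with None (an assumed missing marker)
            where no annotation exists for that odorant and percept
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            if y is None:  # masked entry: contributes nothing to the loss
                continue
            p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count if count else 0.0

# Toy batch: the second molecule has no label for the first percept.
probs = [[0.9, 0.2], [0.5, 0.8]]
labels = [[1, 0], [None, 1]]
loss = masked_bce(probs, labels)
```

Because masked entries are skipped entirely, the prediction for an unlabelled percept has no effect on the loss, so a molecule annotated in only one source still contributes through the percepts it does have.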

seyonechithrananda commented 1 month ago

Let me know if this answers your question! Closing this issue for now.

YikunHan42 commented 1 month ago

> Let me know if this answers your question! Closing this issue for now.

Thanks!