jonathanking / sidechainnet

An all-atom protein structure dataset for machine learning.
BSD 3-Clause "New" or "Revised" License
322 stars 36 forks

Huggingface datasets integration? #55

Open ChenchaoZhao opened 1 year ago

ChenchaoZhao commented 1 year ago

Any plans for Huggingface datasets integration?

Instead of using a pickled dictionary, it is probably better practice to use the arrow or parquet format. It should be pretty easy to convert to the Huggingface format.

jonathanking commented 1 year ago

Hi @ChenchaoZhao, thanks for your interest!

I have not considered either of those. I have found that the pickle format works well enough for my needs. Is there something in particular that makes using this format difficult? Also, if you are interested in contributing by converting the datasets to other formats, I would be happy to host them!

ChenchaoZhao commented 1 year ago

Hi @jonathanking, thank you for the comment!

Pickle is not considered secure in production, since loading a pickle file from an untrusted source can execute arbitrary code. How should I contribute if I generate the parquet files?
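For context on the security concern, here is a minimal, harmless sketch of why unpickling untrusted data is risky. The `Payload` class is purely illustrative and has nothing to do with SidechainNet; it only shows that `pickle.loads` will call whatever callable the file names.

```python
import pickle


class Payload:
    """Harmless stand-in for a malicious object embedded in a pickle file."""

    def __reduce__(self):
        # pickle stores "call print(...) on load". An attacker could name
        # os.system or any other importable callable here instead.
        return (print, ("this ran during unpickling",))


untrusted_bytes = pickle.dumps(Payload())

# Loading alone triggers the call; the return value of print (None) comes
# back in place of a Payload instance.
result = pickle.loads(untrusted_bytes)
```

Formats such as parquet and arrow store only data plus a schema, so reading them does not involve calling code named by the file.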

jonathanking commented 1 year ago

I was thinking about how to proceed, and here are my thoughts.

I'm going to release an updated version of SidechainNet in a little while. I think we can wait on creating parquet files until then. However, if you are really interested in contributing, you could perhaps write a function, or describe how you might convert the current format (a dictionary with keys and values of various types) into a structure compatible with the parquet format. Then we could use that code or general idea when we move forward and release the next version of the code and data.

I'm just not familiar with the format myself, so I'd have to investigate how to reformat the existing data. I see something about formatting it into a DataFrame and then writing a parquet file, so maybe it's not so complicated. It would just need to be able to handle the different kinds of data stored in the dictionary currently (arrays, lists, strings). Let me know what you think!
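A rough sketch of the DataFrame route, under some assumptions: that one split of the data is a dictionary mapping field names to equal-length lists with one entry per protein, and that array values expose `.tolist()` the way NumPy arrays do. The helper names `to_columns` and `write_parquet` and the keys used in the usage note are hypothetical, not part of SidechainNet.

```python
from typing import Any, Dict, List


def to_columns(split: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Turn a dict of per-protein lists into plain-Python columns.

    Arrays (anything with .tolist(), e.g. NumPy arrays) become nested
    lists; lists and strings pass through unchanged.
    """
    columns: Dict[str, List[Any]] = {}
    n_rows = None
    for key, values in split.items():
        column = [v.tolist() if hasattr(v, "tolist") else v for v in values]
        if n_rows is None:
            n_rows = len(column)
        elif len(column) != n_rows:
            # A tabular format needs every column to have the same row count.
            raise ValueError(
                f"column {key!r} has {len(column)} rows, expected {n_rows}"
            )
        columns[key] = column
    return columns


def write_parquet(columns: Dict[str, List[Any]], path: str) -> None:
    """Write the columns to a parquet file (assumes pandas and pyarrow)."""
    import pandas as pd  # imported lazily so to_columns has no dependencies

    pd.DataFrame(columns).to_parquet(path)
```

Usage would look something like `write_parquet(to_columns(data["train"]), "train.parquet")`, where `data["train"]` stands in for however one split is actually stored.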

ChenchaoZhao commented 1 year ago

Will there be additional features in the next release? Based on my understanding, the current version can probably be converted using the from_dict method of Huggingface's datasets.Dataset; see https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.from_dict

Then you can upload the result to the Huggingface Hub for more visibility, or save it in parquet format (usually the more compact of the two) or arrow format. Both support nested fields.
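A minimal sketch of that suggestion, assuming the `datasets` package is installed and the data has already been arranged as a dictionary of equal-length columns. The `toy_split` contents, the `build_dataset` helper, and the file name are made up for illustration; the real keys and shapes will differ.

```python
from typing import Any, Dict, List

# Toy stand-in for one split: two proteins of different lengths, showing
# that nested, variable-length fields are allowed.
toy_split: Dict[str, List[Any]] = {
    "id": ["protein_a", "protein_b"],
    "sequence": ["ACD", "GH"],
    "angles": [
        [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 residues
        [[0.7, 0.8], [0.9, 1.0]],              # 2 residues
    ],
}


def build_dataset(columns: Dict[str, List[Any]]):
    """Create a Hugging Face Dataset from a column-oriented dictionary."""
    from datasets import Dataset  # lazy import; needs `pip install datasets`

    return Dataset.from_dict(columns)


def export_parquet(columns: Dict[str, List[Any]], path: str) -> None:
    """Save the dataset as a parquet file using Dataset.to_parquet."""
    build_dataset(columns).to_parquet(path)
```

Calling `export_parquet(toy_split, "toy_split.parquet")` would write the file locally; uploading would go through `Dataset.push_to_hub` with a repository id of your own, after logging in to the Hub.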

jonathanking commented 1 year ago

Yes, I have a handful of features and data standardizations/improvements that I’ve been working on in my research branches that I plan to add to the next release.

Thanks so much for pointing out that function! I didn’t think it would be that easy, but that sounds like a great option. I’ll keep that in mind for when I regenerate the data. I appreciate the help!