Open ChenchaoZhao opened 1 year ago
Hi @ChenchaoZhao , thanks for your interest!
I have not considered either of those. I have found that the pickle format works well enough for my needs. Is there something in particular that makes using this format difficult? Also, if you are interested in contributing by converting the datasets to other formats, I would be happy to host them!
Hi @jonathanking thank you for the comment!
Pickle is not considered secure in production. How should I contribute if I generate the parquet files?
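To make the security concern concrete: loading a pickle can call any importable function chosen by whoever wrote the file. Below is a minimal sketch using only the standard library. The `Payload` class is hypothetical and the callable is deliberately harmless (`sorted`), but a malicious file could name a far more dangerous one.

```python
import pickle


class Payload:
    """Hypothetical class showing that unpickling can invoke a callable."""

    def __reduce__(self):
        # pickle stores "call sorted([3, 1, 2])" instead of the object's state.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# Loading runs the stored call; the result is not a Payload instance at all.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

This is why a pickle file should only be loaded when its source is trusted.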
I was thinking about how to proceed, and here are my thoughts.
I'm going to release an updated version of SidechainNet in a little while. I think we can wait on creating parquet files until then. However, if you are really interested in contributing, you could write a function, or describe how you might convert the current format (a dictionary with keys and values of various types) into one compatible with the parquet format. Then we could use that code or general idea when we move forward and release the next version of the code and data.
I'm just not familiar with the format myself, so I'd have to investigate how to reformat the existing data. I see something about formatting it into a DataFrame and then writing a parquet file, so maybe it's not so complicated. It would just need to be able to handle the different kinds of data stored in the dictionary currently (arrays, lists, strings). Let me know what you think!
Will there be additional features in the next release? Based on my understanding, the current version can probably be converted using the Huggingface `datasets.Dataset` method `from_dict`; see https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.from_dict
Then you can upload the data to the Huggingface Hub for more visibility, or save it in `parquet` format (the most compact format) or `arrow` format. Both support nested fields.
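A rough sketch of what that conversion might look like. It assumes the pickled data can be viewed as a dictionary of per-protein records. The `toy` layout, the field names, and the helpers `to_columns` and `save_as_parquet` are all hypothetical, not the actual SidechainNet structure. `Dataset.from_dict` expects one list per column, so the records are first pivoted into columns of plain Python values.

```python
from typing import Any, Dict, List


def to_columns(records: Dict[str, Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot {record_id: {field: value}} into {field: [value, ...]}."""
    field_names = sorted({name for rec in records.values() for name in rec})
    columns: Dict[str, List[Any]] = {"id": []}
    columns.update({name: [] for name in field_names})
    for rec_id, rec in records.items():
        columns["id"].append(rec_id)
        for name in field_names:
            value = rec.get(name)  # missing fields become None
            if hasattr(value, "tolist"):  # e.g. NumPy arrays -> nested lists
                value = value.tolist()
            columns[name].append(value)
    return columns


def save_as_parquet(columns: Dict[str, List[Any]], path: str) -> None:
    """Write the columns to a parquet file via Hugging Face `datasets`."""
    from datasets import Dataset  # requires `pip install datasets`

    Dataset.from_dict(columns).to_parquet(path)


# Hypothetical toy data standing in for the unpickled dictionary.
toy = {
    "1abc": {"seq": "MKV", "angles": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]},
    "2xyz": {"seq": "GA", "angles": [[0.7, 0.8], [0.9, 1.0]], "resolution": 1.8},
}
columns = to_columns(toy)
print(list(columns))
```

Calling `save_as_parquet(columns, "toy.parquet")` would then write the file. Variable-length nested lists such as `angles` should map onto nested list columns.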
Yes, I have a handful of features and data standardizations/improvements that I’ve been working with on my research branches that I plan to add to the next release.
Thanks so much for pointing out that function! I didn’t think it would be that easy, but that sounds like a great option. I’ll keep that in mind for when I regenerate the data. I appreciate the help!
Any plans for Huggingface `datasets` integration? Instead of using a pickled dictionary, it is probably better practice to use the `arrow` or `parquet` format. It should be pretty easy to convert to the Huggingface format.