papercopilot / paperlists


huggingface Dataset #2

Open hunoutl opened 1 month ago

hunoutl commented 1 month ago

Thank you for this repo, which saved me time on a quick analysis of CVPR24. I would like to give back to the community, so I made a first draft of what paperlists could look like in Hugging Face datasets format: https://huggingface.co/datasets/hunoutl/paperlists

It is a raw dataset: I kept most of the keys and applied them to all of the papers. Do you have any ideas on how everything could be standardized?

For now I have some simple merging code; I will try to find time to clean it up and share it. For CVPR, I generated synthetic data using an LLM to fill in missing information and add new fields (such as country of affiliation). I'm thinking of extending this to all the papers in the future.
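As a rough illustration of what such a merge could look like (this is not the actual merging code, which has not been shared yet), here is a minimal sketch that combines per-conference lists with different keys into rows sharing one set of columns. The function name `merge_records`, the `source` column, and the sample field names are all hypothetical.

```python
from typing import Any


def merge_records(sources: dict[str, list[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Merge per-conference paper lists into rows that share one set of keys."""
    # Collect the union of keys across every paper, in first-seen order.
    all_keys: list[str] = []
    for papers in sources.values():
        for paper in papers:
            for key in paper:
                if key not in all_keys:
                    all_keys.append(key)

    merged = []
    for source_name, papers in sources.items():
        for paper in papers:
            # Keys a paper does not have are filled with None.
            row = {key: paper.get(key) for key in all_keys}
            # Hypothetical provenance column; overwrites any existing "source" key.
            row["source"] = source_name
            merged.append(row)
    return merged


# Made-up sample data, only to show the shape of the result.
sample = {
    "cvpr2024": [{"title": "Paper A", "status": "Poster"}],
    "iclr2024": [{"title": "Paper B", "rating": "6;8"}],
}
rows = merge_records(sample)
```

Filling missing keys with `None` keeps every row on the same schema, which is usually what a tabular dataset format expects; deciding which keys to rename or drop is the standardization question raised above.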

jingyangcarl commented 1 month ago

Hi @hunoutl

Thanks for using the papercopilot/paperlists repository. It's great to have it hosted on Hugging Face (huggingface/papercopilot) as well. I actually started an organization on Hugging Face, but I haven't posted anything there yet, lol.

I've also spent some time thinking about whether we can standardize everything during development, and I believe we can. This paper list is powered by papercopilot/paperbot and is currently organized into modules by conference.

I used to put all conference papers into one large data table, using the title as the key. However, papers can share the same title, which made it difficult to identify missing papers. Therefore, I split the big standard output into smaller shards.
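To make the collision problem concrete, here is a small sketch (not paperbot's actual code) that detects titles shared by several papers and then groups papers into per-venue shards. The field names `title`, `conference`, and `year`, the helper names, and the sample records are assumptions for illustration only.

```python
from collections import Counter


def duplicate_titles(papers):
    """Return the normalized titles that occur more than once."""
    counts = Counter(p["title"].strip().lower() for p in papers)
    return sorted(title for title, n in counts.items() if n > 1)


def shard_by_venue(papers):
    """Group papers into shards keyed by (conference, year)."""
    shards = {}
    for p in papers:
        shards.setdefault((p["conference"], p["year"]), []).append(p)
    return shards


# Made-up records: two papers share a title across venues.
papers = [
    {"title": "Attention Study", "conference": "cvpr", "year": 2023},
    {"title": "attention study", "conference": "iclr", "year": 2024},
    {"title": "Another Paper", "conference": "iclr", "year": 2024},
]

collisions = duplicate_titles(papers)
shards = shard_by_venue(papers)
collisions_per_shard = {venue: duplicate_titles(group) for venue, group in shards.items()}
```

Sharding does not rule out a collision inside a single venue, but it narrows the search space when checking for missing papers.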

Still, it would be good for paperbot to produce a standardized output at export time, which would make it an easier tool to use.

Your contributions are welcome.

Best, Jing