Open iamsiva11 opened 7 years ago
@kaplun is the HEP keyword dataset open for public use?
@jstypka Just wanted to check the status of this. Any progress?
@kaplun 👆
Hi @jstypka Sure. It was generated entirely from public data, and the INSPIRE license on metadata is CC0.
@jstypka however, I am not sure we have the original dataset at hand. You set up the original instance of magpie back in the day...
Found the original repo, though it's 7.5GB. I could share it via a CERNBox shared space.
When I was working on it, there was a massive XML file that was accessible under a public URL.
Ah sure. All the generated data can be recreated from: http://inspirehep.net/dumps/inspire-dump.html
@iamsiva11 under this URL ☝️ you can download the HEP file, which is a massive gzipped XML. Each entry in it corresponds to a publication and one of the fields in the XML (don't remember the enigmatic name unfortunately) lists the keywords.
The vocabulary of keywords is in the tens of thousands, while there are hundreds of thousands of samples (or perhaps even millions), so the dataset should be pretty good for your needs.
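For anyone following along, here is a rough sketch of how the keywords could be streamed out of such a gzipped XML dump without loading it all into memory. It assumes a MARCXML-style layout (`record` / `datafield` / `subfield` elements). The thread does not name the keyword field, so the `keyword_tag="695"` and `keyword_code="a"` defaults are assumptions to check against the actual file, not something confirmed here.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop an XML namespace prefix: '{uri}record' -> 'record'."""
    return tag.rsplit("}", 1)[-1]


def iter_record_keywords(source, keyword_tag="695", keyword_code="a"):
    """Yield one list of keywords per <record> in a gzipped XML dump.

    `source` is a path or a binary file object. The element names and the
    default tag/code values are ASSUMPTIONS about the dump layout.
    """
    with gzip.open(source, "rb") as fh:
        context = ET.iterparse(fh, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event != "end" or local_name(elem.tag) != "record":
                continue
            keywords = []
            for field in elem.iter():
                if local_name(field.tag) != "datafield":
                    continue
                if field.get("tag") != keyword_tag:
                    continue
                for sub in field:
                    if (local_name(sub.tag) == "subfield"
                            and sub.get("code") == keyword_code
                            and sub.text):
                        keywords.append(sub.text.strip())
            yield keywords
            root.clear()  # release already-processed records to keep memory flat


def count_keywords(source, **kwargs):
    """Build a keyword -> frequency table, i.e. the label vocabulary."""
    counts = Counter()
    for keywords in iter_record_keywords(source, **kwargs):
        counts.update(keywords)
    return counts
```

Calling `count_keywords` on the downloaded file (whatever its local filename is) would give the vocabulary size via `len(...)`, which is a quick way to confirm the label count is in the range mentioned above before training anything.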
Cheers!
PS Let us know if you get some results on this dataset. We would be interested to compare them to what we're getting.
Hi, first of all thank you for open sourcing this library.
Is there any way to get access to datasets other than the default one in the repo? I'm looking for very large datasets with very large label sets (more than 10K labels). Moreover, most of the public multi-label datasets are only available in pre-processed ARFF format.
Thanks in advance