inspirehep / magpie

Deep neural network framework for multi-label text classification
MIT License
684 stars, 192 forks

Datasets other than default #111

Open iamsiva11 opened 7 years ago

iamsiva11 commented 7 years ago

Hi, first of all thank you for open sourcing this library.

Is there any way to get access to datasets other than the default one in the repo? I'm looking for very large datasets with very large label sets (more than 10K labels). Moreover, most of the public multi-label datasets are only available in pre-processed ARFF format.

Thanks in advance

jstypka commented 7 years ago

@kaplun is the HEP keyword dataset open for public use?

iamsiva11 commented 6 years ago

@jstypka Just wanted to check the status of this. Any progress?

jstypka commented 6 years ago

@kaplun 👆

kaplun commented 6 years ago

Hi @jstypka Sure. It was generated entirely from public data, and the INSPIRE license on metadata is CC0.

kaplun commented 6 years ago

@jstypka however, I am not sure we have the original dataset at hand. You set up the original instance of magpie back in the day...

kaplun commented 6 years ago

Found the original repo, though it's 7.5GB. I could share it via a CERNBox shared space.

jstypka commented 6 years ago

When I was working on it, there was a massive XML file that was accessible under a public URL.

kaplun commented 6 years ago

Ah sure. All the generated data can be recreated from: http://inspirehep.net/dumps/inspire-dump.html

jstypka commented 6 years ago

@iamsiva11 under this URL ☝️ you can download the HEP file, which is a massive gzipped XML. Each entry in it corresponds to a publication and one of the fields in the XML (don't remember the enigmatic name unfortunately) lists the keywords.

The vocabulary of keywords is in the tens of thousands, while there are hundreds of thousands of samples (or perhaps even millions), so the dataset should be pretty good for your needs.
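To make the structure described above concrete, here is a minimal sketch of how one might stream the gzipped XML and collect the keywords per publication without loading the whole file into memory. The field name is not given in this thread, so the element and attribute names used below (`record`, `datafield` with `tag="695"`, `subfield` with `code="a"`) are assumptions based on common MARCXML conventions. Inspect a few records of the actual dump and adjust them before relying on this.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}record'."""
    return tag.rsplit("}", 1)[-1]


def iter_keywords(path, field_code="695", subfield_code="a"):
    """Yield one list of keywords per publication record.

    ASSUMPTION: the dump is MARCXML-like, with <record> elements holding
    <datafield tag="..."> elements that hold <subfield code="..."> elements.
    The default tag/code values are guesses, not confirmed by the thread.
    """
    with gzip.open(path, "rb") as fh:
        # iterparse reads the file incrementally instead of building
        # the full tree up front.
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local_name(elem.tag) != "record":
                continue
            keywords = []
            for field in elem.iter():
                if local_name(field.tag) != "datafield":
                    continue
                if field.get("tag") != field_code:
                    continue
                for sub in field:
                    if (
                        local_name(sub.tag) == "subfield"
                        and sub.get("code") == subfield_code
                        and sub.text
                    ):
                        keywords.append(sub.text.strip())
            # Drop the record's children so processed records do not
            # pile up in memory.
            elem.clear()
            yield keywords


def keyword_counts(path, **kwargs):
    """Count how often each keyword occurs across all records."""
    counts = Counter()
    for keywords in iter_keywords(path, **kwargs):
        counts.update(keywords)
    return counts
```

Calling `keyword_counts("dump.xml.gz")` (a hypothetical local file name) gives a quick view of the label vocabulary size via `len(...)` and of the label frequency distribution, which is useful to check before training on tens of thousands of labels.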

Cheers!

PS Let us know if you get some results on this dataset. We would be interested to compare them with what we're getting.