
idiap / gile

A generalized input-label embedding for text classification

GNU General Public License v3.0


english.json #4

Open terry07 opened 4 years ago

terry07 commented 4 years ago

First of all, thanks for the whole repo.

I am trying to reproduce the ZSL task on the BioASQ data, and an english.json file is requested that I cannot find anywhere here. Could you give me a hint?

nik0spapp commented 4 years ago

Thanks for your interest!

Are you using the run.py stored under hdf5/ for the BioASQ experiment? Normally, this file is required only if the --pretrained option is on. By default, the word embeddings for this experiment are learned end-to-end. If you are interested in using this option, then you need to download the word embeddings and point to them with the --wordemb_path argument.

Let me know if that solves the problem.

terry07 commented 4 years ago

Thanks for the answer. I was using the run.py in the main folder. However, after converting the code to run on Python 3, the english-xx.h5 files that are placed in the test folder were also expected in the train folder. After that, I get the following error:

'OSError: Unable to open file (truncated file: eof = 129956355, sblock->base_addr = 0, stored_eof = 258265696)'
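The `eof` and `stored_eof` values in this message suggest that the file on disk is shorter than the length recorded in its own header. A minimal sketch of how that could be checked with the standard library alone is shown below. It assumes the HDF5 superblock sits at byte 0 with a base address of 0 (as the message reports), and the function names `stored_eof` and `is_truncated` are hypothetical helpers, not part of this repository. Opening the file with `h5py.File(path, "r")` remains the more thorough check.

```python
import os

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def stored_eof(header: bytes) -> int:
    """Return the end-of-file address recorded in an HDF5 superblock.

    Assumption: the superblock starts at byte 0 of the file.
    """
    if header[:8] != HDF5_SIGNATURE:
        raise ValueError("not an HDF5 file (signature missing at offset 0)")
    version = header[8]
    if version in (0, 1):
        size_of_offsets = header[13]
        # Version 1 has 4 extra bytes before the address fields.
        addresses_start = 24 if version == 0 else 28
    elif version in (2, 3):
        size_of_offsets = header[9]
        addresses_start = 12
    else:
        raise ValueError(f"unsupported superblock version {version}")
    # Address fields: base address, one other address, then end-of-file address.
    eof_start = addresses_start + 2 * size_of_offsets
    raw = header[eof_start:eof_start + size_of_offsets]
    if len(raw) != size_of_offsets:
        raise ValueError("header too short to contain the end-of-file address")
    return int.from_bytes(raw, "little")


def is_truncated(path: str) -> bool:
    """True if the file is shorter than the length its superblock records."""
    with open(path, "rb") as f:
        header = f.read(64)
    return os.path.getsize(path) < stored_eof(header)
```

Running `is_truncated` on each english-xx.h5 file would flag those that were cut short, for example by an interrupted download or an incomplete extraction.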

I am actually interested in running just the Biomedical semantic indexing experiment, but I get errors when trying to run the 'python run.py --languages english --data_path data/bioasq/ --path exp/gile-wan --train \ --wdim 100 --bs 64 --sampling 0.03 --la --ladim 500 --lpad 50 --maskedavg' command.

First, do you have a variant for Python 3 or Windows? And could you please list the steps necessary to run just this experiment?

Thanks for your time.

nik0spapp commented 4 years ago

Good catch! It looks like the train/ folder just hasn't been stored properly in the zip file (it has only 4 out of 65 files); fixing this might take some time. Note that both the train/ and test/ folders are expected to contain english-xx.h5 files, which (of course) have no content overlap.
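To see how many .h5 files an archive actually holds in each folder before extracting it, something like the following sketch could be used. The helper name `count_h5_per_folder` is hypothetical, and the only assumption about the archive is that its entries sit under folders such as train/ and test/.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_h5_per_folder(archive) -> Counter:
    """Count .h5 entries in a zip archive, grouped by parent folder name.

    `archive` may be a path or a binary file-like object.
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)  # zip entry names use forward slashes
            if path.suffix == ".h5":
                counts[path.parent.name] += 1
    return counts
```

Calling it on the downloaded zip and comparing `counts["train"]` with `counts["test"]` would show whether one of the folders is incomplete.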

In the meantime, you could parse the BioASQ dataset in h5 format from Nam et al. (2016). Their GitHub repository is https://github.com/JinseokNam/AiTextML. Note that we follow the same setup as they do, and the provided dataset was meant as an extra convenience.

Currently, the code supports only Python 2.7. The same holds for the CUDA/cuDNN version and other dependencies listed under the Installation section.

Hope it helps.