n-waves / multifit

Code to reproduce the results from the paper "MultiFiT: Efficient Multi-lingual Language Model Fine-tuning" (https://arxiv.org/abs/1909.04761)
MIT License

How to set up the input data format #60

Closed KelvinBull closed 4 years ago

KelvinBull commented 4 years ago

Hi, I am trying to reproduce your results with your scripts, but I don't know how to set up the format of the input data, in particular what the final files should look like. Could you give me an example to make it clear? For instance, the cls-acl10-unprocessed dataset actually consists of .xml files, so is it processed into .csv files? What will the final .csv file look like? A short snippet would be enough. Thank you in advance.

blazejdolicki commented 4 years ago

I'm also wondering about it. Probably they just used tag names in .xml ('category','rating','realname'... etc.) as column names in .csv, but it would be nice to get a confirmation.
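To make that guess concrete, here is a minimal sketch of such a conversion. It is not the authors' preprocessing code. The sample XML, the `<items>`/`<item>` nesting, the `text` tag, and the helper name `xml_items_to_csv` are all assumptions for illustration; only the tag names 'category', 'rating' and 'realname' come from the files mentioned above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the real files may use different tags or nesting.
SAMPLE_XML = """<items>
  <item>
    <category>books</category>
    <rating>5.0</rating>
    <realname>Jane Doe</realname>
    <text>Great read.</text>
  </item>
  <item>
    <category>books</category>
    <rating>1.0</rating>
    <realname>John Roe</realname>
    <text>Not for me.</text>
  </item>
</items>"""


def xml_items_to_csv(xml_text, item_tag="item"):
    """Flatten each <item> element into one CSV row, using child tag names as columns."""
    root = ET.fromstring(xml_text)
    rows = []
    columns = []  # column order follows the first appearance of each tag
    for item in root.iter(item_tag):
        row = {}
        for child in item:
            row[child.tag] = (child.text or "").strip()
            if child.tag not in columns:
                columns.append(child.tag)
        rows.append(row)
    buffer = io.StringIO()
    # restval fills in an empty string when an item lacks one of the tags
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = xml_items_to_csv(SAMPLE_XML)
print(csv_text)
```

If the real preprocessing keeps only some columns, or renames them, the header row would differ from what this sketch produces.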

blazejdolicki commented 4 years ago

@KelvinBull I know it's an old issue, but here's the solution: to get the data in .csv format just run python prepare_cls.py https://storage.googleapis.com/ulmfit/cls1 as suggested in this issue: https://github.com/n-waves/multifit/issues/32#issuecomment-464773677
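Once the script has produced the files, a quick way to see what a .csv looks like is to print its first few rows. This is only a sketch: the helper name `preview_csv` and the path in the usage comment are hypothetical, and it makes no assumption about which columns the generated files contain.

```python
import csv
import itertools


def preview_csv(path, n_rows=3):
    """Return the first n_rows rows of a CSV file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(itertools.islice(csv.reader(handle), n_rows))


# Usage (hypothetical path, replace with a file the script actually wrote):
# for row in preview_csv("path/to/train.csv"):
#     print(row)
```

Using `csv.reader` instead of splitting on commas keeps quoted review text with embedded commas or newlines in a single field.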

eisenjulian commented 4 years ago

Hey @KelvinBull @blazejdolicki, sorry for the delay. The code to parse the original dataset into the csv files was not merged into master. It lives at https://github.com/n-waves/multifit/blob/datasets/prepare_cls.sh. Let me know if you have any questions.