vimalk78 closed this issue 7 years ago.
Hi! I've got the same problem. Can anyone help answer the question?
The updated code to generate the dataset can be found from this repository: https://github.com/rkadlec/ubuntu-ranking-dataset-creator
You can access the original dataset from http://cs.mcgill.ca/~npow1/data/
`ubuntu_dataset.tgz` contains the train/val/test CSV files plus the file list. The tarball is split into 5 parts, which can be joined and extracted with `cat ubuntu_dataset.tgz.a* | tar xz`
md5sums:
3503cb3531052e2796fb08359b78cc5e ubuntu_dataset.tgz.ae
994e02f9af3f3c77728bea5852e9188f ubuntu_dataset.tgz.ad
0905d2f4ba6b7985c542e875621d7cf5 ubuntu_dataset.tgz.aa
2319469adadb9a3584156937219039b8 ubuntu_dataset.tgz.ac
e753f6d47cc13e9bb322359f9ad92e08 ubuntu_dataset.tgz.ab
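If you prefer to check the downloads before extracting, the parts can be verified and joined programmatically as well. A minimal Python sketch (function names are mine, not part of any repo):

```python
import glob
import hashlib

def md5_of(path, chunk=1 << 20):
    """Compute the md5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def join_parts(pattern, out_path):
    """Concatenate the split tarball parts (sorted by suffix) into one file."""
    with open(out_path, "wb") as out:
        for part in sorted(glob.glob(pattern)):
            with open(part, "rb") as f:
                out.write(f.read())

# Usage, after downloading the five parts and comparing each md5_of()
# result against the list above:
# join_parts("ubuntu_dataset.tgz.a*", "ubuntu_dataset.tgz")
# then extract with: tar xzf ubuntu_dataset.tgz
```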
@npow Thank you so much! Do you have updated code to run the RNN, LSTM, and TF-IDF baselines on the updated dataset? It appears that the previous dataset has a different format from the updated one, so the code cannot be used directly.
I thought the format was the same? Can you post some samples of it?
@npow For the original dataset downloaded from http://cs.mcgill.ca/~npow1/data/, each line has three columns (context, response, flag), with the turns in the context separated by EOS. An example:

Column 1 (context): hello , what is the command to install a . deb file ? .. EOS sudo dpkg -i < file >
Column 2 (response): pfifo thanks ;)
Column 3 (flag): 1
For the updated dataset generated from https://github.com/rkadlec/ubuntu-ranking-dataset-creator: train.csv also has three columns. However, the dialogue seems to be unprocessed, with raw `__eou__` (end-of-utterance) and `__eot__` (end-of-turn) markers left in. An example:

Column 1 (context): i think we could import the old comments via rsync, but from there we need to go via email. I think it is easier than caching the status on each bug and than import bits here and there __eou__ __eot__ it would be very easy to keep a hash db of message-ids __eou__ sounds good __eou__ __eot__ ok __eou__ perhaps we can ship an ad-hoc apt_prefereces __eou__ __eot__ version? __eou__ __eot__ thanks __eou__ __eot__ not yet __eou__ it is covered by your insurance? __eou__ __eot__ yes __eou__ but it's really not the right time :/ __eou__ with a changing house upcoming in 3 weeks __eou__ __eot__ you will be moving into your house soon? __eou__ posted a message recently which explains what to do if the autoconfiguration does not do what you expect __eou__ __eot__ how urgent is #896? __eou__ __eot__ not particularly urgent, but a policy violation __eou__ __eot__ i agree that we should kill the -novtswitch __eou__ __eot__ ok __eou__ __eot__ would you consider a package split a feature? __eou__ __eot__ context? __eou__ __eot__ splitting xfonts out of xfree86. one upload for the rest of the life and that's it __eou__ __eot__ splitting the source package you mean? __eou__ __eot__ yes. same binary packages. __eou__ __eot__ I would prefer to avoid it at this stage. this is something that has gone into XSF svn, I assume? __eou__ __eot__
Column 2 (response): basically each xfree86 upload will NOT force users to upgrade 100Mb of fonts for nothing __eou__ no something i did in my spare time. __eou__
Column 3 (flag): 1
test.csv and valid.csv have a different format again: the context in column 1, the ground-truth response in column 2, and 9 wrong responses (distractors) in columns 3-11. An example:

Column 1 (context): im trying to use ubuntu on my macbook pro retina __eou__ i read in the forums that ubuntu has a apple version now? __eou__ __eot__ not that ive ever heard of.. normal ubutnu should work on an intel based mac. there is the PPC version also. __eou__ you want total control? or what are you wanting exactly? __eou__ __eot__
Column 2 (ground-truth response): just wondering how it runs __eou__
Column 3 (distractor): yes, that's what I did, exported it to a "id_dsa" file, then back to Ubuntu copied it into ~/.ssh/ __eou__
Columns 4-11: further distractors.
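For anyone inspecting the generated files, the new format can be parsed with the standard `csv` module. A minimal sketch, assuming the `__eou__`/`__eot__` markers described above and a header row in the generated CSVs (function names are mine):

```python
import csv
import io

def split_turns(context):
    """Split a context string into turns on __eot__, then utterances on __eou__."""
    turns = [t.strip() for t in context.split("__eot__") if t.strip()]
    return [[u.strip() for u in t.split("__eou__") if u.strip()] for t in turns]

def read_train_rows(f):
    """Yield (turns, response, label) triples from a train.csv file object."""
    reader = csv.reader(f)
    next(reader)  # skip the header row (assumed present)
    for row in reader:
        context, response, label = row[0], row[1], row[2]
        yield split_turns(context), response, label

# Tiny inline demo with a made-up row:
sample = io.StringIO(
    'Context,Utterance,Label\n'
    '"hi there __eou__ can you help ? __eou__ __eot__ sure __eou__ __eot__",thanks __eou__,1\n'
)
for turns, response, label in read_train_rows(sample):
    print(turns, response, label)
```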
By the way, I am able to run the code using original dataset.
Ah ok, it's not a big difference. You need to either update the code to support the new format, or transform the new format to work with this repo.
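If you go the transformation route, a minimal sketch of the second option, assuming the only change needed is dropping the `__eou__` markers and replacing the end-of-turn markers with the EOS separator the old files use (no NER/entity replacement is attempted here):

```python
import re

def to_old_format(context):
    """Rewrite a new-format context into the old EOS-separated layout:
    drop __eou__ markers and turn __eot__ markers into EOS separators."""
    text = context.replace("__eou__", " ")
    text = re.sub(r"\s*__eot__\s*", " EOS ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # a trailing __eot__ would leave a dangling EOS; strip it
    if text.endswith("EOS"):
        text = text[: -len("EOS")].strip()
    return text

# Example: to_old_format("hi __eou__ ok __eou__ __eot__ sure __eou__ __eot__")
```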
Yeah, the new format doesn't perform NER on the tokens.
When I run the TF-IDF script,

`python tfidf.py`

I get the error: `IOError: [Errno 2] No such file or directory: '../data/valset.csv'`
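That error just means the file name the script looks for does not exist under `../data/`. One workaround is to copy the generated file to the expected name; a minimal sketch, assuming the generator scripts produced `valid.csv` (the source name here is an assumption, check your own layout):

```python
import os
import shutil

def link_expected_name(data_dir, src_name="valid.csv", dst_name="valset.csv"):
    """Copy the generated CSV to the file name the script expects."""
    src = os.path.join(data_dir, src_name)
    dst = os.path.join(data_dir, dst_name)
    if os.path.exists(src) and not os.path.exists(dst):
        shutil.copy(src, dst)
    return dst

# Usage: link_expected_name("../data") before running `python tfidf.py`
```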