hsqmlzno1 / HATN

Hierarchical Attention Transfer Network for Cross-domain Sentiment Classification (AAAI'18)
MIT License

Questions about the datasets you used #10

Open Flitternie opened 5 years ago

Flitternie commented 5 years ago

Great work! May I know the source of the dataset used in your paper and in this repo? I noticed that you cited Blitzer et al.'s ACL 2007 work as the source of the dataset in your paper, but their original dataset has only 2,000 labeled data points (reviews) in total per domain, while yours has 6,000. May I know how you augmented the original dataset, or whether there are any other datasets you used in your work? Looking forward to your reply. Thanks.

Flitter

hsqmlzno1 commented 5 years ago

The data is from Blitzer et al.'s work. I did some random sampling from the unprocessed version (unprocessed.tar.gz) of the Amazon reviews dataset (http://www.cs.jhu.edu/~mdredze/datasets/sentiment/). Please note that the unlabeled data still carries labels.

You can also try the original small-scale data; it works well too. But please refer to the tips for the small setting in the README.
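For anyone trying to reproduce this sampling step, here is a minimal sketch of how one might parse the unprocessed Blitzer files and draw a balanced labeled sample. It assumes the pseudo-XML layout of the `unprocessed.tar.gz` files (one `<review>...</review>` block per review, with `<rating>` and `<review_text>` fields); the function names are illustrative, not from this repo, and the exact sampling procedure the author used may differ.

```python
# Hedged sketch (not the author's script): parse unprocessed Amazon
# review files and sample a balanced labeled subset per domain.
import random
import re

def parse_reviews(text):
    """Extract (rating, review_text) pairs from an unprocessed .review file."""
    reviews = []
    for block in re.findall(r"<review>(.*?)</review>", text, re.DOTALL):
        rating = re.search(r"<rating>\s*(\S+)\s*</rating>", block)
        body = re.search(r"<review_text>(.*?)</review_text>", block, re.DOTALL)
        if rating and body:
            reviews.append((float(rating.group(1)), body.group(1).strip()))
    return reviews

def sample_labeled(reviews, n_per_class, seed=0):
    """Binarize by rating (>3 positive, <3 negative; 3 discarded as
    neutral, following the common convention for this dataset) and
    randomly sample n_per_class reviews from each class."""
    pos = [(text, 1) for rating, text in reviews if rating > 3]
    neg = [(text, 0) for rating, text in reviews if rating < 3]
    rng = random.Random(seed)
    return rng.sample(pos, n_per_class) + rng.sample(neg, n_per_class)
```

With `n_per_class=3000` per domain this would yield the 6,000 labeled reviews per domain discussed above, assuming enough reviews of each polarity exist in the file.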

hsqmlzno1 commented 5 years ago

The reason for using a larger dataset is that unsupervised domain adaptation (UDA) assumes there exists a large amount of labeled data in the source domain. So I think the previous setting, which uses only 2,000 labeled examples, does not match that assumption.

Flitternie commented 5 years ago

Thanks for your reply. May I know whether you have uploaded the .py file you used to preprocess the original unprocessed dataset? I am currently trying to reproduce your results, and I want to use the original small dataset so I can better compare your work with others'. May I know how you preprocessed the data in ./raw_data into the format of ./data?

hsqmlzno1 commented 5 years ago

Done