keavil / AAAI18-code

The code of AAAI18 paper "Learning Structured Representation for Text Classification via Reinforcement Learning".
215 stars 81 forks

Could you release the training and test data? #1

Closed iamsile closed 6 years ago

iamsile commented 6 years ago

Hello, could you release the training and test data?

Thanks!

keavil commented 6 years ago

The data files are too large to upload to GitHub. Please give me your email address.

XSilverBullet commented 6 years ago

Could you share the train/test data? My email is weisun_@outlook.com

iamsile commented 6 years ago

Here is my email: 416566462@qq.com. Could you please send the data to me? Thanks in advance.

iamsile commented 6 years ago

@keavil Hello, would you have time to send me the data sometime soon? Thanks in advance. My email is 416566462@qq.com

diaodiao1987 commented 6 years ago

Hello, when you have time, could you send me a copy of the data as well? Many thanks. My email is 542275403@qq.com

keavil commented 6 years ago

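Sending the data out one person at a time is too slow, so I have uploaded a copy of the AGnews dataset that I used. The other datasets are in a similar format. https://drive.google.com/open?id=1becf7pzfuLL7qgWqv4q-TyDYjSzodWfR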

iamsile commented 6 years ago

@keavil Many thanks, I have downloaded it.

Hello @keavil, I looked at the dataset and am a bit confused, so I would like to ask you something.

In your dataset, every record has the following format: {"rating": 3, "depth": 1, "children": [{"depth": 3, "children": [{"depth": 4, "word": "Dozens"}, {"depth": 4, "children": [{"depth": 5, "word": "of"}, {"depth": 5, "children": [{"depth": 6, "word": "government"}, {"depth": 6, "word": "proposals"}]}]}]}, {"depth": 3, "children": [{"depth": 4, "children": [{"depth": 4, "word": "arise"}, {"depth": 4, "children": [{"depth": 4, "children": [{"depth": 5, "word": ","}, {"depth": 5, "word": "but"}]}, {"depth": 4, "children": [{"depth": 4, "word": "never"}, {"depth": 4, "children": [{"depth": 5, "word": "get"}, {"depth": 5, "children": [{"depth": 5, "word": "off"}, {"depth": 5, "children": [{"depth": 6, "word": "the"}, {"depth": 6, "word": "ground"}]}]}]}]}]}]}, {"depth": 4, "word": "."}]}]}

Extracting only the "word" fields gives "Dozens of government proposals arise , but never get off the ground ."
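For readers who want to check this themselves, here is a minimal Python sketch of the extraction described above. It walks the nested "children" lists and collects the "word" leaves from left to right, ignoring "depth". The function name `leaf_words` and the shortened record are made up for illustration; they are not taken from the repository.

```python
import json

def leaf_words(node):
    """Return the words at the leaves of a tree node, left to right."""
    if "word" in node:
        return [node["word"]]
    words = []
    for child in node.get("children", []):
        words.extend(leaf_words(child))
    return words

# A shortened, hand-written record in the same shape as the example above.
record = json.loads(
    '{"rating": 3, "depth": 1, "children": ['
    '{"depth": 3, "children": ['
    '{"depth": 4, "word": "Dozens"}, {"depth": 4, "word": "arise"}]}, '
    '{"depth": 4, "word": "."}]}'
)

sentence = " ".join(leaf_words(record))
print(record["rating"], sentence)
```

Applied to the full record quoted above, the same function should give back the tokenized sentence, with punctuation as separate tokens.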

In the original AGnews dataset, however, the records all look like this: "4","Study shows governments' open-source embrace","Dozens of government proposals arise, but never get off the ground."

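I would like to ask how you converted the original dataset into the current format. I really do not understand how the "depth" field was produced. Was the structure in the dataset obtained through syntactic parsing?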

I hope you can answer when you have time. Thank you very much!

keavil commented 6 years ago

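@iamsile The depth field is probably a leftover from something we tried earlier; the code should not use it. The structure is just an ordinary syntactic parse tree.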
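The thread does not say which parser or conversion script was used, so the following is only a sketch of one possible route: take a bracketed constituency parse (the Penn Treebank style string that many parsers can print) and turn it into the nested "children"/"word" layout, leaving out "depth". The function names, the hand-written parse string, and the choice to collapse single-child nodes are all assumptions. The `label - 1` rating is inferred only from the single example in this thread, where CSV label "4" appears as rating 3.

```python
import json

def tokenize(parse):
    """Split a bracketed parse string into "(", ")" and symbol tokens."""
    return parse.replace("(", " ( ").replace(")", " ) ").split()

def read_node(tokens, pos):
    """Read one "(LABEL ...)" constituent; return (node, next position)."""
    assert tokens[pos] == "("
    pos += 2  # skip "(" and the constituent label (e.g. NP, VP)
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = read_node(tokens, pos)
            children.append(child)
        else:
            children.append({"word": tokens[pos]})
            pos += 1
    pos += 1  # skip ")"
    # Assumption: collapse single-child nodes, so (NNS Dozens) becomes a leaf.
    node = children[0] if len(children) == 1 else {"children": children}
    return node, pos

def parse_to_record(parse, rating):
    """Build a record in the nested layout, without the unused "depth"."""
    node, _ = read_node(tokenize(parse), 0)
    if "word" in node:  # one-word sentence: keep a "children" list at the root
        node = {"children": [node]}
    return {"rating": rating, **node}

# Hand-written parse for illustration, not real parser output.
parse = "(ROOT (S (NP (NNS Dozens)) (VP (VBP arise)) (. .)))"
record = parse_to_record(parse, rating=int("4") - 1)  # zero-based: assumption
print(json.dumps(record))
```

The trees in the uploaded file look mostly binary, so the original pipeline may also have binarized the parse; this sketch does not attempt that.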

iamsile commented 6 years ago

Many thanks.

diaodiao1987 commented 6 years ago

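Hello, I tried to reproduce the experiment using your code and the AG data you provided. AG is a 4-class task (4 topics). My parameters match the settings in your code (lr: 0.0005, mini-batch size: 5, dropout: 0.5, optimizer: Adam, tau: 0.1, r: 0.1*k, k: 4), and the word vectors are 300-dimensional GloVe, but my best accuracy is only 80%, whereas the paper reports 92.5%. What could be the reason? Which setting is wrong? Thank you very much. [image]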

keavil commented 6 years ago

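@diaodiao1987 Hello. The LSTM pretraining stage needs a different learning rate from the later stages; the pretraining lr should be around 0.01. You can see that in your run the pretraining stage barely improved accuracy, whereas LSTM pretraining on its own should already reach a fairly high result.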
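As a rough illustration of this advice, here is a small sketch of a two-phase learning-rate choice. It is not the repository's code: the names and the number of pretraining epochs are hypothetical, and only the two values come from this thread (about 0.01 for pretraining, 0.0005 from the reproduction settings).

```python
# Hypothetical names; only the two values come from the discussion above.
PRETRAIN_LR = 0.01   # "around 0.01" for LSTM pretraining, per the reply above
JOINT_LR = 0.0005    # the lr quoted in the reproduction settings

def learning_rate(epoch, pretrain_epochs):
    """Pick the learning rate for an epoch in a two-phase schedule."""
    return PRETRAIN_LR if epoch < pretrain_epochs else JOINT_LR

# pretrain_epochs=2 is an arbitrary value chosen for the example.
schedule = [learning_rate(e, pretrain_epochs=2) for e in range(4)]
print(schedule)
```

In a real training script the returned value would be passed to the optimizer for the corresponding phase; how the repository wires this up is not shown in the thread.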

XSilverBullet commented 6 years ago

I have been running it on a 1080Ti GPU for three days and it still has not finished. What could be the reason?

keavil commented 6 years ago

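@XSilverBullet For reference, training on this dataset takes me about 2 days on a Titan. The code is written in TensorFlow and is not implemented very well; it does not make good use of the GPU.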

skythebug commented 4 years ago

How do you process the data to generate the parse trees? Would you be able to share the code? Many thanks.

ludi1027 commented 2 years ago

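@diaodiao1987 Hello, have you managed to reproduce the results? I have a few questions I would like to ask you. Could you leave some contact information?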