Could you release the training and test data?

iamsile commented 6 years ago

Hello, could you release the training and test data?

Thanks!

keavil commented 6 years ago

The data files are too large to upload to GitHub. Please give me your email address.

XSilverBullet commented 6 years ago

Could you share the train and test data? My email is weisun_@outlook.com

iamsile commented 6 years ago

Here is my email: 416566462@qq.com. Could you please send it to me? Thanks in advance.

iamsile commented 6 years ago

@keavil Hello, would you have time soon to send me the data? Thanks in advance. My email is 416566462@qq.com

diaodiao1987 commented 6 years ago

Hello, when you have time, could you send me a copy of the data as well? Many thanks. My email is 542275403@qq.com

keavil commented 6 years ago

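Sending the data to people one by one is too slow, so I have uploaded a copy of the AGnews dataset I used here. The other datasets have a similar format. https://drive.google.com/open?id=1becf7pzfuLL7qgWqv4q-TyDYjSzodWfR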

iamsile commented 6 years ago

@keavil Thanks a lot, I have downloaded it. Hello @keavil, I looked at the dataset and have a few questions I would like to ask you.

In your dataset, every record has the following format: {"rating": 3, "depth": 1, "children": [{"depth": 3, "children": [{"depth": 4, "word": "Dozens"}, {"depth": 4, "children": [{"depth": 5, "word": "of"}, {"depth": 5, "children": [{"depth": 6, "word": "government"}, {"depth": 6, "word": "proposals"}]}]}]}, {"depth": 3, "children": [{"depth": 4, "children": [{"depth": 4, "word": "arise"}, {"depth": 4, "children": [{"depth": 4, "children": [{"depth": 5, "word": ","}, {"depth": 5, "word": "but"}]}, {"depth": 4, "children": [{"depth": 4, "word": "never"}, {"depth": 4, "children": [{"depth": 5, "word": "get"}, {"depth": 5, "children": [{"depth": 5, "word": "off"}, {"depth": 5, "children": [{"depth": 6, "word": "the"}, {"depth": 6, "word": "ground"}]}]}]}]}]}]}, {"depth": 4, "word": "."}]}]}

Extracting only the "word" fields gives "Dozens of government proposals arise , but never get off the ground ."

In the original AGnews dataset, however, the records look like this: "4","Study shows governments' open-source embrace","Dozens of government proposals arise, but never get off the ground."

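How did you convert the original dataset into the current format? I really do not understand how the "depth" field was produced. Is the structure in the dataset the result of syntactic parsing?

I would appreciate an answer when you have time. Thank you very much!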

keavil commented 6 years ago

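@iamsile The depth field is probably a leftover from something I tried earlier, and the code should not be using it. The structure is just an ordinary syntactic parse tree.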
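For readers following along, here is a minimal sketch of reading one record in the format quoted earlier in this thread and recovering the sentence from its leaves. It ignores "depth", in line with the reply above. The record below is a shortened, hand-edited excerpt of the quoted example, and the helper name leaf_words is hypothetical, not something taken from the repository.

```python
import json


def leaf_words(node):
    """Return the "word" values of a tree in left-to-right order."""
    if "word" in node:
        # Leaf node: carries a single token.
        return [node["word"]]
    words = []
    # Internal node: visit children in order; "depth" is ignored.
    for child in node.get("children", []):
        words.extend(leaf_words(child))
    return words


# Shortened, hand-edited excerpt of the record quoted in this thread.
raw = """
{"rating": 3, "depth": 1, "children": [
  {"depth": 3, "children": [
    {"depth": 4, "word": "Dozens"},
    {"depth": 4, "children": [
      {"depth": 5, "word": "of"},
      {"depth": 5, "children": [
        {"depth": 6, "word": "government"},
        {"depth": 6, "word": "proposals"}]}]}]},
  {"depth": 4, "word": "."}]}
"""

record = json.loads(raw)
sentence = " ".join(leaf_words(record))
label = record["rating"]
```

The recursion only relies on the "word" and "children" keys, so it would keep working on records that omit "depth".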

iamsile commented 6 years ago

Thanks a lot.

diaodiao1987 commented 6 years ago

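Hello, I tried to reproduce the experiment using your code and the AG data you provided. AG is a 4-class task (four topics). My parameters match the settings in your code (lr: 0.0005, mini-batch size: 5, dropout: 0.5, optimizer: Adam, tau: 0.1, r: 0.1*k, k: 4), and the word vectors are 300-dimensional GloVe, but my best accuracy is only 80%, whereas the paper reports 92.5%. What could be the cause? Which setting is wrong? Thank you very much.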

keavil commented 6 years ago

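@diaodiao1987 Hello. The LSTM pretraining stage needs a different learning rate from the later stages: the pretraining lr should be set to around 0.01. You can see that in your run the accuracy barely improved during pretraining, whereas LSTM pretraining alone should already reach fairly high accuracy.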
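As a rough illustration of the two-stage setting described above, the sketch below picks a learning rate by training phase. The value 0.01 is the approximate pretraining lr from the reply, and 0.0005 is the lr reported in the reproduction attempt. The function name, the epoch-based switch, and the pretrain_epochs parameter are assumptions made for illustration; the thread does not say how the repository switches between phases.

```python
PRETRAIN_LR = 0.01   # "around 0.01" for LSTM pretraining, per the reply above
MAIN_LR = 0.0005     # lr reported in the reproduction attempt


def learning_rate(epoch, pretrain_epochs):
    """Return the lr for a 0-based epoch index.

    pretrain_epochs is a hypothetical knob: the number of epochs
    spent on LSTM pretraining before the main training stage.
    """
    if epoch < pretrain_epochs:
        return PRETRAIN_LR
    return MAIN_LR


# Example schedule: 2 pretraining epochs followed by 3 main epochs.
schedule = [learning_rate(e, pretrain_epochs=2) for e in range(5)]
```

The resulting value would then be passed to whatever optimizer the training script builds for each phase.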

XSilverBullet commented 6 years ago

I have been running it on a 1080Ti GPU for three days and it still has not finished. What could be the reason?

keavil commented 6 years ago

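@XSilverBullet For reference, training on this dataset takes me about 2 days on a Titan. The code is written in TensorFlow and is not implemented very well, so it does not make good use of the GPU.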

skythebug commented 4 years ago

How do you process the data to generate the syntactic trees? Would you mind sharing the code? Many thanks.
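The thread does not say which parser or conversion script was used, so the following is only a sketch under stated assumptions. It assumes you already have a labeled, bracketed constituency parse (for example "(S (NP ...) ...)") from a parser of your choice, and converts it into the nested "children" / "word" layout shown in the quoted record. The function names are hypothetical. Setting "rating" to the raw label minus 1 is inferred from the single example in this thread (raw label "4", rating 3) and may not hold in general. "depth" is omitted because the author says it is unused, and the trees are not binarized here, although the quoted example appears to be binary.

```python
import json


def parse_bracketed(text):
    """Convert a labeled bracketed parse into nested dicts.

    Assumes every "(" is followed by a node label; labels are dropped.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            word = tokens[pos]          # bare token: a leaf word
            pos += 1
            return {"word": word}
        pos += 2                        # skip "(" and the node label
        children = []
        while tokens[pos] != ")":
            children.append(read_node())
        pos += 1                        # skip ")"
        if len(children) == 1:
            return children[0]          # collapse unary chains such as (DT the)
        return {"children": children}

    return read_node()


def make_record(raw_label, bracketed):
    """Build one record; rating = raw_label - 1 is an assumption."""
    record = dict(parse_bracketed(bracketed))
    record["rating"] = raw_label - 1
    return record


example = make_record(4, "(S (NP (DT the) (NN ground)) (. .))")
line = json.dumps(example)  # one JSON object per line, as an assumed file layout
```

If the real pipeline binarized the trees or used a different label mapping, those steps would need to be added on top of this.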

ludi1027 commented 2 years ago

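@diaodiao1987 Hello, have you managed to reproduce the results? I have a few questions I would like to ask you. Could you leave some contact information?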

keavil / AAAI18-code

Could you release the training and test data? #1