Inconsistencies for what is allowed in a dataset - Githubissues

srlearn / datasets

srlearn-compatible relational datasets

https://srlearn.github.io/relational-datasets/

MIT License

Inconsistencies for what is allowed in a dataset #5

Closed: hayesall closed this issue 3 years ago

hayesall commented 3 years ago

I usually assume examples and facts should look like this:

example(one).
example(one,two).

Multiple places in the data violate this:

Blank lines:

https://github.com/srlearn/datasets/blob/084197b2d50f2d8f5674d29867a634ff9fccbe71/srlearn/uwcse/uwcse/fold2/train/train_facts.txt#L1-L4

https://github.com/srlearn/datasets/blob/084197b2d50f2d8f5674d29867a634ff9fccbe71/srlearn/uwcse/uwcse/fold3/train/train_facts.txt#L731-L734

https://github.com/srlearn/datasets/blob/084197b2d50f2d8f5674d29867a634ff9fccbe71/srlearn/uwcse/uwcse/fold3/train/train_facts.txt#L1181-L1183

Furthermore, these files should probably be normalized to remove the spaces after commas and to fix other inconsistencies.
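A rough sketch of what that normalization could look like, not the fix that was actually applied. The function name `normalize_lines` is hypothetical, and the sketch assumes no fact contains a quoted string with commas, parentheses, or meaningful whitespace inside it:

```python
import re


def normalize_lines(lines):
    """Drop blank lines and remove whitespace around commas and parentheses.

    Hypothetical helper: assumes arguments are plain tokens, not quoted strings.
    """
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # Skip blank (or whitespace-only) lines entirely.
            continue
        # Collapse "one, two" -> "one,two" and "( one )" -> "(one)".
        stripped = re.sub(r"\s*([(),])\s*", r"\1", stripped)
        cleaned.append(stripped)
    return cleaned
```

For example, `normalize_lines(["", "example(one, two)."])` returns `["example(one,two)."]`.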

SRLBoost and BoostSRL derivatives allow quite a few additional symbols in the grammar (including % comments and //- comments).
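One way to count lines that fall outside the strict `example(one,two).` form is sketched below. The regular expression is an assumption about the "usual" shape described above, not the grammar SRLBoost or BoostSRL actually accept, and `find_violations` is a hypothetical name:

```python
import re

# Assumed strict form: name(arg). or name(arg1,arg2,...). with no spaces.
STRICT_FACT = re.compile(r"^[a-z]\w*\(\w+(,\w+)*\)\.$")


def find_violations(lines):
    """Return (line_number, text) pairs for lines not in the assumed strict form."""
    violations = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not STRICT_FACT.match(text):
            # Blank lines, comments, and spaced commas all end up here.
            violations.append((number, text))
    return violations
```

Under this pattern, blank lines, `%` comments, `//-` expressions, and facts with a space after a comma are all reported, which is roughly the kind of per-fold count listed below.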

uwcse:

[x] uwcse - fold 2: 1 error
[x] uwcse - fold 3: 4 errors
[x] uwcse - fold 4: 4 errors
[x] uwcse - fold 5: 4 errors

citeseer:

[x] citeseer: fold 1: 36,803 errors (including expression '//-3405')
[x] citeseer: fold 2: 34,855 errors ('//-0001', '//-0002', ... '//-3406')
[x] citeseer: fold 3: 30,531 errors
[x] citeseer: fold 4: 33,272 errors

cora:

[x] cora: fold 1: 40 errors
[x] cora: fold 2: 40 errors
[x] cora: fold 3: 40 errors
[x] cora: fold 4: 40 errors
[x] cora: fold 5: 40 errors

hayesall commented 3 years ago

Fixed in #10