ufal / perin

PERIN is a Permutation-Invariant Semantic Parser developed for MRP 2020

Solving SAT takes forever #20

Open hankcs opened 2 years ago

hankcs commented 2 years ago

Dear authors,

Thank you for releasing your wonderful code; it really helped my understanding of your paper. If you don't mind, I have a question regarding data preprocessing: solving SAT takes forever with the base_amr.yaml config.

Console logs:

Loading the cached dataset
Max number of permutations to resolve assignment ambiguity: 165198623617843200000
... reduced to 2048 permutations with max of 24 greedily resolved assignments
0 erroneously matched sentences with companion

57274 sentences in the train split
3460 sentences in the validation split
2457 sentences in the test split
789678 nodes in the train split
properties:  ['transformed']
Edge frequency: 5.17 %
4319 words in the relative label vocabulary
114 words in the edge label vocabulary
242 characters in the vocabulary
Caching the dataset...

0 erroneously matched sentences with companion
Generating possible rules using 4 CPUs...
Solving SAT...

It has been hanging on this line for days. I'm using a server with powerful CPUs (Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz) and hundreds of GB of memory.

davda54 commented 2 years ago

Hi, thanks for using PERIN! Generally, solving the SAT problem can take a couple of hours and it’s quite possible that the algorithm will not be able to find any solution in a reasonable time for a custom dataset. The SAT heuristics can be very unpredictable. Are you using the official AMR dataset from MRP2020 or some other one?

For this exact reason, you can also use a greedy search that finds a suboptimal solution. Just call the function get_smallest_rule_set with approximate=True and it will run a faster algorithm. The solution will not be as good, but it shouldn't lead to any significant performance drop.
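
To illustrate why a greedy search finishes quickly where an exact solver may not, here is a minimal sketch of a greedy approximation for a generic minimum-cover problem. It is not PERIN's code: the function greedy_cover, the toy rules dictionary, and the framing of rule selection as a set cover are all assumptions made for illustration, and the actual behaviour of get_smallest_rule_set with approximate=True may differ.

```python
from typing import Dict, Hashable, List, Set


def greedy_cover(
    candidates: Dict[Hashable, Set[Hashable]],
    universe: Set[Hashable],
) -> List[Hashable]:
    """Pick candidates one at a time, always taking the one that covers
    the most still-uncovered items. Hypothetical helper, not part of PERIN."""
    uncovered = set(universe)
    chosen: List[Hashable] = []
    while uncovered:
        # Candidate with the largest overlap with what is still uncovered.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("some items cannot be covered by any candidate")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy data (hypothetical): each "rule" covers a set of item ids.
rules = {
    "r1": {1, 2, 3},
    "r2": {3, 4},
    "r3": {4, 5},
    "r4": {1, 2, 3, 4},
}
print(greedy_cover(rules, {1, 2, 3, 4, 5}))  # ['r4', 'r3']
```

Each iteration of the loop covers at least one new item, so the search always terminates after at most as many rounds as there are items, but the chosen set is not guaranteed to be the smallest possible one. An exact search has no comparable bound on its running time.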

hankcs commented 2 years ago

Thanks for your prompt reply. I'm using MRP2020_Train_Dev-2020CoNLL_CFMRP_LDC2020E05.tgz from LDC, which might not be exactly the same as the one used in the MRP2020 competition. Maybe split_dataset.sh creates a random split too?

I'll try approximate=True and other solvers.