probcomp / hierarchical-irm

Probabilistic structure discovery for rich relational systems
Apache License 2.0
2 stars 1 forks source link

pclean integration tests are way too slow #146

Closed ThomasColthurst closed 1 month ago

ThomasColthurst commented 1 month ago

Even when running on just 100 line csv files.

Preliminary debugging suggests that the slow down happens when incorporating data from string fields. (But not string_cat fields).

ThomasColthurst commented 1 month ago

The slowness happens in incorporate_observations_relation for string attributes when h_irm->incorporate is called on the BigramStringEmission.

It doesn't happen for string_cat attributes, which is interesting because they also use the BigramStringEmission. So for example for the flights data, incorporating observations for "src" (which is string_cat) takes a fraction of a second, but incorporating a single value for the flight relation (which is string) takes 1-2 seconds.

ThomasColthurst commented 1 month ago

Further debugging reveals that all the time is being spent in computing topk_alignments between long-but-nonsense latent string values like U@ksQ^5y|BTBpe)e#zz_84]kWu>qkTKpMs)5~:hlrz"H"+Fe'3Zm6P6q;(8+Hx!QjwOS/n.es/eyD3KpNIQFT:90JlEOI@X|K?}bCQ<UxtDf]O,RX>r^*#kQ1XI>TrL4AD>]mG!SyZ^ qwjkB/='-(1kY$8>C3Nm\s;^iiAfPO?E[iR4'/hmKyHzg and the "dirty" observed value like "AA-1640-MIA-MCO".

ThomasColthurst commented 1 month ago

So we have a bunch of options here:

1) Reduce the number of alignments considered when incorporating a BigramStringEmission down from 10 to 2. This gives an immediate 5x speed-up at probably very very little loss of quality.

2) Do a better job of initializing string latent values.

3) Implement https://github.com/probcomp/hierarchical-irm/issues/118, which would provide training data for the BigramString distribution and make the sampled latent value shorter.

4) Add a default maximum string length (which can be overridden via a distribution option). The existing pclean program has something like this.

I think I'm going to go with option 1 for now, but it is worth keeping the others in mind.

ThomasColthurst commented 1 month ago

Update: the pclean integrations tests are still too slow, even after changing the number of alignments considered to 2.

The new culprit is inference; specifically the transitioning latent values step. When run on a 10 line version of the flights data, transition_latent_values on the relation "Time:time" takes over 30 seconds.

ThomasColthurst commented 1 month ago

We probably will want to speed up inference more at some point, but for now compiling pclean with -c opt and running it on 10 line CSV files makes it fast enough for the integration tests.