Closed ThomasColthurst closed 1 month ago
The slowness happens in incorporate_observations_relation for string attributes when h_irm->incorporate is called on the BigramStringEmission.
It doesn't happen for string_cat attributes, which is interesting because they also use the BigramStringEmission. So for example for the flights data, incorporating observations for "src" (which is string_cat) takes a fraction of a second, but incorporating a single value for the flight relation (which is string) takes 1-2 seconds.
Further debugging reveals that all the time is being spent in computing topk_alignments between long-but-nonsense latent string values like U@ksQ^5y|BTBpe)e#zz_84]kWu>qkTKpMs)5~:hlrz"H"+Fe'3Zm6P6q;(8+Hx!QjwOS/n.es/eyD3KpNIQFT:90JlEOI@X|K?}bCQ<UxtDf]O,RX>r^*#kQ1XI>TrL4AD>]mG!SyZ^ qwjkB/='-(1kY$8>C3Nm\s;^iiAfPO?E[iR4'/hmKyHzg and the "dirty" observed value like "AA-1640-MIA-MCO".
So we have a bunch of options here:
1) Reduce the number of alignments considered when incorporating a BigramStringEmission down from 10 to 2. This gives an immediate 5x speed-up at probably very very little loss of quality.
2) Do a better job of initializing string latent values.
3) Implement https://github.com/probcomp/hierarchical-irm/issues/118, which would provide training data for the BigramString distribution and make the sampled latent value shorter.
4) Add a default maximum string length (which can be overridden via a distribution option). The existing pclean program has something like this.
I think I'm going to go with option 1 for now, but it is worth keeping the others in mind.
Update: the pclean integrations tests are still too slow, even after changing the number of alignments considered to 2.
The new culprit is inference; specifically the transitioning latent values step. When run on a 10 line version of the flights data, transition_latent_values on the relation "Time:time" takes over 30 seconds.
We probably will want to speed up inference more at some point, but for now compiling pclean with -c opt and running it on 10 line CSV files makes it fast enough for the integration tests.
Even when running on just 100 line csv files.
Preliminary debugging suggests that the slow down happens when incorporating data from string fields. (But not string_cat fields).