mattjj / pyhsmm

MIT License

better parallel interface #16

Closed mattjj closed 11 years ago

mattjj commented 11 years ago

One should only need to do this after starting engines:

import pyhsmm.parallel # creates client and stuff

# either this
for data in datas:
    model.add_data_parallel(data)

# or this, which could be greedy or smart
model.add_datas_parallel(datas)

for i in progprint_xrange(1000):
    model.resample_model_parallel()

Each sequence should be sent to exactly one engine (assignments balanced greedily by sequence length, assuming all engines run at roughly the same speed), and each engine resamples only its assigned data. If data added to the model with add_data already exists on some engine (checked via a hash of the data), don't broadcast it; if it doesn't exist on any engine, send it to one.
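A minimal sketch of the greedy balancing and hash check described above (not pyhsmm's actual implementation; `assign_sequences` and `data_hash` are hypothetical helpers):

```python
import hashlib
import heapq

def assign_sequences(datas, n_engines):
    """Greedy longest-processing-time balancing: visit sequences in
    decreasing order of length and give each one to the engine with
    the smallest total assigned length so far."""
    # min-heap of (total_assigned_length, engine_id)
    heap = [(0, e) for e in range(n_engines)]
    heapq.heapify(heap)
    assignments = {e: [] for e in range(n_engines)}
    for idx in sorted(range(len(datas)), key=lambda i: -len(datas[i])):
        total, engine = heapq.heappop(heap)
        assignments[engine].append(idx)
        heapq.heappush(heap, (total + len(datas[idx]), engine))
    return assignments

def data_hash(data):
    """Hash used to check whether an engine already holds a sequence,
    so add_data doesn't re-send data that is already there."""
    return hashlib.sha1(repr(data).encode('ascii')).hexdigest()
```

For example, four sequences of lengths 10, 3, 7, 5 on two engines end up split 13 vs. 12, which is as balanced as possible for these lengths.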

There should also be a dynamic load-balancing mode that preserves the current behavior: load all the data on every engine and dispatch resampling tasks dynamically. Maybe something like this:

model.broadcast_data_parallel(data)
for i in progprint_xrange(1000):
    model.resample_model_parallel_lbv()
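A rough sketch of this dynamic mode (a simulation, not pyhsmm's implementation: engines are stood in for by threads, and the hypothetical `resample_fn` plays the role of per-sequence resampling; the real version would use IPython's load-balanced view, as the `_lbv` suffix suggests):

```python
import queue
import threading

def resample_dynamic(datas, n_engines, resample_fn):
    """Every 'engine' holds all the data; sequence indices are handed
    out to whichever engine becomes free next, so faster engines
    naturally take on more work."""
    tasks = queue.Queue()
    for i in range(len(datas)):
        tasks.put(i)
    results = [None] * len(datas)

    def worker():
        while True:
            try:
                i = tasks.get_nowait()
            except queue.Empty:
                return  # no more sequences to resample
            results[i] = resample_fn(datas[i])

    threads = [threading.Thread(target=worker) for _ in range(n_engines)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The trade-off versus the static assignment above: dynamic dispatch tolerates engines of varying speed, but requires broadcasting all data to every engine up front.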
mattjj commented 11 years ago

Redone in bd56899.