semio / ddf_utils

Utilities for working with DDF datasets
https://open-numbers.github.io/
MIT License

EntityDomain.add_entity() slow #119

Open miroli opened 4 years ago

miroli commented 4 years ago

We've run into some performance issues when running ddf_utils.package.create_datapackage(). We have some files with hundreds of thousands of entities, and running this function takes a very long time in those cases.

After some profiling, it turns out that the culprit is EntityDomain.add_entity() in ddf_utils.model.ddf, which, as I understand it, loops through all rows in entity files and runs some identity checks. Would it be possible to vectorize that loop?
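
To illustrate what a vectorized check could look like, here is a rough sketch in pandas. It is not the ddf_utils implementation: the exact checks inside add_entity() are not shown in this thread, so the sketch assumes, purely for illustration, that the check is "no entity id may appear with two different sets of property values". The table, the column names and the helper `conflicting_ids` are all hypothetical.

```python
import pandas as pd

# Hypothetical entity table; the column names are illustrative only.
entities = pd.DataFrame({
    "geo": ["swe", "nor", "swe", "fin", "nor"],
    "name": ["Sweden", "Norway", "Sweden", "Finland", "Norge"],
})


def conflicting_ids(df, id_col):
    """Return ids that occur with more than one distinct set of values.

    Assumed check (not taken from ddf_utils): an id may be repeated only
    if every repeated row is identical.
    """
    # Exact duplicate rows are harmless, so collapse them first.
    unique_rows = df.drop_duplicates()
    # Any id that still occurs more than once has conflicting values.
    dup_mask = unique_rows[id_col].duplicated(keep=False)
    return sorted(unique_rows.loc[dup_mask, id_col].unique())


conflicts = conflicting_ids(entities, "geo")
print(conflicts)
```

The point is that the whole column is checked in a couple of pandas calls instead of one Python-level call per entity, which is usually where the time goes with hundreds of thousands of rows.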

semio commented 4 years ago

Yes, you are right: calling add_entity for a lot of entities is expensive. I think it's possible to avoid calling add_entity one by one. I will improve the code soon.

semio commented 4 years ago

@miroli I updated the process for loading entity domains, and I tested the create_datapackage function against a dataset with 1,000,000 entities and it can create the datapackage in 12 minutes.

Could you test the master branch against your dataset? If it's not convenient for you to install from source I will make a release for you.
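
For anyone who wants to try a similar stress test, a large entity file can be generated with the standard library alone. This is only a sketch and not the dataset used above: the helper `write_synthetic_entities`, the `geo` domain and the `name` column are made up for illustration, and the file name assumes the usual DDF `ddf--entities--<domain>.csv` convention. A complete dataset needs more than this one file (a concepts file, for instance) before create_datapackage can run on it.

```python
import csv
import tempfile
from pathlib import Path


def write_synthetic_entities(folder, domain="geo", n=1_000_000):
    """Write n made-up entities to ddf--entities--<domain>.csv in folder."""
    path = Path(folder) / f"ddf--entities--{domain}.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header: the id column is named after the domain (assumed layout).
        writer.writerow([domain, "name"])
        # Stream the rows so the full table is never held in memory.
        writer.writerows((f"e{i}", f"Entity {i}") for i in range(n))
    return path


# Small run to show usage; raise n to stress-test.
with tempfile.TemporaryDirectory() as tmp:
    sample = write_synthetic_entities(tmp, n=1000)
    with sample.open(newline="", encoding="utf-8") as f:
        row_count = sum(1 for _ in csv.reader(f)) - 1  # minus header row

print(row_count)
```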

miroli commented 4 years ago

That's great news! If you could make a release, that would be even greater as installing from source is tricky with our current setup.

semio commented 4 years ago

OK, v1.0.6 is ready. Please give it a try.

miroli commented 4 years ago

It's much better now, thanks!