scrapinghub / arche

Analyze scraped data
https://arche.readthedocs.io/
MIT License

value_counts is slow for nested columns #115

Closed · manycoding closed this 5 years ago

manycoding commented 5 years ago

More data to follow. Because value_counts is slow, any big df makes report_all awfully slow.

  1. See if it can be improved
  2. If not, exclude get_categories from report_all, or add a parameter like fast=True
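A minimal sketch of point 2, assuming report_all delegates to get_categories for the value_counts-based report; the signatures and the two helper bodies here are hypothetical stand-ins, not Arche's actual implementation:

```python
import pandas as pd

def run_basic_rules(df: pd.DataFrame) -> list:
    # Stand-in for the cheap checks report_all already runs.
    return [f"{len(df)} items", f"{df.isna().sum().sum()} missing values"]

def get_categories(df: pd.DataFrame) -> dict:
    # The expensive part: value_counts on every column.
    return {col: df[col].value_counts() for col in df.columns}

def report_all(df: pd.DataFrame, fast: bool = False) -> list:
    # Hypothetical fast flag: skip the value_counts-based category
    # report entirely when fast=True.
    report = run_basic_rules(df)
    if not fast:
        report.append(get_categories(df))
    return report
```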
manycoding commented 5 years ago

Introduced in #100

manycoding commented 5 years ago

The slowest are object columns, specifically nested data. The difference is 10-1000x. @ejulio @Alexandr1988, I am thinking about alternatives:

  1. Ignore object columns
  2. Add a fast argument, so the category method won't use object columns, plus fast schema #5: Arche(..fast=True). Default False.
  3. Leave it as it is (users are expected to use specific rules, but the rule has a progress bar)
  4. ~Use flat_df - very fast, but the categories picture in this case is not the same since the data is flattened.~ - doesn't make any sense since flat data is sparse.
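A rough sketch of option 1 (ignore object columns). Note the tradeoff: plain string columns are also object dtype in pandas, so they would be skipped too. The max_uniques cutoff is a hypothetical extra, not something from the issue:

```python
import pandas as pd

def get_categories(df: pd.DataFrame, max_uniques: int = 10) -> dict:
    # Skip object columns entirely, so nested data (dicts/lists
    # stored as object dtype) never hits the slow value_counts path.
    safe = df.select_dtypes(exclude=["object"])
    counts = {}
    for col in safe.columns:
        vc = safe[col].value_counts()
        if len(vc) <= max_uniques:  # hypothetical category cutoff
            counts[col] = vc
    return counts
```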
victor-torres commented 5 years ago

If there's no cache, it seems like you're calculating value_counts twice. See: https://github.com/scrapinghub/arche/pull/100/files#r295002793

victor-torres commented 5 years ago

You could use a generator to reduce the number of calls to the function. See: https://github.com/scrapinghub/arche/pull/100/files#r295004052
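The compute-once idea above could look something like this; a sketch with hypothetical names, not the code from PR #100:

```python
from typing import Callable
import pandas as pd

def make_value_counts_cache(df: pd.DataFrame) -> Callable[[str], pd.Series]:
    # Compute value_counts at most once per column and reuse it
    # across rules, rather than recomputing in each rule.
    cache: dict = {}

    def get(col: str) -> pd.Series:
        if col not in cache:
            cache[col] = df[col].value_counts()
        return cache[col]

    return get
```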

manycoding commented 5 years ago

@victor-torres It doesn't make much difference since object columns are that slow anyway. Pandas has its own caching. But I compared performance: (screenshot of the comparison, 2019-06-18)

ejulio commented 5 years ago

No major comments here. Maybe I can search a bit in pandas internals. I'd only mention that we need to be careful with a fast argument, because we already use something like that for fastjsonschema and we need to avoid misinterpretations :)

manycoding commented 5 years ago

> I'd only mention that we would need to be careful about fast argument

I thought we could add fast and, if it's True, use the fastest validations:

  1. Excluding nested data from categories (exclusion is done by taking the columns in common with flat_df; I think that's the fastest way. But we need to drop NaN columns from flat_df first, and for this dataset that's 1 minute)
  2. fastjsonschema instead of jsonschema
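Point 1 could be sketched like this, assuming the flattened frame comes from something like pandas.json_normalize (Arche's actual flat_df construction may differ):

```python
import pandas as pd

def non_nested_columns(df: pd.DataFrame) -> list:
    # Flatten the items, drop all-NaN columns, then keep only the
    # column names both frames share -- those are the non-nested
    # columns, safe to feed into value_counts.
    flat = pd.json_normalize(df.to_dict("records")).dropna(axis=1, how="all")
    return [c for c in df.columns if c in flat.columns]
```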

But we need to find a solution, since right now value_counts degrades performance 100-1000x when nested columns are present, which makes this particular dataset untestable.

Without nested data it takes around 30 seconds. With the full dataset, I stopped after 30 minutes. One nested column (photos) alone takes 13 minutes.
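Not from the thread, but a small self-contained way to see the gap locally: time value_counts on a numeric column versus an object column holding tuples (a stand-in for nested data). Sizes here are tiny compared to the dataset above, so expect a smaller ratio:

```python
import timeit
import pandas as pd

n = 10_000
# 100 distinct values repeated across n rows, numeric vs. object dtype.
numeric = pd.Series(range(100)).sample(n, replace=True).reset_index(drop=True)
nested = pd.Series([(i % 100, "x") for i in range(n)])

t_num = timeit.timeit(numeric.value_counts, number=10)
t_obj = timeit.timeit(nested.value_counts, number=10)
print(f"numeric: {t_num:.4f}s, object: {t_obj:.4f}s")
```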