Closed manycoding closed 5 years ago
Introduced in #100
The slowest are `object` columns, specifically nested data. The difference is 10-1000x.
@ejulio @Alexandr1988, I am thinking about alternatives:

- A `fast` argument, so the category method won't use `object` columns, and fast schema #5: `Arche(.., fast=True)`. Default `False`.
- ~~`flat_df` - very fast, but the categories pictures in this case are not the same since data is flattened~~ - doesn't make any sense since flat data is sparse.

If there's no cache, it seems like you're calculating `value_counts` twice.
See: https://github.com/scrapinghub/arche/pull/100/files#r295002793
You could use a generator to reduce the number of calls to the function. See: https://github.com/scrapinghub/arche/pull/100/files#r295004052
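The caching idea above can be sketched as a small memoizing wrapper. This is an illustration only, assuming a hypothetical `CachedCounts` class — it is not Arche's actual cache:

```python
import pandas as pd

# Hypothetical sketch (not Arche's API): compute value_counts() once per
# column and serve repeated calls from a cache.
class CachedCounts:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self._cache: dict = {}

    def value_counts(self, column: str) -> pd.Series:
        # Compute on first access only; later calls reuse the stored Series.
        if column not in self._cache:
            self._cache[column] = self.df[column].value_counts()
        return self._cache[column]
```

A second call for the same column returns the cached Series object instead of recounting.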
@victor-torres It doesn't make much difference since `object` columns are that slow anyway. Pandas has its own caching. But I compared performance.
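A rough way to compare the two dtypes yourself (this is an illustrative sketch, not the measurement from the thread) is to time `value_counts()` on the same data stored as `object` vs `category`:

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative benchmark: value_counts() on an `object` column vs the
# same data converted to the `category` dtype.
n = 100_000
values = np.random.choice(list("abcd"), size=n)
obj_col = pd.Series(values, dtype=object)
cat_col = obj_col.astype("category")

t_obj = timeit.timeit(obj_col.value_counts, number=20)
t_cat = timeit.timeit(cat_col.value_counts, number=20)
print(f"object dtype: {t_obj:.3f}s  category dtype: {t_cat:.3f}s")

# Either way, the resulting counts are identical.
assert obj_col.value_counts().to_dict() == cat_col.value_counts().to_dict()
```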
No major comments here.
Maybe I can search a bit on pandas internals.
I'd only mention that we would need to be careful about the `fast` argument, because we already use something like that for `fastjsonschema` and we need to avoid misinterpretations :)
> I'd only mention that we would need to be careful about `fast` argument
I thought we could add a `fast` argument and, if it's `True`, use the fastest validations.
Using `flat_df` columns, I think, is the fastest way. But we need to drop NaN columns from `flat_df` first (for this dataset it's 1 minute). And we need to find a solution, since right now `value_counts` degrades performance 100-1000x when nested columns are present, which makes this particular dataset untestable. Without nested data it takes around 30 seconds. Full dataset - I stopped after 30 minutes. One nested column (photos) takes 13 minutes.
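The "drop NaN columns first" step above can be sketched like this. The column names are made up for illustration; only `dropna(axis=1, how="all")` is the actual suggestion:

```python
import pandas as pd

# After flattening, flat data is sparse: many columns are entirely NaN.
# Drop those before running value_counts() on each remaining column.
flat_df = pd.DataFrame({
    "name": ["a", "b"],
    "photos.0.url": ["http://x", None],
    "photos.1.url": [None, None],  # present in the schema but never filled
})
flat_df = flat_df.dropna(axis=1, how="all")  # removes only all-NaN columns
counts = {col: flat_df[col].value_counts() for col in flat_df.columns}
```

Columns that are merely sparse (like `photos.0.url` here) survive; only fully empty ones are removed.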
More data to follow. Because `value_counts` is slow, any big df makes `report_all` awfully slow.
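The proposed `fast` flag could look something like the sketch below. The function and flag names here are hypothetical, not Arche's real API — the point is just that skipping `object` columns sidesteps the 10-1000x slowdown:

```python
import pandas as pd

# Hypothetical sketch of the proposed `fast` flag: skip `object` columns
# (which hold nested data and dominate runtime) when counting categories.
def category_counts(df: pd.DataFrame, fast: bool = False) -> dict:
    cols = df.columns
    if fast:
        # object columns are the 10-1000x slow ones per the thread
        cols = [c for c in cols if df[c].dtype != object]
    return {c: df[c].value_counts() for c in cols}
```

With `fast=True`, a df containing one numeric and one `object` column would only have the numeric column counted.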