scrapy / itemloaders

Library to populate items using XPath and CSS with a convenient API
BSD 3-Clause "New" or "Revised" License
45 stars 16 forks source link

Speed up arg_to_iter #51

Closed Gallaecio closed 10 months ago

Gallaecio commented 2 years ago

Fixes #50

If we restrict the subset of sequence and generator types that we allow arg_to_iter to handle, we can remove the use of is_item altogether to significantly improve its performance.

I tested performance with:

from timeit import timeit

from itemloaders.utils import arg_to_iter

def a():
    return arg_to_iter({})

print(timeit(a, number=1000000))

Results:

codecov[bot] commented 2 years ago

Codecov Report

Merging #51 (93d0dd7) into master (760e925) will decrease coverage by 0.01%. Report is 2 commits behind head on master. The diff coverage is 100.00%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #51 +/- ## ========================================== - Coverage 99.25% 99.24% -0.01% ========================================== Files 4 4 Lines 267 266 -1 ========================================== - Hits 265 264 -1 Misses 2 2 ``` | [Files](https://app.codecov.io/gh/scrapy/itemloaders/pull/51?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy) | Coverage Δ | | |---|---|---| | [itemloaders/utils.py](https://app.codecov.io/gh/scrapy/itemloaders/pull/51?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy#diff-aXRlbWxvYWRlcnMvdXRpbHMucHk=) | `93.54% <100.00%> (-0.21%)` | :arrow_down: |
Gallaecio commented 2 years ago

The documentation issue is caused by https://github.com/scrapy/scrapy/issues/5402

wRAR commented 2 years ago

So I guess this changes the result from true to false for some types not listed in isinstance?

Gallaecio commented 2 years ago

It does.

I think supporting only a limited number of iterables as input for values is a low price for this much of a performance improvement. To be honest, I would have even dropped generator support be it not for the existing tests for it.

I think it would be best to document the specific iterables that are supported as input for values, limit the implementation to that subset, and let users do the required conversion for scenarios where custom types are needed.

Not a strong opinion, though. But since is_item is more likely to support a wider range of values, and likely to become slower as well as a result, I think avoiding its usage here is key for performance.

ava7 commented 5 months ago

It does.

I think supporting only a limited number of iterables as input for values is a low price for this much of a performance improvement. To be honest, I would have even dropped generator support be it not for the existing tests for it.

I think it would be best to document the specific iterables that are supported as input for values, limit the implementation to that subset, and let users do the required conversion for scenarios where custom types are needed.

Not a strong opinion, though. But since is_item is more likely to support a wider range of values, and likely to become slower as well as a result, I think avoiding its usage here is key for performance.

@Gallaecio Any chance of keeping the set? :pleading_face:

Gallaecio commented 5 months ago

Seems reasonable to me, feel free to open a PR.