scrapy / scurl

Performance-focused replacement for Python urllib
Apache License 2.0

Performance improvement #42

Closed malloxpb closed 6 years ago

malloxpb commented 6 years ago

This PR aims to improve the performance of SCURL.

codecov[bot] commented 6 years ago

Codecov Report

Merging #42 into master will decrease coverage by 0.21%. The diff coverage is 61.86%.

@@            Coverage Diff             @@
##           master      #42      +/-   ##
==========================================
- Coverage   61.58%   61.37%   -0.22%     
==========================================
  Files           2        2              
  Lines         315      321       +6     
==========================================
+ Hits          194      197       +3     
- Misses        121      124       +3

Impacted Files            Coverage Δ
scurl/canonicalize.pyx    19.46% <12.5%> (-1.96%) ↓
scurl/cgurl.pyx           84.13% <74.46%> (+4.41%) ↑

malloxpb commented 6 years ago

Hey @lopuhin, in this PR I have improved the performance of urljoin and canonicalize_url by doing the following:

These changes bring a significant performance increase to canonicalize_url as well as urljoin. In particular, the rate of link extraction with canonicalize_url increased from 31k links/sec to 44k links/sec. But let me know if there's anything that concerns you :)
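
For reference, a links/sec figure like the one above could be measured roughly as follows. This is only a minimal sketch, not the benchmark used in this PR: it times the standard library's `urllib.parse.urljoin` as a stand-in (scurl is assumed to expose a compatible `urljoin`), and the helper `links_per_second`, the base URL, and the sample links are all hypothetical.

```python
import time
from urllib.parse import urljoin  # stdlib stand-in; swap in scurl's urljoin if available (assumption)


def links_per_second(join, base, links, repeat=5):
    """Return the best observed throughput of join(base, link), in links/sec."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for link in links:
            join(base, link)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return len(links) / best


# Hypothetical sample data; a real benchmark would use links extracted from crawled pages.
BASE = "https://example.com/a/b/index.html"
LINKS = ["../c/page%d.html?x=%d" % (i, i) for i in range(10_000)]

rate = links_per_second(urljoin, BASE, LINKS)
print("%.0fk links/sec" % (rate / 1000))
```

Running the same helper against two implementations (or two commits) on identical input gives a before/after comparison; absolute numbers depend heavily on the machine and the link set.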

malloxpb commented 6 years ago

For some reason, cgurl.cpp conflicts with the one on the master branch...

malloxpb commented 6 years ago

Hey @lopuhin, I have cleaned up the code even further by moving all the canonicalize code to canonicalize.pyx :) I think this PR is ready!

malloxpb commented 6 years ago

Hey @lopuhin, yeah, sorry about that... I did not notice yesterday 😄 I just fixed it, and the build is green now.