Closed pdelteil closed 3 years ago
Thanks for flagging this. I will look into the inconsistency. For reference, can you let me know how many URLs are in your dataset so I can troubleshoot the slow response time?
Hi there @honoki,
Around 350k urls.
I believe the slow response time is mostly due to downloading the 350k URLs from the BBRF server as uncompressed plain text. CouchDB does not support compression by default, so a huge JSON response (which is what 350k URLs amounts to) takes a while to transfer. I have some ideas for improving the BBRF server Docker image with an NGINX reverse proxy that enables compression, to see how it fares, but nothing is in the pipeline yet.
Can you verify the time it takes to download the view manually:
time curl $(jq -r .couchdb ~/.bbrf/config.json)'/_design/bbrf/_view/search_tags?key=\["source","httpx"\]' -i -u bbrf:password
If this is already really slow, at least I know there's not much use in looking at the client code. 😅
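For a rough idea of how much compression could help, here is a minimal shell sketch that compares the raw and gzip-compressed size of a saved response. The helper name `compare_sizes` and the file name `view.json` are made up for illustration and are not part of bbrf. Running gzip locally only approximates what a compressing reverse proxy would send.

```shell
# Hypothetical helper (not part of bbrf): print the raw and gzip-compressed
# size in bytes of a saved response body.
compare_sizes() {
  raw=$(wc -c < "$1" | tr -d ' ')
  gz=$(gzip -c "$1" | wc -c | tr -d ' ')
  echo "raw=$raw gzip=$gz"
}

# Example usage, assuming the same URL and credentials as the curl command above:
#   curl -s -u bbrf:password -o view.json \
#     "$(jq -r .couchdb ~/.bbrf/config.json)"'/_design/bbrf/_view/search_tags?key=\["source","httpx"\]'
#   compare_sizes view.json
```

If the gzip figure is much smaller than the raw one, the transfer is probably worth compressing. If the download is already fast, the bottleneck is more likely elsewhere.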
As for the described inconsistency: the fix appears to be a simple change to https://github.com/honoki/bbrf-client/blob/master/bbrf/bbrf.py#L835 which I've got lined up for 1.1.8
This is the output of the command above (~326K URLs):
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.7M    0 22.7M    0     0  3892k      0 --:--:--  0:00:05 --:--:-- 3883k
real 0m6.034s
user 0m0.315s
sys 0m0.260s
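As a quick sanity check on these numbers, the sketch below recomputes the transfer time from the size and average speed that curl reported. It assumes curl's `M` and `k` suffixes are binary units (MiB and KiB/s). The function name `transfer_seconds` is made up for illustration.

```shell
# Hypothetical helper: estimate transfer time in seconds from a size in MiB
# and an average speed in KiB/s, as shown in curl's progress meter.
transfer_seconds() {
  awk -v size="$1" -v speed="$2" 'BEGIN { printf "%.2f\n", size * 1024 / speed }'
}

transfer_seconds 22.7 3892   # prints 5.97
```

That is close to the 6.034s wall-clock time, so the raw download takes about 6 seconds. The more than 1 minute reported for the bbrf command in this thread therefore seems to be spent mostly outside the download itself.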
Hi @pdelteil - OK, it looks like this may need some work on the client side. For the sake of clarity, can you create a new bug for the slow processing issue? I'm closing this one, as the inconsistency when using source should be resolved.
Sure, I will.
Thanks a lot!
I was testing the use of this syntax to retrieve urls by source:
> bbrf urls -p PROGRAM where source is 'httpx'
It takes some time (more than 1 minute), but it works.
But, if I do the following:
It retrieves all URLs with the source 'httpx', from all programs.
I'm using v1.1.7 and the latest server update.