Leaving this here for anyone who runs into the same problem in the future.
I managed to solve my problem. The issue was with how the POST request was being sent through cURL, not with Scrapyd.
After inspecting the request:
curl -v http://example.herokuapp.com/schedule.json -d project=default -d spider=my_spider -d target_url="https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377" --trace-ascii /dev/stdout
I got:
Warning: --trace-ascii overrides an earlier trace/verbose option
== Info: Trying 52.45.74.184...
== Info: TCP_NODELAY set
== Info: Connected to example.herokuapp.com (52.45.74.184) port 80 (#0)
=> Send header, 177 bytes (0xb1)
0000: POST /schedule.json HTTP/1.1
001e: Host: example.herokuapp.com
0043: User-Agent: curl/7.54.0
005c: Accept: */*
0069: Content-Length: 164
007e: Content-Type: application/x-www-form-urlencoded
00af:
=> Send data, 164 bytes (0xa4)
0000: project=default&spider=example&target_url=https://www.example.co
0040: m/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000
0080: &idtt=2,5&naturebien=1,2,4&ci=910377
== Info: upload completely sent off: 164 out of 164 bytes
Apparently, since the POST body is form-urlencoded, it is parsed as:
project=default&spider=example&target_url=https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
Every unencoded & is treated as the start of a new parameter. So the only part of the URL that ends up in the target_url argument is https://www.example.com/list.htm?tri=initial,
and the rest is parsed as additional parameters of the POST request.
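A quick way to see this splitting is Python's standard urllib.parse, which parses a form-urlencoded body roughly the same way the server does (the body is shortened here for readability):

from urllib.parse import parse_qs

# The raw body curl sent, with the target_url value left unencoded:
body = ("project=default&spider=example"
        "&target_url=https://www.example.com/list.htm?tri=initial"
        "&enterprise=0&idtypebien=2,1")

# Every bare "&" starts a new parameter, so target_url is cut short:
print(parse_qs(body))
# {'project': ['default'], 'spider': ['example'],
#  'target_url': ['https://www.example.com/list.htm?tri=initial'],
#  'enterprise': ['0'], 'idtypebien': ['2,1']}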
After using Postman and trying the following POST request:
POST /schedule.json HTTP/1.1
Host: example.herokuapp.com
Content-Type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW
cache-control: no-cache
Postman-Token: 004990ad-8f83-4208-8d36-529376b79643

------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="project"

default
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="spider"

my_spider
------WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name="target_url"

https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
------WebKitFormBoundary7MA4YWxkTrZu0gW--
It worked, and the job started successfully on Scrapyd!
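For what it's worth, Postman isn't required for this: curl's -F option builds the same kind of multipart/form-data request, so something like the following should presumably work as well:

curl http://example.herokuapp.com/schedule.json \
  -F project=default \
  -F spider=my_spider \
  -F target_url="https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377"

Because multipart bodies delimit fields with the boundary string rather than &, the URL needs no special treatment.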
The URL to pass in:
https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
To use curl:
curl http://127.0.0.1:6800/schedule.json \
-d project=demo \
-d spider=test \
-d jobid=2019-08-09T21_34_17 \
--data-urlencode "arg1=https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377"
Result:
2019-08-09 21:47:29 [test] DEBUG: self.arg1: https://www.example.com/list.htm?tri=initial&enterprise=0&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377
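This works because --data-urlencode percent-encodes the value before sending it, so every & inside the URL becomes %26 and is no longer mistaken for a parameter separator. If you schedule jobs from Python rather than the shell, the requests library applies the same encoding automatically when you pass a dict (a sketch, assuming the requests package is installed and the same parameter names as above):

import requests

# Each value is form-urlencoded for you, so the "&" characters
# inside arg1 are sent as "%26" and survive intact.
resp = requests.post(
    "http://127.0.0.1:6800/schedule.json",
    data={
        "project": "demo",
        "spider": "test",
        "jobid": "2019-08-09T21_34_17",
        "arg1": "https://www.example.com/list.htm?tri=initial&enterprise=0"
                "&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377",
    },
)
print(resp.json())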
Besides, you can simply use ScrapydWeb to schedule the job from its web UI.
I added the following code to my Spider class so I can pass the URL in as an argument (the replace call removes the backslashes introduced by terminal escaping):
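(The code block itself didn't survive here; below is a minimal sketch of what it plausibly looked like. The names my_spider and target_url are taken from the curl commands above; the rest is assumption.)

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def __init__(self, target_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Strip the backslashes that terminal escaping introduces,
        # e.g. "...\&enterprise=0" -> "...&enterprise=0".
        self.start_urls = [target_url.replace("\\", "")]

    def parse(self, response):
        self.logger.debug("Parsing %s", response.url)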
The spider recognizes the URL, starts parsing, and closes cleanly when I run it locally:
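(The exact command didn't survive here; presumably something like this, using Scrapy's -a option to pass the spider argument, with the & characters backslash-escaped for the shell:)

scrapy crawl my_spider -a target_url=https://www.example.com/list.htm?tri=initial\&enterprise=0\&idtypebien=2,1\&pxMax=1000000\&idtt=2,5\&naturebien=1,2,4\&ci=910377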
However, when I do the same thing through Scrapyd, scheduling the job with the curl command shown at the top, I get an error because the URL isn't parsed the same way as it is with scrapy crawl.
LOG:
After some experimentation, I discovered that when the URL is passed as a spider argument through Scrapyd, the value is cut off at the first & character.
I thought Scrapyd automatically urlencodes the passed URL, as pointed out in this issue, but decoding it didn't solve the problem. I also tried passing the URL without the terminal escaping; no luck.
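(What ultimately fixed this, per the answers above, was percent-encoding the argument so the & characters reach Scrapyd intact, which is exactly what curl's --data-urlencode does. In Python terms:)

from urllib.parse import quote

url = ("https://www.example.com/list.htm?tri=initial&enterprise=0"
       "&idtypebien=2,1&pxMax=1000000&idtt=2,5&naturebien=1,2,4&ci=910377")

# Every "&" becomes "%26", so the form parser no longer splits on it.
print(quote(url, safe=""))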