scrapy-plugins / scrapy-splash

Scrapy+Splash for JavaScript integration
BSD 3-Clause "New" or "Revised" License

Required argument is missing: url #125

Closed sarbazx closed 7 years ago

sarbazx commented 7 years ago

{"info": {"argument": "url", "type": "argument_required", "description": "Required argument is missing: url"}, "type": "BadOption", "error": 400, "description": "Incorrect HTTP API arguments"}

I'm getting the error above with a SplashRequest like this:

    start_urls = ["http://www.gittigidiyor.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 1.5, 'http_method': 'POST'}, endpoint='render.json')
kmike commented 7 years ago

@sarbazx have you enabled all the options and middlewares, as explained in the README?

sarbazx commented 7 years ago

@kmike yes

SPLASH_URL = 'http://172.17.0.1:8050/'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
FEED_EXPORT_ENCODING = 'utf-8'
DOWNLOAD_DELAY = 1.25
kmike commented 7 years ago

I think there could be a problem with the way start_requests is handled in Splash. I'll look at it in more detail tomorrow.

In the meantime, could you try working around it by sending requests in a parse callback instead of start_urls? You may use a local file for a "fake" request, or send a request to http://example.com - does the following work?

    start_urls = ["http://www.gittigidiyor.com/"]

    def start_requests(self):
        yield scrapy.Request('http://example.com', self.fake_start_requests)

    def fake_start_requests(self, response):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                args={'wait': 1.5, 'http_method': 'POST'},
                endpoint='render.json',
            )
sarbazx commented 7 years ago

@kmike tried, same error.

kmike commented 7 years ago

Hm, can you try the example from the repo (https://github.com/scrapy-plugins/scrapy-splash/tree/master/example)? It is not that different from what you're doing, and it works for me. Maybe you can modify the example to do the same as your project; it can be easier to spot the issue that way.

lovesuper commented 7 years ago

So, any update on this?

toddpod commented 6 years ago

@sarbazx How did you solve this problem? I got the same error and can't find the answer.

sarbazx commented 6 years ago

@toddpod I couldn't solve it, I just switched to Ruby 😂😂

haoshuncheng commented 6 years ago

Same problem, help! @toddpod

lopuhin commented 6 years ago

FWIW, I tried to reproduce the error using the code from the first comment and didn't get it; everything works as expected: a 504 error when making a POST request and a 200 response when making a GET request.

If someone could provide a more complete example that shows how the error happens, that would help.
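For anyone who wants a starting point, here is a minimal self-contained version of the setup from the first comment (a sketch: it assumes Splash is reachable at http://localhost:8050 and scrapy-splash is installed; the spider name is made up):

    import scrapy
    from scrapy_splash import SplashRequest

    class GittigidiyorSpider(scrapy.Spider):
        name = 'gittigidiyor'  # hypothetical name for illustration
        start_urls = ['http://www.gittigidiyor.com/']

        # the README settings, inlined so the example is self-contained
        custom_settings = {
            'SPLASH_URL': 'http://localhost:8050/',
            'DOWNLOADER_MIDDLEWARES': {
                'scrapy_splash.SplashCookiesMiddleware': 723,
                'scrapy_splash.SplashMiddleware': 725,
                'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
            },
            'SPIDER_MIDDLEWARES': {
                'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
            },
            'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        }

        def start_requests(self):
            for url in self.start_urls:
                # GET rather than POST; per the comment above, POST gives a 504
                yield SplashRequest(url, self.parse, args={'wait': 1.5},
                                    endpoint='render.json')

        def parse(self, response):
            yield {'title': response.css('title::text').get()}

If this runs cleanly but your project does not, diffing the settings of the two is usually the fastest way to find the culprit.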

eneyi commented 5 years ago

This was resolved after specifying USER_AGENT = 'my_user_agent' in settings.py.
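For anyone trying this fix, it is a single line in settings.py ('my_user_agent' is just a placeholder; substitute a real browser user agent string):

    # settings.py -- placeholder value, substitute a real browser UA string
    USER_AGENT = 'my_user_agent'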

azmirfakkri commented 5 years ago

Hi, I'm getting this error:

{"error": 400, 
  "info": {"argument": "lua_source", 
           "type": "argument_required", 
           "description": "Required argument is missing: lua_source"}, 
  "type": "BadOption", 
  "description": "Incorrect HTTP API arguments"
}

The path is specified correctly and the SplashRequest looks right. I've run this code both with and without Docker, and it gives the error above either way. Anyone have any idea where I could be wrong? Cheers.

lua_source = ''.join(open('/app/path/to/my_lua.lua').readlines())

def parse(self, response):
        url = 'https://www.someurl.com/'
        email = str(os.environ.get('EMAIL'))
        password = str(os.environ.get('PASSWORD'))
        user_agent = str(os.environ.get('BROWSER_USER_AGENT'))

        yield SplashRequest(
            url=url,
            callback=self.parse_result,
            endpoint='execute',
            args={'lua_source': lua_source,
                  'timeout': 300,
                  'email': email,
                  'pass': password,
                  'search_text': self.search_text,
                  'user_agent': user_agent,
                  'latitude': str(self.latitude),
                  'longitude': str(self.longitude)
                  },
            splash_url=os.environ.get('SPLASH_URL')
        )
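One thing worth ruling out here (a hedged guess, not a confirmed fix): if the script file is empty, or the path resolves differently inside Docker, lua_source ends up empty and the /execute endpoint has nothing to run. Loading the script defensively makes that failure mode visible:

    # defensive version of the lua_source loading above; LUA_PATH is the
    # path from the comment and may differ in your setup
    LUA_PATH = '/app/path/to/my_lua.lua'

    with open(LUA_PATH) as f:
        lua_source = f.read()

    # fail fast instead of sending an empty script to Splash
    if not lua_source.strip():
        raise ValueError('Lua script at %s is empty' % LUA_PATH)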
0x81ec commented 5 years ago

Same problem here.

codewithpatch commented 4 years ago

I'm having the same problem. I tried a lot of workarounds, but I still can't get the JavaScript-generated HTML. render.html returns error 400:

{"error": 400, "type": "BadOption", "description": "Incorrect HTTP API arguments", "info": {"type": "argument_required", "argument": "url", "description": "Required argument is missing: url"}}

This is my spider.py:

    start_urls = ['https://www.empresia.es/empresa/repsol/']

    def start_requests(self):
        yield scrapy.Request('http://example.com', self.fake_start_requests)

    def fake_start_requests(self, response):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                args={'wait': 1.5, 'http_method': 'POST'},
                                endpoint='render.html')

    def parse(self, response):
        open_in_browser(response)
        title = response.css("title").extract()
        # har = response.data["har"]["log"]["pages"]
        headers = response.headers.get('Content-Type')
        names = response.css('.fa-user-circle-o+ a::text').extract()
        yield {
            'title': title,
            # 'har': har,
            'headers': headers,
            'names': names,
            'length': len(names)
        }

and here is my settings.py:

    # settings.py
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

    # Splash settings
    DOWNLOADER_MIDDLEWARES = {
        # Engine side
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        # Downloader side
    }

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    SPLASH_URL = 'http://127.0.0.1:8050/'
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

JavierRuano commented 4 years ago

Have you tried with the example? Perhaps the problem is something else: https://splash.readthedocs.io/en/stable/api.html

Curl example:

    curl 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'


codewithpatch commented 4 years ago

I tried:

    curl -o 'log.html' 'http://localhost:8050/render.html?url=https://www.empresia.es/empresa/repsol/&timeout=10&wait=0.5'

and I still don't get the HTML that I need. It just returns the HTML from before the JavaScript runs.
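If a fixed wait isn't long enough for this page, the usual next step is the execute endpoint with a small Lua script instead of render.html. A sketch under that assumption (not a verified fix for this particular site):

    import scrapy
    from scrapy_splash import SplashRequest

    # minimal Splash Lua script: load the page, wait, return the rendered HTML
    LUA_SCRIPT = """
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(args.wait))
        return {html = splash:html()}
    end
    """

    class EmpresiaSpider(scrapy.Spider):
        name = 'empresia'  # hypothetical name for illustration

        def start_requests(self):
            yield SplashRequest(
                'https://www.empresia.es/empresa/repsol/',
                self.parse,
                endpoint='execute',
                args={'lua_source': LUA_SCRIPT, 'wait': 5, 'timeout': 30},
            )

        def parse(self, response):
            # with scrapy-splash magic responses, the returned 'html' key
            # becomes the response body, so selectors work as usual
            self.logger.info('title: %s', response.css('title::text').get())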

eneyi commented 4 years ago

@codewithpatch Not sure this is relevant, but it worked when I added the 'www' to the URL, so 'http://www.example.com' in start_requests. Hope it helps.

adam-poole-kr commented 2 years ago

Looks like this error also happens if you are missing liquibase.command.url in your liquibase.properties file, or didn't pass it in as an argument when running Liquibase.