Closed pekorinko closed 3 years ago
Bad URL
URL that can be scraped
URI::InvalidURIError in ResultsController#create
bad URI(is not URI?): "https://www.google.com/search?rlz=1C5CHFA_enJP939JP939&tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=ALeKk02BIgdj-xn18bahS-OBGfwlooPbig:1617887991230&q=%E4%B8%89%E9%B7%B9+%E3%82%AB%E3%83%95%E3%82%A7&rflfq=1&num=10&sa=X&ved=2ahUKEwijsIO43u7vAhUVVpQKHRAyCewQjGp6BAgGEFI&biw=1065&bih=614#lrd=0x6018efc9ce1628e5:0x8ff90a0be4382a21,1,,,&rlfi=hd:;si:10374334262814452257,l,ChDkuInpt7kg44Kr44OV44KnWioKEOS4iem3uSDjgqvjg5XjgqciEOS4iem3uSDjgqvjg5XjgqcqBAgDEAGSAQtjb2ZmZWVfc2hvcKoBHAoJL20vMDIwZmIyEAEqDSIJ44Kr44OV44KnKAg,y,iGQHtR5exv0;mv:[[35.706644499999996,139.5884908],[35.679823,139.54468549999999]]"
https://www.google.com/search?rlz=1C5CHFA_enJP939JP939&tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=ALeKk02BIgdj-xn18bahS-OBGfwlooPbig:1617887991230&q=%E4%B8%89%E9%B7%B9+%E3%82%AB%E3%83%95%E3%82%A7&rflfq=1&num=10&sa=X&ved=2ahUKEwijsIO43u7vAhUVVpQKHRAyCewQjGp6BAgGEFI&biw=1065&bih=614#lrd=0x6018efc9ce1628e5:0x8ff90a0be4382a21,1,,,&rlfi=hd:;si:10374334262814452257,l,ChDkuInpt7kg44Kr44OV44KnWioKEOS4iem3uSDjgqvjg5XjgqciEOS4iem3uSDjgqvjg5XjgqcqBAgDEAGSAQtjb2ZmZWVfc2hvcKoBHAoJL20vMDIwZmIyEAEqDSIJ44Kr44OV44KnKAg,y,iGQHtR5exv0
https://www.google.com/search?q=%E5%B9%B3%E5%A1%9A+%E3%82%AB%E3%83%95%E3%82%A7&rlz=1C5CHFA_enJP939JP939&tbm=lcl&sxsrf=ALeKk002n5E6WBTpY35NYkbwGaKYjs7qjg%3A1617890452060&ei=lAxvYKufA5HpmAXfv5zQBg&oq=%E5%B9%B3%E5%A1%9A+%E3%82%AB%E3%83%95%E3%82%A7&gs_l=psy-ab.3..0i4k1j0i7i30k1j0i4k1j0i7i30k1j0i4k1j0i7i30k1j0i4k1j0i7i30k1.50969.54092.0.55029.11.11.0.0.0.0.498.1712.1j4j2j0j1.8.0....0...1c.1j4.64.psy-ab..5.6.1327...38j35i39k1j0i67k1j0i7i4i30k1.0.ZaZT-BnYi_4#lrd=0x6019ad2441313afb:0xdd80ec42ac8b598c,1,,,&rlfi=hd:;si:15961016850507848076,l,ChDlubPloZog44Kr44OV44KnWiEKCeOCq-ODleOCpyIQ5bmz5aGaIOOCq-ODleOCpyoCCAOSARtqYXBhbmVzZV9pemFrYXlhX3Jlc3RhdXJhbnSqARwKCS9tLzAyMGZiMhABKg0iCeOCq-ODleOCpygI,y,ENeXqOozp2s;mv:[[35.3831256,139.3652272],[35.3149229,139.3256056]]
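mv:[[35.3831256,139.3652272],[35.3149229,139.3256056]]
appears to be the latitude/longitude range of the map area that shows cafes when searching for 「平塚 カフェ」.
Use a regular expression to remove
mv:[[35.706644499999996,139.5884908],[35.679823,139.54468549999999]]
from the URL and keep the rest of the URL.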
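As a rough illustration of the removal described above, here is a minimal Ruby sketch. The helper name `strip_map_viewport`, the constant `MV_SEGMENT`, and the shortened sample URL are assumptions made for this example; none of them come from the actual app.

```ruby
# Hypothetical pattern: an optional ";" followed by "mv:[[" and everything
# up to the next ";" (or the end of the string).
MV_SEGMENT = /;?mv:\[\[[^;]*/

# Hypothetical helper: returns the URL with the mv segment removed.
def strip_map_viewport(url)
  url.sub(MV_SEGMENT, '')
end

# Shortened stand-in for the real search URLs in this thread.
sample = 'https://www.google.com/search?q=cafe&tbm=lcl#lrd=0x1:0x2,1,,,&rlfi=hd:;si:123,l,abc' \
         ';mv:[[35.3831256,139.3652272],[35.3149229,139.3256056]]'

puts strip_map_viewport(sample)
```

A URL that has no `mv:[[` segment passes through unchanged, since `sub` only replaces a match when one exists.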
class ResultsController < ApplicationController
  def new; end
  def create
    url = params[:url]
    # Clean up the URL with a regular expression here
    place_data_scraper = MyTools::PlaceDataScraper.new(url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  end
end
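↑ This is where the URL should be cleaned up with a regular expression.
In results_controller.rb, the URLs that could not be scraped were fixed by removing the latitude/longitude part (everything from mv onward).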
class ResultsController < ApplicationController
  def new; end
  def create
    url = params[:url]
    url = url.split('mv')[0] if url.include?('mv')
    place_data_scraper = MyTools::PlaceDataScraper.new(url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  end
end
A URL that still failed even after removing everything from mv onward:
https://www.google.com/search?q=%E3%82%AB%E3%83%95%E3%82%A7%20%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF&rlz=1C5CHFA_enJP939JP939&sxsrf=ALeKk00zpw5cUObozsVIUmv0GYPLFySEVA:1617938869297&ei=qslvYKSCKsKzmAW07rq4Ag&oq=%E3%82%AB%E3%83%95%E3%82%A7%E3%80%80%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF&gs_lcp=Cgdnd3Mtd2l6EAMYADIECCMQJzIECAAQQzIECAAQBDIECAAQQzIECAAQBDIECAAQBDIECAAQBDIECAAQBDoHCCMQsAMQJzoHCAAQRxCwAzoHCCMQ6gIQJzoHCAAQsQMQBDoKCAAQsQMQgwEQBDoJCCMQJxBGEP8BOgoIABCxAxCxAxAEOhAIABCxAxCDARCxAxCDARAEOgUIABCxAzoGCAAQBBAKUJIWWLQ4YMlKaAJwAngAgAHJAYgB-RaSAQYxLjE3LjGYAQCgAQGqAQdnd3Mtd2l6sAEKyAEJwAEB&sclient=gws-wiz&tbs=lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=10977287492993434335&lqi=Chbjgqvjg5Xjgqcg44Oq44K844OD44K_IgOIAQFaOAoW44Kr44OV44KnIOODquOCvOODg-OCvyIW44Kr44OV44KnIOODquOCvOODg-OCvyoGCAIQABABkgEEY2FmZQ&ved=2ahUKEwiUicr8m_DvAhWDJaYKHafdB6MQvS4wAnoECAMQKQ&rlst=f#lrd=0x6018f51fe09589bd:0x985728d0913642df,1,,,&rlfi=hd:;si:10977287492993434335,l,Chbjgqvjg5Xjgqcg44Oq44K844OD44K_IgOIAQFaOAoW44Kr44OV44KnIOODquOCvOODg-OCvyIW44Kr44OV44KnIOODquOCvOODg-OCvyoGCAIQABABkgEEY2FmZQ;mv:[[35.6132762,139.6720453],[35.6100405,139.62310829999998]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4
https://www.google.com/search?hl=ja&tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=ALeKk017QoV-6BDtnVYqWM0i9BPBu9nqFQ:1617939413997&q=%E5%B9%B3%E5%A1%9A%E3%80%80%E3%82%AB%E3%83%95%E3%82%A7&rflfq=1&num=10&ved=2ahUKEwiM3KeAnvDvAhUUxYsBHaOSAEsQtgN6BAgDEAc#lrd=0x6019acfedaf8fc61:0x9bfb9b70ec7ae435,1,,,&rlfi=hd:;si:11239748204339323957,l,ChLlubPloZrjgIDjgqvjg5XjgqdIl76S4JurgIAIWicKCeOCq-ODleOCpxABGAAYASIQ5bmz5aGaIOOCq-ODleOCpyoCCAOSAQRjYWZlqgEcCgkvbS8wMjBmYjIQASoNIgnjgqvjg5XjgqcoCA,y,jQoKoFqS9NY;mv:[[35.448799974143306,139.62492307407229],[35.19164237952109,139.18203672153322],null,[35.32032340479415,139.40347989780275],11]
https://www.google.com/search?q=%E3%82%AB%E3%83%95%E3%82%A7%20%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF&rlz=1C5CHFA_enJP939JP939&sxsrf=ALeKk00zpw5cUObozsVIUmv0GYPLFySEVA:1617938869297&ei=qslvYKSCKsKzmAW07rq4Ag&oq=%E3%82%AB%E3%83%95%E3%82%A7%E3%80%80%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF&gs_lcp=Cgdnd3Mtd2l6EAMYADIECCMQJzIECAAQQzIECAAQBDIECAAQQzIECAAQBDIECAAQBDIECAAQBDIECAAQBDoHCCMQsAMQJzoHCAAQRxCwAzoHCCMQ6gIQJzoHCAAQsQMQBDoKCAAQsQMQgwEQBDoJCCMQJxBGEP8BOgoIABCxAxCxAxAEOhAIABCxAxCDARCxAxCDARAEOgUIABCxAzoGCAAQBBAKUJIWWLQ4YMlKaAJwAngAgAHJAYgB-RaSAQYxLjE3LjGYAQCgAQGqAQdnd3Mtd2l6sAEKyAEJwAEB&sclient=gws-wiz&tbs=lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=10977287492993434335&lqi=Chbjgqvjg5Xjgqcg44Oq44K844OD44K_IgOIAQFaOAoW44Kr44OV44KnIOODquOCvOODg-OCvyIW44Kr44OV44KnIOODquOCvOODg-OCvyoGCAIQABABkgEEY2FmZQ&ved=2ahUKEwiUicr8m_DvAhWDJaYKHafdB6MQvS4wAnoECAMQKQ&rlst=f#lrd=0x6018f51fe09589bd:0x985728d0913642df,1,,,&rlfi=hd:;si:10977287492993434335,l,Chbjgqvjg5Xjgqcg44Oq44K844OD44K_IgOIAQFaOAoW44Kr44OV44KnIOODquOCvOODg-OCvyIW44Kr44OV44KnIOODquOCvOODg-OCvyoGCAIQABABkgEEY2FmZQ;mv:[[35.6132762,139.6720453],[35.6100405,139.62310829999998]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4
Findings
When the URL contains the part
;tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4
the page does not navigate to the reviews.
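↓ ① Even though this is the URL of the reviews screen,
↓ ② when it is opened in a new tab, the page stays here and does not move on to the reviews screen.
The URL obtained by clicking 「Google口コミ(64)」 (Google reviews (64)) on screen ② above can be scraped even when it is opened in a new tab.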
https://www.google.com/search?q=%E3%82%AB%E3%83%95%E3%82%A7%20%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF&hl=ja&sxsrf=ALeKk02jUNVuIJaqNnWbwzghmu4OE_uBlQ:1617940528548&source=hp&ei=LNBvYOHDFcX_-Qbs5pawCA&iflsig=AINFCbYAAAAAYG_ePI36dB_uhJLq53Vca-U6--tQkKbi&oq=%E3%82%AB%E3%83%95%E3%82%A7&gs_lcp=Cgdnd3Mtd2l6EAMYADIECCMQJzIECAAQBDIECAAQQzIQCAAQsQMQgwEQsQMQgwEQBDIECAAQBDIKCAAQsQMQsQMQBDIECAAQBDIECAAQBDoJCCMQJxBGEP8BOgcIABCxAxAEOgoIABCxAxCDARAEUI0OWJgUYLkdaABwAHgBgAGnAogBsQiSAQUwLjUuMZgBAKABAaoBB2d3cy13aXo&sclient=gws-wiz&tbs=lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4&tbm=lcl&rflfq=1&num=10&rldimm=10977287492993434335&lqi=Chbjgqvjg5Xjgqcg44Oq44K844OD44K_IgOIAQFaOAoW44Kr44OV44KnIOODquOCvOODg-OCvyIW44Kr44OV44KnIOODquOCvOODg-OCvyoGCAIQABABkgEEY2FmZQ&ved=2ahUKEwi63eKTovDvAhVXed4KHaSKD_oQvS4wAnoECAMQKQ&rlst=f#lrd=0x6018f51fe09589bd:0x985728d0913642df,1,,,&rlfi=hd:;si:10977287492993434335,l,Chbjgqvjg5Xjgqcg44Oq44K844OD44K_IgOIAQFaOAoW44Kr44OV44KnIOODquOCvOODg-OCvyIW44Kr44OV44KnIOODquOCvOODg-OCvyoGCAIQABABkgEEY2FmZQ;mv:[[35.6132762,139.6720453],[35.6100405,139.62310829999998]];tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4
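【What the scrapable and non-scrapable URLs have in common】
The URL components (anchor, parameter structure) are the same.
For the URLs that cannot be scraped, the q=~ part is probably the relevant piece.
The following parameters are what make Google switch from the search screen to the Google Maps list view: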
hl=ja&tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=ALeKk017QoV-6BDtnVYqWM0i9BPBu9nqFQ:1617939413997
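↑ Appending the parameters above with & to the following URL, which cannot be scraped (only the search screen is displayed),
https://www.google.com/search?q=%E3%82%AB%E3%83%95%E3%82%A7%20%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF
got as far as the screen in the attached image.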
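The experiment above amounts to simple string concatenation, which can be sketched as follows. `append_params` is a hypothetical helper written for this example. Whether the `sxsrf` value stays valid across sessions is not something this sketch checks.

```ruby
# Parameters quoted in this thread as producing the map list view.
map_params = 'hl=ja&tbs=lf:1,lf_ui:9&tbm=lcl&sxsrf=ALeKk017QoV-6BDtnVYqWM0i9BPBu9nqFQ:1617939413997'
base_url = 'https://www.google.com/search?q=%E3%82%AB%E3%83%95%E3%82%A7%20%E3%83%AA%E3%82%BC%E3%83%83%E3%82%BF'

# Hypothetical helper: joins extra query parameters onto a URL,
# using "&" when the URL already has a query string and "?" otherwise.
def append_params(url, params)
  separator = url.include?('?') ? '&' : '?'
  "#{url}#{separator}#{params}"
end

combined = append_params(base_url, map_params)
puts combined
```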
Findings
When the URL contains the part
;tbs:lrf:!1m4!1u3!2m2!3m1!1e1!2m1!1e3!3sIAE,lf:1,lf_ui:4
the page does not navigate to the reviews.
In results_controller.rb, this was resolved by trimming the URL as follows:
url = url.split(';tbs:lrf:')[0] if url.include?(';tbs:lrf:')
url = url.split('mv:[[')[0] if url.include?('mv:[[')
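One reason the more specific marker 'mv:[[' may be the safer choice: the two letters "mv" can also occur inside other parameter values. The sxsrf value of the カフェ リゼッタ URL earlier in this thread, for instance, contains "VIUmv0GY". The sketch below uses a shortened URL (an assumption for illustration) that keeps that sxsrf value.

```ruby
# Shortened stand-in URL; only the sxsrf value and the mv segment are taken
# from the thread, the rest is made up for this example.
url = 'https://www.google.com/search?q=cafe&sxsrf=ALeKk00zpw5cUObozsVIUmv0GYPLFySEVA:1617938869297' \
      '#lrd=0x1:0x2,1,,,&rlfi=hd:;si:123,l,abc' \
      ';mv:[[35.6132762,139.6720453],[35.6100405,139.62310829999998]]'

cut_on_bare_mv = url.split('mv')[0]    # splits at the first "mv", inside sxsrf
cut_on_marker  = url.split('mv:[[')[0] # splits only at the map viewport segment

puts cut_on_bare_mv
puts cut_on_marker
```

With the bare 'mv', the URL is cut in the middle of the sxsrf value and loses its anchor entirely; with 'mv:[[', everything up to the map viewport segment is kept.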
Newly found bad URL ①
URI::InvalidURIError (bad URI(is not URI?):
"https://www.google.com/search?rlz=1C5CHFA_enJP939JP939&tbs=lf:1,lf_ui:2&tbm=lcl&sxsrf=ALeKk00kOSDHMU6K4_hh7ufjqfda1V7xbw:1618129730154&q=%E5%90%89%E7%A5%A5%E5%AF%BA+%E7%BE%8E%E5%AE%B9%E9%99%A2&rflfq=1&num=10&ved=2ahUKEwjSnpD-4vXvAhXVBIgKHQOjD3EQtgN6BAggEAc#lrd=0x6018efed64a5bd7f:0x4117c2d34ded2ccc,1,,,&rlfi=hd:;si:4690431749730938060,l,ChPlkInnpaXlr7og576O5a656ZmiWiYKCue-juWuuSDpmaIiFOWQieelpeWvuiDnvo7lrrkg6ZmiKgIIA5IBCmhhaXJfc2Fsb26qARIQASoOIgrnvo7lrrkg6ZmiKAg;mv:[[35.708251499999996,139.5831203],[35.7006455,139.5739491]]"):
Newly found bad URL ②
https://www.google.com/search?rlz=1C5CHFA_enJP939JP939&tbs=lf:1,lf_ui:2&tbm=lcl&sxsrf=ALeKk00kOSDHMU6K4_hh7ufjqfda1V7xbw:1618129730154&q=%E5%90%89%E7%A5%A5%E5%AF%BA+%E7%BE%8E%E5%AE%B9%E9%99%A2&rflfq=1&num=10&ved=2ahUKEwjSnpD-4vXvAhXVBIgKHQOjD3EQtgN6BAggEAc#lrd=0x6018ee47f9692c31:0xa91d9f56ebe51e40,1,,,&rlfi=hd:;si:12186071362408095296,l,ChPlkInnpaXlr7og576O5a656ZmiWiYKCue-juWuuSDpmaIiFOWQieelpeWvuiDnvo7lrrkg6ZmiKgIIA5IBCmhhaXJfc2Fsb26qARIQASoOIgrnvo7lrrkg6ZmiKAg;mv:[[35.708251499999996,139.5831203],[35.7006455,139.5739491]]
The error is raised at lib/my_tools/url_validator.rb:8.
The problem was that url_validator was being called before url_filter.
Before:
def create
  url = params[:url]
  url_validator = MyTools::UrlValidator.new(url)
  if url_validator.validate
    place_data_scraper = MyTools::PlaceDataScraper.new(url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  else
    flash.now[:alert] = 'URLが不正です' # "The URL is invalid"
    render :new
  end
  # The filter ran only after validation, so it came too late to help
  url_filter = MyTools::UrlFilter.new(url)
  url = url_filter.filter
end
After:
def create
  url = params[:url]
  # Filter first so the validator only sees the cleaned-up URL
  url_filter = MyTools::UrlFilter.new(url)
  url = url_filter.filter
  url_validator = MyTools::UrlValidator.new(url)
  if url_validator.validate
    place_data_scraper = MyTools::PlaceDataScraper.new(url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  else
    flash.now[:alert] = 'URLが不正です' # "The URL is invalid"
    render :new
  end
end
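The ordering matters because Ruby's URI.parse rejects the square brackets in the `mv:[[...]]` part of the anchor, which matches the URI::InvalidURIError shown earlier. A minimal sketch of that behaviour, using a shortened URL and a hypothetical helper `host_or_nil` (neither is from the app):

```ruby
require 'uri'

# Hypothetical helper: returns the host, or nil when URI.parse rejects the string.
def host_or_nil(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

# Shortened stand-in URL; the mv values are copied from bad URL ① above.
raw = 'https://www.google.com/search?q=cafe#lrd=0x1:0x2,1,,,&rlfi=hd:;si:123,l,abc' \
      ';mv:[[35.708251499999996,139.5831203],[35.7006455,139.5739491]]'
filtered = raw.split('mv:[[')[0]

p host_or_nil(raw)      # the unfiltered URL is expected to be rejected
p host_or_nil(filtered) # the filtered URL is expected to parse
```

Filtering before validating means URI.parse only ever sees the second form.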
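Which should come first, url_validator or url_filter?
Calling url_filter first works:
if url_validator is called first, the URI.parse() inside url_validator raises an error because the URL still contains the unneeded parts.
Getting it to work with url_validator written first: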
In url_validator.rb, I wrote code that calls URI.parse only when the URL contains neither ';tbs:lrf:' nor 'mv:[[':
def validate
  # Returns nil (falsy) when the URL still contains either marker
  unless @url.include?(';tbs:lrf:') || @url.include?('mv:[[')
    uri = URI.parse(@url)
    uri.host == 'www.google.com'
  end
end
if url.include?('www.google.com')
  url = url_filter.filter
  place_data_scraper = MyTools::PlaceDataScraper.new(url)
  place = place_data_scraper.save_place
  place_data_scraper.save_review(place.id)
  check_credibility = MyTools::CheckCredibility.new(place.id)
  @result = check_credibility.credibility
end
if url.exclude?('www.google.com') && url_validator.validate
  # Even Google URLs with the odd extra parts get passed to URI.parse by url_validator, which makes it choke
  # We only want to run url_validator.validate when the host is not Google
  place_data_scraper = MyTools::PlaceDataScraper.new(url)
  place = place_data_scraper.save_place
  place_data_scraper.save_review(place.id)
  check_credibility = MyTools::CheckCredibility.new(place.id)
  @result = check_credibility.credibility
elsif url.exclude?('www.google.com') && !url_validator.validate
  flash.now[:alert] = 'URLが不正です' # "The URL is invalid"
  render :new
end
Full create method (for reference, in case I want to revert later)
def create
  url = params[:url]
  url_filter = MyTools::UrlFilter.new(url)
  url_validator = MyTools::UrlValidator.new(url)
  if url.include?('www.google.com')
    url = url_filter.filter
    place_data_scraper = MyTools::PlaceDataScraper.new(url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  end
  if url.exclude?('www.google.com') && url_validator.validate
    # Even Google URLs with the odd extra parts get passed to URI.parse by url_validator, which makes it choke
    # We only want to run url_validator.validate when the host is not Google
    place_data_scraper = MyTools::PlaceDataScraper.new(url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  elsif url.exclude?('www.google.com') && !url_validator.validate
    flash.now[:alert] = 'URLが不正です' # "The URL is invalid"
    render :new
  end
end
【create method in results_controller.rb】
def create
  @url = params[:url]
  if @url.include?('www.google.com')
    url_filter = MyTools::UrlFilter.new(@url)
    @url = url_filter.filter
  end
  url_validator = MyTools::UrlValidator.new(@url)
  if url_validator.validate
    @url = url_validator.validate
    place_data_scraper = MyTools::PlaceDataScraper.new(@url)
    place = place_data_scraper.save_place
    place_data_scraper.save_review(place.id)
    check_credibility = MyTools::CheckCredibility.new(place.id)
    @result = check_credibility.credibility
  elsif @url.exclude?('www.google.com') && !url_validator.validate
    # Uses @url here: the local variable url is not defined in this version
    flash.now[:alert] = 'URLが不正です' # "The URL is invalid"
    render :new
  end
end
【validate method in url_validator.rb】
<Before>
def validate
  uri = URI.parse(@url)
  uri.host == 'www.google.com' ? true : false
end
<After>
def validate
  if @url.include?('www.google.com')
    return @url
  else
    false
  end
end
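Because `include?('www.google.com')` is true for any string that merely contains that text, one possible alternative is to keep the host comparison but parse only the part before `#`, which is where the `mv:[[` and `;tbs:lrf:` pieces sit in the URLs shown in this thread. This is a sketch under that assumption; `google_search_url?` is a hypothetical name, not part of MyTools.

```ruby
require 'uri'

# Hypothetical check: true only when the host of the URL is www.google.com.
# The anchor (everything from "#") is dropped before parsing, so brackets
# in the anchor cannot trigger URI::InvalidURIError.
def google_search_url?(url)
  without_fragment = url.to_s.split('#', 2).first.to_s
  URI.parse(without_fragment).host == 'www.google.com'
rescue URI::InvalidURIError
  false
end

# Shortened stand-in URLs for illustration.
with_extras = 'https://www.google.com/search?q=cafe#lrd=0x1:0x2,1,,,&rlfi=hd:;si:123' \
              ';mv:[[35.7,139.5],[35.6,139.4]];tbs:lrf:!1m4,lf:1,lf_ui:4'
look_alike = 'https://example.com/?next=www.google.com'

p google_search_url?(with_extras)
p google_search_url?(look_alike)
```

The scraper would still need the filtered URL. This only changes how the validity decision is made.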
With the current controller code, entering a URL that contains none of the unneeded parts does not work:
https://www.google.com/search?q=%E5%90%89%E7%A5%A5%E5%AF%BA+%E3%83%81%E3%83%A3%E3%82%A4%E3%83%86%E3%82%A3%E3%83%BC&rlz=1C5CHFA_enJP939JP939&oq=%E5%90%89%E7%A5%A5%E5%AF%BA%E3%80%80%E3%83%81%E3%83%A3%E3%82%A4%E3%83%86%E3%82%A3%E3%83%BC&aqs=chrome.0.69i59j0.5734j0j9&sourceid=chrome&ie=UTF-8#lrd=0x6018ee482cb92a9b:0x4d6c0ab242091eef,1,,,
def filter
  if @url.include?(';tbs:lrf:') && @url.include?('mv:[[')
    @url.split('mv:[[')[0]
  elsif @url.include?(';tbs:lrf:')
    @url.split(';tbs:lrf:')[0]
  elsif @url.include?('mv:[[')
    @url.split('mv:[[')[0]
  elsif @url.exclude?(';tbs:lrf:') || @url.exclude?('mv:[[')
    # No unneeded parts: return the URL unchanged
    return @url
  end
end
Introduced the following variables to make the code easier to read:
tbs = ';tbs:lrf:'
mv = 'mv:[['
Before:
def filter
  if @url.include?(';tbs:lrf:') && @url.include?('mv:[[')
    @url.split('mv:[[')[0]
  elsif @url.include?(';tbs:lrf:')
    @url.split(';tbs:lrf:')[0]
  elsif @url.include?('mv:[[')
    @url.split('mv:[[')[0]
  elsif @url.exclude?(';tbs:lrf:') || @url.exclude?('mv:[[')
    return @url
  end
end
After:
def filter
  tbs = ';tbs:lrf:'
  mv = 'mv:[['
  if @url.include?(tbs) && @url.include?(mv)
    @url.split(mv)[0]
  elsif @url.include?(tbs)
    @url.split(tbs)[0]
  elsif @url.include?(mv)
    @url.split(mv)[0]
  elsif @url.exclude?(tbs) || @url.exclude?(mv)
    return @url
  end
end
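By the time the last `elsif` is reached, neither marker is present, so that branch always runs and behaves like a plain `else`. A possible further simplification is to cut the URL at whichever marker appears first. The standalone sketch below assumes the same two markers; `MARKERS` and `filter_url` are hypothetical names, and the sample URLs are shortened for illustration.

```ruby
# The two markers used in this thread for the unneeded parts of the URL.
MARKERS = [';tbs:lrf:', 'mv:[['].freeze

# Hypothetical standalone filter: keeps everything before the earliest
# marker, or returns the URL unchanged when no marker is present.
def filter_url(url)
  cut_at = MARKERS.map { |marker| url.index(marker) }.compact.min
  cut_at ? url[0...cut_at] : url
end

base = 'https://www.google.com/search?q=cafe#lrd=0x1,1,,,&rlfi=hd:;si:123'

puts filter_url("#{base};mv:[[1.0,2.0],[3.0,4.0]];tbs:lrf:!1m4,lf:1") # both markers
puts filter_url("#{base};tbs:lrf:!1m4,lf:1")                          # tbs only
puts filter_url(base)                                                 # no markers
```

Like the `split`-based version, cutting at 'mv:[[' leaves the preceding ';' in place, while cutting at ';tbs:lrf:' removes it.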