salimk / Rcrawler

An R web crawler and scraper
http://www.sciencedirect.com/science/article/pii/S2352711017300110

Normalization of Relative Links #35

Closed michascholz closed 6 years ago

michascholz commented 6 years ago

Thanks a lot for all the work!

Several websites include links with relative references (e.g., "page-1.html" instead of "http://domain.com/page-1.html"). The LinkNormalization function works fine for absolute links but fails to normalize relative links correctly. Could you please extend the function so that it recognizes relative links and, where necessary, adds not only the protocol but also the base URL?

Best wishes, Michael

salimk commented 6 years ago

Could you be more specific about the link structure you want to normalize? To process relative links, you should also set the "current" argument, which represents the URL of the current web document. Thanks for your feedback; I will try to improve the function in the next release. For now, these are the supported link structures:

links <- c("http://www.twitter.com/share?url=http://glofile.com/page.html",
           "/finance/banks/page-2017.html",
           "./section/subscription.php",
           "//section/",
           "www.glofile.com/home/",
           "glofile.com/sport/foot/page.html",
           "sub.glofile.com/index.php",
           "http://glofile.com/page.html#1",
           "?tags%5B%5D=votingrights&amp;sort=popular")
> LinkNormalization(links, "http://glofile.com")
[1] "http://glofile.com/finance/banks/page-2017.html"
[2] "http://glofile.com/section/subscription.php"
[3] "http://www.glofile.com/home/"
[4] "http://glofile.com/sport/foot/page.html"
[5] "http://sub.glofile.com/index.php"
[6] "http://glofile.com/page.html"
[7] "http://glofile.com?tags%5B%5D=votingrights&amp;sort=popular"
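
For readers who need document-relative links such as "page-1.html" resolved against the page they were found on, here is a minimal base-R sketch of that idea. It is not Rcrawler's implementation: `resolve_link` is a hypothetical helper, the domain.com URLs are made up for illustration, and the logic follows a simplified form of standard relative-reference resolution. Its results can therefore differ from LinkNormalization; for instance, it treats a bare "www.glofile.com/home/" as a relative path rather than as a host.

```r
# Hypothetical helper (not part of Rcrawler): resolve a single link against
# the URL of the document it was found in. Simplified sketch, base R only.
resolve_link <- function(link, current) {
  # Split `current` into scheme, host, and path; its own query/fragment is dropped.
  parts <- regmatches(
    current,
    regexec("^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]+)([^?#]*)", current)
  )[[1]]
  if (length(parts) == 0) stop("`current` must be an absolute URL")
  scheme <- parts[2]
  root   <- paste0(scheme, "://", parts[3])
  path   <- if (nzchar(parts[4])) parts[4] else "/"

  if (grepl("^[A-Za-z][A-Za-z0-9+.-]*://", link)) return(link)   # already absolute
  if (startsWith(link, "//")) return(paste0(scheme, ":", link))  # protocol-relative
  if (startsWith(link, "/"))  return(paste0(root, link))         # root-relative
  if (grepl("^[?#]", link))   return(paste0(root, path, link))   # query or fragment only

  # Document-relative: set the query/fragment aside, then join the remaining
  # path with the directory of the current document.
  suffix   <- sub("^[^?#]*", "", link)
  rel      <- sub("[?#].*$", "", link)
  trailing <- grepl("(^|/)\\.{0,2}$", rel)      # link points at a directory
  segs <- strsplit(paste0(sub("[^/]*$", "", path), rel), "/", fixed = TRUE)[[1]]

  # Drop "." and empty segments; let ".." remove the previous segment.
  out <- character(0)
  for (s in segs) {
    if (s == "" || s == ".") next
    if (s == "..") {
      if (length(out) > 0) out <- out[-length(out)]
      next
    }
    out <- c(out, s)
  }
  new_path <- paste0("/", paste(out, collapse = "/"))
  if (trailing && length(out) > 0) new_path <- paste0(new_path, "/")
  paste0(root, new_path, suffix)
}

# Example: links found on a (made-up) page http://domain.com/news/index.html
links <- c("page-1.html", "../img/logo.png", "/finance/banks/page-2017.html")
resolved <- vapply(links, resolve_link, character(1),
                   current = "http://domain.com/news/index.html",
                   USE.NAMES = FALSE)
print(resolved)
```

The key design point is that the base for a relative link is the directory of the current document, not just the site root, which is why the full URL of the current page has to be supplied.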

salimk commented 5 years ago

Rcrawler v0.1.9 has been released with a lot of features. Subscribe to our mailing list to stay updated: http://eepurl.com/dMv_7s