stanikol / olx

Web crawler / web scraper micro-service for downloading advertisements from http://olx.ua

code implementation for the Polish olx website #1

Closed: karamucho closed this issue 6 years ago

karamucho commented 6 years ago

Hi! First, congratulations on a very smart piece of code. I would also like to ask for help adapting it to the Polish site, olx.pl. The olx.ua and olx.pl code looks almost identical, and the scraper handles all the data except the most important item: the telephone number. It returns only a zero in place of the phone number. I'm stuck on this problem and am asking for help. Below I paste slightly modified GrabOlx.scala code from lines 62 to 95:

        siteid = Try("""(?:ID ogłoszenia\:\s*)(\d+)""".r.findFirstMatchIn(soup.select("span:matches((Dodane|Dodane z telefonu))").head.text()).get.group(1)).toOption.getOrElse("")
        //
        phoneToken = getPhoneToken(responseBody)
        // Replace the value of the "pt" cookie with the phone token; pass every other cookie through unchanged.
        cookies = getCookies(response).map {
          case c if c.name == "pt" => HttpCookiePair(c.name, phoneToken)
          case c                   => HttpCookiePair(c.name, c.value)
        }
        //
        data = Map(
          "seq" -> counter.get().toString,
          "siteid" -> siteid,
          "brief" -> Try(soup.select("table:contains(Oferta od) > tbody > tr > td").map(_.text).mkString("\n")).toOption.getOrElse(""),
          "head" -> soup.select(".offer-titlebox h1").headOption.map(_.text().trim).getOrElse(""),
          "text" -> soup.select("#textContent").headOption.map(_.text()).getOrElse(""),
          "pubdate" -> soup.select("span:matches((Dodane|Dodane z telefonu))").headOption.map(_.text()).getOrElse(""),
          "url" -> uri,
          "usrid" -> Try(usrid).toOption.getOrElse(""),
          "user" -> Try(soup.select("a:matches((?i)Ogłoszenia użytkownika)").head.attr("href")).getOrElse(""),
          "price" -> soup.select("div.pricelabel strong").headOption.map(_.text()).getOrElse(""),
          "location" -> soup.select(".show-map-link").headOption.map(_.text()).getOrElse(""),
          "username" -> soup.select(".userdetails span").headOption.map(_.text()).getOrElse(""),
          "section" -> Try(soup.select("#breadcrumbTop ul span").map(_.text()).filter(_.nonEmpty).mkString("/")).toOption.getOrElse(""),
          "viewed" -> Try(soup.select("div.pdingtop10:contains(Wyświetleń:) > strong").head.text()).toOption.getOrElse(""),
          "downdate" -> DateTime.now().toString("yyyy-MM-dd HHmmss")
        )
      } yield (data, Cookie(cookies), phoneToken, usrid)
    }
  }

  def downloadPhones(implicit as: ActorSystem, mat: ActorMaterializer, ec: ExecutionContext) =
    Flow[(Map[String, String], Cookie, String, String)].mapAsync(10){ case (data, cookie, phoneToken, usrid) =>
      val phonesUri = s"https://www.olx.pl/ajax/misc/contact/phone/$usrid/?pt=${phoneToken}"

I will be very grateful for any help. Regards, Karamucho

stanikol commented 6 years ago

Hi, It looks like the anti-bot protection system on olx.pl uses the Google reCAPTCHA API. You can try to hack it, or a probably easier solution would be to use Selenium instead of direct Ajax calls. Hope that helps. Sincerely, Stan
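
A minimal sketch of the Selenium route suggested above might look like the following. It is not part of this repository. It assumes the Selenium 4 Java bindings are on the classpath (the `using dep` line is a scala-cli directive, and the version shown is just one 4.x release). The names `PhoneViaBrowser`, `cleanPhone` and `fetchPhone` are hypothetical, and the two locators are left to the caller because the actual olx.pl markup is not shown in this thread. A real browser runs the page's own JavaScript, but it may still be shown a reCAPTCHA challenge, so treat this as a starting point only.

```scala
//> using dep org.seleniumhq.selenium:selenium-java:4.21.0

import java.time.Duration

import org.openqa.selenium.{By, WebDriver}
import org.openqa.selenium.support.ui.{ExpectedConditions, WebDriverWait}

import scala.util.Try

// Hypothetical helper, not part of the olx repository.
object PhoneViaBrowser {

  // Treat an empty string or a bare "0" (the symptom described in this issue)
  // as "no phone number obtained".
  def cleanPhone(raw: String): Option[String] = {
    val trimmed = raw.trim
    if (trimmed.isEmpty || trimmed == "0") None else Some(trimmed)
  }

  // Opens the ad page, clicks the element that reveals the phone number and
  // reads the text of the element that displays it.
  // Both locators are assumptions supplied by the caller; find them by
  // inspecting the live page.
  def fetchPhone(driver: WebDriver, adUrl: String, revealButton: By, phoneText: By): Option[String] =
    Try {
      driver.get(adUrl)
      val wait = new WebDriverWait(driver, Duration.ofSeconds(10))
      wait.until(ExpectedConditions.elementToBeClickable(revealButton)).click()
      wait.until(ExpectedConditions.visibilityOfElementLocated(phoneText)).getText
    }.toOption.flatMap(cleanPhone)
}
```

A caller would create a driver (for example `new ChromeDriver()`), pass the ad URL together with the two locators, and call `driver.quit()` when finished. If the page shows a masked number before the click, the second wait may need to check for changed text rather than mere visibility.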