Can Rcrawler be used to scrape/crawl bilingual sites based on CSS selectors or XPath?

salimk / Rcrawler

An R web crawler and scraper

http://www.sciencedirect.com/science/article/pii/S2352711017300110

Can Rcrawler be used to scrape/crawl bilingual sites based on CSS selectors or XPath? #21

Closed. mzeidhassan closed this issue 5 years ago

mzeidhassan commented 6 years ago

Hi Rcrawler team,

I am new to R and Rcrawler. I would like to know whether Rcrawler can be used to scrape/crawl bilingual sites. For example, I have this English site: https://government.ae/en and the corresponding Arabic one: https://government.ae/ar-ae

How can I use Rcrawler to get the bitext from them and save the output in a tab-delimited file? Can it crawl only text selected by div tag, CSS selectors, or XPath?

Thanks

salimk commented 5 years ago

Yes, you just have to set the appropriate locale with Sys.setlocale.

For Arabic:

Sys.setlocale("LC_ALL","Arabic")

In this example we extract the titles (all h2 elements):

Rcrawler(Website = "https://government.ae/", no_cores = 4, no_conn = 4, ExtractXpathPat = "//*/h2")
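The original question also asked about pairing the two language versions and saving them as a tab-delimited file, which the call above does not cover. What follows is only a minimal sketch of one way this might be done. It uses the xml2 package directly rather than Rcrawler, and the inline HTML strings, the file name bitext.tsv, and the helper extract_text are all hypothetical stand-ins. It also assumes that both pages share the same layout, so that the n-th match on the English page corresponds to the n-th match on the Arabic page; that assumption should be checked on the real site.

```r
library(xml2)

# Hypothetical inline pages standing in for the English and Arabic versions;
# in practice each would be fetched with read_html("<page URL>").
en_html <- '<html><body><h2>Services</h2><h2>Information</h2></body></html>'
ar_html <- '<html><body><h2>الخدمات</h2><h2>المعلومات</h2></body></html>'

# Hypothetical helper: trimmed text of every node matching an XPath expression.
extract_text <- function(html, xpath) {
  doc <- read_html(html, encoding = "UTF-8")
  xml_text(xml_find_all(doc, xpath), trim = TRUE)
}

en <- extract_text(en_html, "//h2")
ar <- extract_text(ar_html, "//h2")

# Assumption: matches align by position across the two pages.
# Truncate to the shorter side so the columns have equal length.
n <- min(length(en), length(ar))
bitext <- data.frame(en = en[seq_len(n)], ar = ar[seq_len(n)],
                     stringsAsFactors = FALSE)

# Write a tab-delimited UTF-8 file (the file name is a placeholder).
write.table(bitext, file = "bitext.tsv", sep = "\t", quote = FALSE,
            row.names = FALSE, fileEncoding = "UTF-8")
```

To select by CSS instead of XPath, the rvest package offers html_elements(doc, css = "h2"), which could replace the xml_find_all call. Aligning by position is fragile: if one language version has an extra heading, every later pair shifts, so a real pipeline would probably want a sanity check on the match counts.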

salimk commented 5 years ago

Rcrawler v0.1.9 has just been released with a lot of new features. Subscribe to our mailing list to receive the latest updates: http://eepurl.com/dMv_7s