salimk / Rcrawler

An R web crawler and scraper
http://www.sciencedirect.com/science/article/pii/S2352711017300110

Scrape according to predict() result #11

Closed jacekkotowski closed 7 years ago

jacekkotowski commented 7 years ago

Dear Salim, Dear Mohamed,

Your tool is awesome.

I would like to propose a big feature.

Let's assume we have a corpus of files we already scraped and found more interesting than others. We built a classification model that we can apply to new documents with predict() to get a category, or a yes/no label.
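For concreteness, that workflow might look like the sketch below (toy data and illustrative names only; the real model would be trained on the scraped corpus, here using the tm package and a binomial glm):

library(tm)

docs   <- c("franc mortgage loan trouble", "franc loans hit borrowers",
            "stock market rallies today",  "central bank holds rates")
labels <- factor(c("yes", "yes", "no", "no"))

train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
train_df  <- as.data.frame(as.matrix(train_dtm))
fit <- glm(labels ~ ., data = train_df, family = binomial)

# For a new document, build features over the training vocabulary so the
# columns line up, then threshold the predicted probability into yes/no.
new_doc <- "new relief plan for franc mortgage holders"
new_dtm <- DocumentTermMatrix(VCorpus(VectorSource(new_doc)),
                              control = list(dictionary = Terms(train_dtm)))
p <- predict(fit, newdata = as.data.frame(as.matrix(new_dtm)), type = "response")
p > 0.5  # TRUE = the document falls in the target category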

Could the Rcrawler crawling and scraping functions be extended to accept a model as a parameter, so that they scrape only content that falls into a specific category?

Example: I have collected a set of texts related to individual bad mortgage loans denominated in Swiss francs (a controversial issue in my country), and an equal number of articles on other topics from the same "economy" section.
It would be a marvelous tool if I could tell it to scrape(starting_point = "http://mywebsite/economy", predict_filter = predict(model = my_classification_model)).

Wishing you all the best,

Jacek

salimk commented 7 years ago

Hello Jacek,

We can add this feature. However, we don't know in advance which type of classification model a user will provide, nor which data-mining package was used to build it, and we don't want Rcrawler to depend on other packages for optional features. So, to extend the usability of Rcrawler, we have added a parameter called FUNPageFilter which takes a function as an argument. That function should test whether a web page is eligible, according to your own rules or prediction techniques. First, create a function on your machine; it must take two arguments, url and content, and return TRUE or FALSE.

Mytestfunction <- function(url, content) {
  # Test the content variable using your prediction model.
  # This function must return a logical value: TRUE to collect
  # the page, FALSE to skip it.
  result <- TRUE  # placeholder: replace with your prediction logic
  return(result)
}
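For instance, a filter built on a model like the glm sketched earlier might look like the following (a sketch only: the saved-object file names are hypothetical, and the feature extraction must mirror whatever pipeline the model was trained with):

library(tm)

my_model    <- readRDS("my_classification_model.rds")  # e.g. the glm fit
train_vocab <- readRDS("training_vocabulary.rds")      # character vector of terms

Mytestfunction <- function(url, content) {
  # Build features for this page over the training vocabulary so that
  # predict() sees the columns it expects.
  dtm <- DocumentTermMatrix(VCorpus(VectorSource(content)),
                            control = list(dictionary = train_vocab))
  p <- predict(my_model, newdata = as.data.frame(as.matrix(dtm)),
               type = "response")
  # Collect the page only when the model classifies it as relevant.
  return(p > 0.5)
}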

Finally, you just call Rcrawler as follows (the crawler will evaluate every page with your function before collecting it):

Rcrawler(Website = "http://glofile.com", no_cores = 2, no_conn = 2, FUNPageFilter = Mytestfunction)

Waiting for your review, Salim