Open yrm14 opened 6 days ago
The updated proposal is too ambitious and at the same time lacks detail. Comments / questions below:
What seed URLs are you going to use? Ans - We are planning to take URLs from https://www.topuniversities.com/world-university-rankings, which lists the top universities. We will apply web scraping to find their URLs, or collect them manually; either way this will not be very difficult, and we are confident it can be done.
How do you plan to "crawl" data relevant to such a task? We take a URL from the seed list and put it in a priority queue (in the papers this is called the URL frontier), download that document, and extract the links from it; if a link has already been downloaded, we do not download it again.
One more thing: searching the entire web is really complex, so to avoid this we are going with the simplest method, which is prioritizing URLs with fewer slashes (URLs that are more root-like), and also URLs that are more similar to our query, for example those containing keywords like "syllabus" or "programming". Eventually these documents will be identified by a machine learning algorithm, but we need this heuristic to narrow down the search space. Also, if crawling takes too long, we will limit it to a certain number of documents, i.e. stop after k documents. This should not greatly affect recall, because after reordering, the most promising URLs appear near the front of the queue.
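The frontier described above can be sketched as a min-heap keyed by a heuristic score. The scoring weights and keyword list below are illustrative assumptions, not part of the proposal, and `fetch` is a placeholder for the real download/link-extraction step:

```python
import heapq
from urllib.parse import urlparse

# Hypothetical keyword list; the real one would be tuned for the task.
KEYWORDS = ("syllabus", "course", "programming")

def priority(url: str) -> int:
    """Lower value = higher priority: penalize deep paths, reward keyword hits."""
    depth = urlparse(url).path.count("/")            # fewer slashes -> more root-like
    keyword_bonus = sum(k in url.lower() for k in KEYWORDS)
    return depth - 2 * keyword_bonus                 # weights are an assumption

def crawl(seeds, limit=1000, fetch=lambda url: []):
    """Crawl at most `limit` pages; `fetch` stands in for download + link extraction."""
    frontier = [(priority(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch(url):
            if link not in seen:                     # never re-download a page
                seen.add(link)
                heapq.heappush(frontier, (priority(link), link))
    return visited
```

With this ordering, a stop-after-k cutoff discards the lowest-priority tail of the queue, which is what the recall argument above relies on.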
Here is one more paper I found that describes how to assign "importance" to URLs, along with different algorithms for doing so: https://www.researchgate.net/publication/222489237_Efficient_Crawling_Through_URL_Ordering
Relevance judgements for crawled data - Yes sir, you are right that we don't have relevance judgements for the test data (the documents coming from the web), but we were thinking of estimating them manually: for example, if we retrieve 10,000 documents, we take a random sample of 100 of them, check how many are actually syllabi, and measure precision from that. We would do this a few times and average the results. (We tried but could not find a way to automate this task.)
One more thing to clarify: we don't have a single dedicated research paper for the overall idea of the project; we are assembling the nuts and bolts from different papers.
The crawling strategy makes no sense. Limiting the crawl to a fixed number of steps might not even fetch you relevant documents.
As mentioned multiple times, annotating your own dataset with relevance judgements is not allowed. Moreover, you have no idea how many documents are required for proper evaluation; 100 is not enough.
The task is poorly defined. Syllabus webpages will contain the keyword "syllabus" most of the time, making the task too trivial.
There is no use case for such a task. Why would anyone want to train a "classifier" for identifying syllabi?
This project is not worth pursuing. Please make an alternate proposal.
Title
Development of Web Crawler and Document Classification System using Information Retrieval and Machine Learning Models
Team Name
IRFighters
Email
202103045@daiict.ac.in
Team Member 1 Name
Priyesh Tandel
Team Member 1 Id
202101222
Team Member 2 Name
Keertivardhan Goyal
Team Member 2 Id
202103007
Team Member 3 Name
Yash Mashru
Team Member 3 Id
202103045
Team Member 4 Name
Sanchit Satija
Team Member 4 Id
202103054
Category
Reproducibility
Problem Statement
Suppose we are given a list of seed URLs. The crawler systematically parses all the seed links and, following them, looks for pages that are likely to be documents of a specific type (e.g. syllabi, patents, instruction manuals). In our case, we aim to find programming course syllabi.
So basically we will do the following:
Before crawling, we will have already trained a classifier on our training data, which the crawler will then use to classify the test data. We plan to use two types of classifiers, namely Random Forest and Support Vector Machines (SVM).
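A minimal sketch of the training step, assuming the documents are represented as TF-IDF vectors (the proposal does not specify the feature representation, so that is an assumption, and the toy documents below stand in for the real labelled training set):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy labelled data standing in for the real training set (1 = syllabus page).
docs = [
    "CS101 syllabus: introduction to programming, grading policy, schedule",
    "Department news: annual robotics competition results announced",
    "Course outline: data structures, weekly topics, exams, textbooks",
    "Campus map and visitor parking information",
]
labels = [1, 0, 1, 0]

# Two pipelines, one per classifier named in the proposal.
rf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
svm = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

rf.fit(docs, labels)
svm.fit(docs, labels)

# At crawl time, each downloaded page's text would be passed to predict().
pred = rf.predict(["syllabus for programming fundamentals course"])
```

The pipeline keeps the vectorizer and classifier together, so the same vocabulary learned at training time is applied to crawled pages at prediction time.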
Evaluation Strategy
There are two main views of evaluation: 1) Evaluation of the classifier and its training: we will perform cross-validation and compute the average precision and recall across all folds of the classifier's performance on the training data, using the sklearn library.
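The cross-validation step can be sketched as follows; the synthetic dataset is a stand-in for the real vectorized training data, and the fold count of 5 is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized training data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold cross-validation, scoring precision and recall on each fold.
scores = cross_validate(SVC(kernel="linear"), X, y, cv=5,
                        scoring=("precision", "recall"))
avg_precision = scores["test_precision"].mean()
avg_recall = scores["test_recall"].mean()
```

`cross_validate` returns per-fold scores, so averaging over the `test_*` arrays gives exactly the "average precision and recall across all folds" described above.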
2) Evaluation of the system as a whole: this will be very complex, as there may be many documents. So we are planning to take a random sample of a small fraction (approximately 1%) of these documents, manually check which ones are relevant, and estimate precision, recall, and F1 score from that.
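The sampling-based estimate can be sketched as below. Only precision is shown, since recall over the whole web cannot be computed this way; `is_relevant` is a placeholder for the manual judgement step, and the sample fraction, trial count, and seed are assumptions:

```python
import random

def estimate_precision(retrieved_ids, is_relevant, sample_frac=0.01,
                       trials=5, seed=0):
    """Estimate precision by judging small random samples of the retrieved set.

    `is_relevant` stands in for the manual check; in practice a human
    would label each sampled document as syllabus / not syllabus.
    """
    rng = random.Random(seed)
    k = max(1, int(len(retrieved_ids) * sample_frac))
    estimates = []
    for _ in range(trials):
        sample = rng.sample(retrieved_ids, k)
        estimates.append(sum(map(is_relevant, sample)) / k)
    return sum(estimates) / trials
```

Averaging several independent samples, as the text proposes, reduces the variance of the estimate compared to judging a single sample once.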
[Note: We have relevance judgements for our training data]
Dataset
https://drive.google.com/drive/folders/1T-tt23twrIlGCYCX5uybncSwOBH9DDXU?usp=sharing
Resources
For understanding how a web crawler works - Allan Heydon and Marc Najork. Mercator: A scalable, extensible Web crawler. World Wide Web, 2:219-229, 1999.
For understanding how SVM can be used for syllabus classification - https://www.researchgate.net/publication/228798898_Automatic_Syllabus_Classification_using_Support_Vector_Machines