parth126 / IT550

Project Proposals for the IT-550 Course (Autumn 2024)

Development of Web Crawler and Document Classification System using Information Retrieval and Machine Learning Models #27

Open yrm14 opened 6 days ago

yrm14 commented 6 days ago

Title

Development of Web Crawler and Document Classification System using Information Retrieval and Machine Learning Models

Team Name

IRFighters

Email

202103045@daiict.ac.in

Team Member 1 Name

Priyesh Tandel

Team Member 1 Id

202101222

Team Member 2 Name

Keertivardhan Goyal

Team Member 2 Id

202103007

Team Member 3 Name

Yash Mashru

Team Member 3 Id

202103045

Team Member 4 Name

Sanchit Satija

Team Member 4 Id

202103054

Category

Reproducibility

Problem Statement

Suppose we are given a list of seed URLs. The crawler systematically parses the seed pages and, following the links it finds, looks for pages that are likely to be documents of a specific type (e.g. syllabi, patents, instruction manuals). In our case, we aim to find programming course syllabi.

In short, we plan the following:

Before crawling, we will have already trained a classifier on our labelled training data; the crawler will then use it to classify the pages it fetches. We plan to use two classifiers: Random Forest and Support Vector Machines (SVM).
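A minimal sketch of that training step using scikit-learn pipelines; the toy texts, labels, and parameter values below are illustrative stand-ins, not our actual dataset or final settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the labelled training corpus (1 = syllabus, 0 = not)
train_texts = ["CS101 syllabus: programming assignments, grading policy, schedule",
               "Patent application describing an industrial widget"]
train_labels = [1, 0]

# Two candidate pipelines: TF-IDF features feeding each classifier
rf_model = make_pipeline(TfidfVectorizer(stop_words="english"),
                         RandomForestClassifier(n_estimators=200, random_state=0))
svm_model = make_pipeline(TfidfVectorizer(stop_words="english"),
                          LinearSVC())

rf_model.fit(train_texts, train_labels)
svm_model.fit(train_texts, train_labels)

# The crawler can then classify a freshly fetched page:
page_text = "Syllabus for Introduction to Programming: weekly topics and grading"
print(svm_model.predict([page_text]))  # -> [1] if classified as a syllabus
```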

Evaluation Strategy

There are two main views of evaluation: 1) Evaluation of the classifier and its training: we will perform cross-validation on the training data with the scikit-learn library and report the classifier's average precision and recall across all folds.

2) Evaluation of the system as a whole: this is harder because there may be many crawled documents, so we plan to take a random sample of a small fraction (roughly 1%) of them, manually check which are relevant, and estimate precision, recall, and F1 score from that sample (a sketch of both evaluation steps follows the note below).

[Note: We have relevance judgements for our training data]
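A minimal sketch of both evaluation steps; the placeholder training set, fold count, 1% sample rate implementation, and all variable names are illustrative, not fixed parts of the plan:

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# --- 1) Cross-validation of the classifier on the labelled training data ---
# Placeholder corpus; in practice this is our labelled syllabus training set.
train_texts = ["programming course syllabus with grading policy"] * 10 + \
              ["patent filing for an industrial widget"] * 10
train_labels = [1] * 10 + [0] * 10

model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
scores = cross_validate(model, train_texts, train_labels,
                        cv=5, scoring=("precision", "recall"))
print("mean precision:", scores["test_precision"].mean())
print("mean recall:   ", scores["test_recall"].mean())

# --- 2) Sample-based estimate on the crawled documents ---
def estimate_precision(retrieved_docs, judge, sample_frac=0.01, seed=0):
    """Manually judge a random ~1% sample of the pages the classifier
    marked as syllabi and estimate precision from that sample."""
    random.seed(seed)
    k = max(1, int(len(retrieved_docs) * sample_frac))
    sample = random.sample(retrieved_docs, k)
    return sum(1 for doc in sample if judge(doc)) / k

# Stand-ins for the crawled collection and the manual yes/no judgement
crawled = [{"url": f"http://example.edu/page{i}", "truly_syllabus": i % 4 == 0}
           for i in range(10_000)]
print("estimated precision:",
      estimate_precision(crawled, judge=lambda d: d["truly_syllabus"]))
```

Note that a sample of the retrieved documents only estimates precision; estimating recall would additionally require knowing how many relevant pages the crawler missed.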

Dataset

https://drive.google.com/drive/folders/1T-tt23twrIlGCYCX5uybncSwOBH9DDXU?usp=sharing

Resources

For understanding how a web crawler works: Allan Heydon and Marc Najork. Mercator: A scalable, extensible Web crawler. World Wide Web, 2:219–229, 1999.

For understanding how SVMs can be used for syllabus classification: https://www.researchgate.net/publication/228798898_Automatic_Syllabus_Classification_using_Support_Vector_Machines

parth126 commented 4 days ago

The updated proposal is too ambitious and at the same time lacks some details. Comments/questions below:

  1. How do you plan to "crawl" data relevant to such a task? The proposed training data consists of binary labels for whether a document is a syllabus or not; such documents form a very small part of the internet.
  2. What seed URLs do you plan to use?
  3. You mention you have relevance judgements, but it seems you only have binary labels for your training data, while evaluation is supposed to be done on the crawled data. You do not have relevance judgements for those.
  4. The linked paper is garbage and not suitable for the project.
Priyesh2025 commented 2 days ago

What seed URLs are you going to use? Ans: We are planning to take URLs from https://www.topuniversities.com/world-university-rankings, which lists the top universities. We will apply web scraping to extract their URLs, or collect them manually; this will not be very hard, and we guarantee it will be done.
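A rough sketch of that scraping step; the HTML structure of the ranking page and the link filter are assumptions (if the page is rendered client-side, we would fall back to collecting the seed list manually):

```python
import requests
from bs4 import BeautifulSoup

# Illustrative only: pull candidate university links out of the ranking page.
resp = requests.get("https://www.topuniversities.com/world-university-rankings",
                    timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Crude heuristic filter for links that look like university sites
seed_urls = sorted({a["href"] for a in soup.find_all("a", href=True)
                    if a["href"].startswith("http")
                    and any(t in a["href"] for t in (".edu", ".ac.", "university"))})
print(len(seed_urls), seed_urls[:5])
```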

How do you plan to "crawl" data relevant to such a task? We take a URL from the seed list and put it into a priority queue (in the papers this is called the URL frontier), download that document, and extract the links from it; if a link has already been downloaded, we do not download it again.

One more thing: searching the entire web is very complex, so to avoid this we go with the simplest method, prioritising links with fewer slashes (URLs that are more root-like) and links that are more similar to our query, for example by checking whether the URL contains keywords such as "syllabus" or "programming". The final identification of these documents will be performed by a machine learning classifier, but to narrow the search space we need this prioritisation. If crawling takes too long, we will limit it to a certain number of documents, i.e. stop after k documents; this should not hurt recall much because, after priority ordering, the most promising URLs are at the front of the queue.
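A minimal sketch of that frontier-based loop; the priority function, keyword list, libraries (requests/BeautifulSoup), and document limit are illustrative choices on top of what is described above:

```python
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

KEYWORDS = ("syllabus", "course", "programming")

def priority(url):
    """Lower value = higher priority: prefer root-like URLs (fewer slashes)
    and URLs containing query-related keywords."""
    slashes = url.rstrip("/").count("/")
    bonus = -5 if any(k in url.lower() for k in KEYWORDS) else 0
    return slashes + bonus

def crawl(seed_urls, max_docs=1000):
    frontier = [(priority(u), u) for u in seed_urls]   # the "URL frontier"
    heapq.heapify(frontier)
    seen, fetched = set(seed_urls), []
    while frontier and len(fetched) < max_docs:        # stop after max_docs pages
        _, url = heapq.heappop(frontier)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        fetched.append((url, page.text))
        for a in BeautifulSoup(page.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:  # skip seen URLs
                seen.add(link)
                heapq.heappush(frontier, (priority(link), link))
    return fetched
```

The fetched pages would then be handed to the trained classifier to decide which ones are actually syllabi.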

Here is one more paper I found that describes how to assign "importance" to URLs and gives different algorithms for doing so: https://www.researchgate.net/publication/222489237_Efficient_Crawling_Through_URL_Ordering

Relevance judgements for crawled data: yes sir, you are right that we don't have relevance judgements for the test data, i.e. the documents coming from the web. We were thinking of estimating manually: for example, if we retrieve 10,000 documents, we take a random sample of 100, check how many of them are actually syllabi, and measure precision from that, repeating this a few times and averaging. (We tried but could not find a way to automate this task.)

One more thing we need to clarify: we don't have a single dedicated research paper for the overall idea of the project; we are assembling the nuts and bolts from different papers.

parth126 commented 13 hours ago
  1. The crawling strategy makes no sense. Limiting the crawl to a fixed number of documents might not even fetch any relevant ones.

  2. As mentioned multiple times, annotating your own dataset with relevance judgements is not allowed. Plus, you have no idea how many documents are required for proper evaluation; 100 is not enough.

  3. The task is poorly defined. Syllabus webpages will contain the keyword "syllabus" most of the time, making the task too trivial.

  4. There is no use case for such a task. Why would anyone want to train a "classifier" for identifying syllabi?

This project is not worth pursuing. Please make an alternate proposal.