yrm14 closed this issue 1 month ago
[Possibly GPT-Generated Problem Statement]
The problem is vague. How do you plan to evaluate a dataset that is crawled? Will it have relevance judgements? What will be the contribution apart from crawling the dataset?
Try to answer these questions by tomorrow. In the current state, this problem definition is not acceptable.
Respected Prof. Parth Mehta, my name is Yash Mashru from the MnC batch of 2021, student ID 202103045. I am writing on behalf of my group (the group for Issue 5). We are extremely sorry about this. We would like to discuss our problem statement with you in detail over Google Meet. It would be great if you would allow this. Hoping to hear from you soon.
Regards, Yash Mashru
Kindly email me for making appointments. This is not the right forum to do that.
Title
An Online Document Retrieval and Classification System based on Information Retrieval and Machine Learning Techniques
Team Name
IRFighters
Email
202103045@daiict.ac.in
Team Member 1 Name
Priyesh Tandel
Team Member 1 Id
202101222
Team Member 2 Name
Keertivardhan Goyal
Team Member 2 Id
202103007
Team Member 3 Name
Yash Mashru
Team Member 3 Id
202103045
Team Member 4 Name
Sanchit Satija
Team Member 4 Id
202103054
Category
Reproducibility
Problem Statement
The project aims to develop a web crawler and document classification system using Information Retrieval (IR) and Machine Learning (ML) techniques. The system focuses on retrieving and classifying specific types of documents, such as syllabi, from the web, using seed URLs to guide the crawler and a trained classifier to identify relevant pages.
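The crawl-then-classify loop described above can be sketched as follows. This is a minimal illustration, not the team's implementation: the function `crawl`, the `LinkExtractor` helper, and the `fetch` and `is_syllabus` callables are all hypothetical names introduced here. The page fetcher and the trained classifier are passed in as parameters, and the frontier is a plain breadth-first queue rather than any ordering strategy from the cited papers.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterable, Optional
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[str]],   # hypothetical: returns HTML or None
    is_syllabus: Callable[[str], bool],      # hypothetical: stands in for the trained classifier
    max_pages: int = 100,
) -> list[str]:
    """Breadth-first crawl from the seed URLs; returns URLs classified as syllabi."""
    seeds = list(seeds)
    frontier = deque(seeds)
    seen = set(seeds)
    hits: list[str] = []
    crawled = 0

    while frontier and crawled < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:          # fetch failed or page unavailable
            continue
        crawled += 1

        if is_syllabus(html):
            hits.append(url)

        # Resolve relative links, drop fragments, and queue unseen URLs.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return hits
```

Injecting `fetch` keeps the sketch testable without network access; a real crawler would also need politeness delays and robots.txt handling, which are omitted here.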
Evaluation Strategy
The classifier will be evaluated using precision, recall, and F1-score, with cross-validation on the training data to measure its performance. For overall system evaluation, a random sample of pages classified as syllabi will be manually checked to estimate system precision. Recall will be estimated from the percentage of relevant documents found in a random sample of all crawled pages.
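The metrics named above could be computed roughly as in the sketch below. The function names are hypothetical. The recall estimator in particular is only one plausible reading of the strategy, stated here as an assumption: it scales the relevance rate of the random sample up to the whole crawl to approximate the total number of relevant pages, then divides the estimated number of correct hits by that total.

```python
def precision_recall_f1(y_true, y_pred):
    """Classifier metrics from parallel lists of true and predicted labels (truthy = syllabus)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


def estimate_system_precision(sample_judgements):
    """Fraction of manually checked pages (sampled from those classified as syllabi) that are real syllabi."""
    return sum(1 for j in sample_judgements if j) / len(sample_judgements)


def estimate_system_recall(n_classified, est_precision, n_crawled, sample_relevance_rate):
    """Assumed estimator: estimated true positives / estimated relevant pages in the crawl.

    sample_relevance_rate is the fraction of relevant pages found in a
    random sample of ALL crawled pages.
    """
    est_true_positives = n_classified * est_precision
    est_relevant_total = n_crawled * sample_relevance_rate
    if est_relevant_total == 0:
        return 0.0
    # Sampling noise can push the ratio above 1, so cap it.
    return min(1.0, est_true_positives / est_relevant_total)
```

For example, if 200 pages were classified as syllabi with an estimated precision of 0.75, and 2% of a random sample of 10,000 crawled pages were relevant, this estimator gives about 150 / 200 = 0.75 recall. Both sample-based figures carry sampling error, so reporting the sample sizes alongside them would be sensible.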
Dataset
https://github.com/yrm14/IR-Datasets/blob/main/dataset_IR.csv
Resources
[1] Andrew Luxton-Reilly, Brett A Becker, Yingjun Cao, Roger McDermott, Claudio Mirolo, Andreas Muhling, Andrew Petersen, Kate Sanders, Simon, and Jacqueline Whalley. Developing Assessments to Determine Mastery of Programming Fundamentals. In Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education, ITiCSE '17, page 388, New York, NY, USA, 2017. ACM.
[2] D. Eichmann. The RBSE spider: Balancing effective search against Web load. Computer Networks and ISDN Systems, 27(2):308, 1994.
[3] Allan Heydon and Marc Najork. Mercator: A scalable, extensible Web crawler. World Wide Web, 2:219-229, 1999.
[4] Leonard Richardson. Beautiful Soup Documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/, pages 1-72, 2016.
[5] Junghoo Cho. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7):161-172, 1998.