uprm-inso4101-2024-2025-s1 / semester-project-regiupr

semester-project-regiupr created by GitHub Classroom

Creating a web scraping & parsing tool #81

Closed cristianMarcial closed 1 month ago

cristianMarcial commented 1 month ago

Objective: Create a parser that segregates and concatenates the information obtained from a web page by a web scraper, and places it neatly in a file that can be imported into our application.

Description: Using Python libraries that extract a page's source HTML from a URL, we will tokenize the information about this semester's courses with a parser, and write the code to segregate that information into a separate document that our application can read.
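
A minimal sketch of the fetch step, assuming the requests and beautifulsoup4 libraries (the URL below is a placeholder, not the real catalog page):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real course-catalog address would go here.
CATALOG_URL = "https://example.edu/course-catalog"

def fetch_page(url: str = CATALOG_URL) -> BeautifulSoup:
    """Download the page and return its parsed HTML tree."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly if the page can't be reached
    return BeautifulSoup(response.text, "html.parser")
```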

Requirements:

  1. Extract the HTML from each page with the web scraper.
  2. Separate the source code obtained by the scraper into garbage and useful code.
  3. Divide the useful course markup into tokens (department, professor, sections, schedule, etc.).
  4. Be able to hold this information in a variable (see the sketch after this list).
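
A rough sketch of requirements 2-4, assuming the courses sit in an HTML table with one section per row (the table layout, column order, and field names are assumptions, not the real page structure):

```python
from bs4 import BeautifulSoup

def parse_courses(soup: BeautifulSoup) -> list[dict]:
    """Keep only the course table and tokenize each of its rows."""
    courses = []
    table = soup.find("table")  # everything outside this table is "garbage"
    if table is None:
        return courses
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) < 4:
            continue  # ignore malformed rows
        courses.append({
            "department": cells[0],
            "professor": cells[1],
            "section": cells[2],
            "schedule": cells[3],
        })
    return courses  # requirement 4: the data ends up in one variable
```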

Time Constraints: No later than October 8th.

Completion Criteria: Comply with everything listed in the requirements section.

Difficulty: 6 (1 for creating the web scraper and 5 for creating the parser).

Priority: 4

cristianMarcial commented 1 month ago

Note: I corrected the first paragraph; it said "web scraper" where it should say "parser".

cristianramos9 commented 1 month ago

I think this task can be split into two or three tasks.

  1. Create the web scraping tool: a way to fetch the required web pages and save local copies for when the pages may not be accessible (the local copies can also be used to test that the desired information is being received; whether we keep them may depend on the database). See the sketch below.
  2. Create the parsing tool: access and process the data acquired by the web scraping tool.
  3. Once both tools exist, apply them to extract the desired information and get it ready for use in the database.

What do you think?
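
Here is a quick sketch of the local-copy idea from step 1, assuming a plain file cache (the directory name and cache filenames are placeholders):

```python
import os
import requests

CACHE_DIR = "page_cache"  # placeholder directory for local copies

def fetch_with_cache(url: str, cache_name: str) -> str:
    """Return the page HTML, preferring a saved local copy when one exists."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, cache_name)
    if os.path.exists(path):  # offline / testing path
        with open(path, encoding="utf-8") as f:
            return f.read()
    html = requests.get(url, timeout=10).text
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)  # keep a copy for tests and offline runs
    return html
```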

cristianramos9 commented 1 month ago

Data from each column of a single row can now be extracted. The next two steps: sanitize some of the data, since desired fields are mixed together or carry unnecessary data and need cleaning; then iterate through all rows and save the extracted data to dictionaries.
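
As a sketch of that cleaning step, assuming a cell mixes meeting days and times in one string (the input format shown is an assumption about the page, not taken from it):

```python
import re

def split_schedule(raw: str) -> dict:
    """Separate a mixed 'days + time' cell into its two parts."""
    # e.g. "LWV 10:30AM-11:20AM" -> days "LWV", time "10:30AM-11:20AM"
    match = re.match(r"([A-Za-z]+)\s+(\S+)", raw.strip())
    if match is None:
        return {"days": "", "time": raw.strip()}
    return {"days": match.group(1), "time": match.group(2)}
```

A similar split could peel section codes apart from course names, depending on how the cells actually look.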

cristianMarcial commented 1 month ago

@cristianramos9 thanks for the effort, but I've already implemented the parser using your previous code. The task can be considered complete @Ar2691 @michellebou.

Ar2691 commented 1 month ago

@cristianMarcial it's missing team leader approval.

Ar2691 commented 1 month ago

Also, please show the source code of the implementation as proof.

cristianMarcial commented 1 month ago

@Ar2691 Implementation of section catalog parser | Proof: screenshots 1056-1059 attached.

cristianMarcial commented 1 month ago

The for loop isn't part of the implementation; it's only there to show that the variable "section_catalog" holds the section catalog and can be used by exporter.py.
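
For reference, the check being described could look like this sketch (the shape of "section_catalog" is assumed here, since the actual structure isn't shown in the thread):

```python
# section_catalog would come from the parser; a dummy value stands in here.
section_catalog = [{"section": "INSO4101-001", "schedule": "LWV 10:30AM"}]

# Debug check only, not part of the parser: it confirms the variable holds
# the catalog and can be iterated before exporter.py consumes it.
for section in section_catalog:
    print(section)
```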