sanger96 / Happenings_Team-5_UTD_Senior_Design_Project

UTD Senior Design Project; Group Members: Gaurav Sanger, Jonathan Lam, Robert Dohm, Landin Kasti, Charles Eaton
Update method to scrape location and time data #61

Closed rwdohm83 closed 5 months ago

rwdohm83 commented 5 months ago

Due to changes in the website we are scraping, I am going to update the event detail scraper to be more robust to website changes.

rwdohm83 commented 5 months ago

I modified the scrapePageItems method to extract the title more reliably.

The old title was parsed from the URL, which left out characters such as "&". The new one relies on the scraper that filters out all links and takes the title directly from the link text. This title includes special characters like "&" and capitalizes the appropriate words, making it easier to search for in the event-detail page scraper.

The event detail scraper needs a highly intelligent algorithm to derive location and description information that will stand up to website changes.

I pushed my branch.

rwdohm83 commented 5 months ago

In my opinion, parsing the HTML of the event detail page as a String is the better option, since jsoup relies on searching by classes that will change frequently when the site is changed. If we treat the HTML as one giant string, we can create an intelligent algorithm without jsoup. My idea is to parse out the HTML tags and search for keywords like Details or Location, then find bodies of text that contain relevant information, such as a location. The description is difficult to identify, but if we come across the keyword Description, we know that any longer paragraphs of text following it or nested under it are likely the description.
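A minimal sketch of the keyword idea, using only the standard library. The class and method names (`KeywordTextScraper`, `stripTags`, `textAfterKeyword`) are hypothetical and not taken from the project, and a regex-based tag stripper is only an approximation that malformed HTML, scripts, or comments can break.

```java
import java.util.Optional;

class KeywordTextScraper {
    // Replace every tag with a newline so block boundaries survive as line breaks,
    // then collapse runs of whitespace around those breaks.
    static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]*>", "\n")
                   .replaceAll("[ \\t]+", " ")
                   .replaceAll("\\s*\\n\\s*", "\n")
                   .trim();
    }

    // Find the first line that starts with the keyword and return the text that
    // follows it: either the rest of that line or the next non-empty line.
    static Optional<String> textAfterKeyword(String html, String keyword) {
        String[] lines = stripTags(html).split("\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (!line.toLowerCase().startsWith(keyword.toLowerCase())) {
                continue;
            }
            // Case 1: the value is on the same line, e.g. "Location: SU 2.602".
            String rest = line.substring(keyword.length()).replaceFirst("^\\s*:?\\s*", "");
            if (!rest.isEmpty()) {
                return Optional.of(rest);
            }
            // Case 2: the keyword is a heading and the value is the next block of text.
            for (int j = i + 1; j < lines.length; j++) {
                if (!lines[j].trim().isEmpty()) {
                    return Optional.of(lines[j].trim());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div class=\"x9\"><h3>Location</h3><p>ECSS 2.410</p></div>";
        System.out.println(textAfterKeyword(html, "Location").orElse("(not found)"));
    }
}
```

Because the lookup keys on visible text rather than class names, renaming `x9` in the markup would not affect the result. The trade-off is that any line that merely starts with the keyword also matches.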

rwdohm83 commented 5 months ago

I'm not really worried about computation time for this algorithm because it is intended to be run as a batch process daily.

rwdohm83 commented 5 months ago

I found that with jsoup it is possible to get every paragraph on the page. This is easier than parsing a string.

I have a new idea: apply the location parser to every paragraph to try to extract location data. If a parser for each type of data can be applied to every paragraph, succeeding only when that paragraph contains the data we are seeking, it may work.
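The per-paragraph idea could look roughly like the sketch below. It assumes the paragraph texts have already been extracted into a list of strings (for example with jsoup), and it assumes a location looks like two to four capital letters followed by a floor.room number such as "ECSS 2.410". The names `ParagraphScan`, `parseLocation`, and `firstMatch` are hypothetical, as is the pattern itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphScan {
    // Assumed location shape: building code, whitespace, floor.room (e.g. "ECSS 2.410").
    private static final Pattern LOCATION =
            Pattern.compile("\\b([A-Z]{2,4})\\s+(\\d\\.\\d{3})\\b");

    // Succeeds only when the paragraph contains something shaped like a location.
    static Optional<String> parseLocation(String paragraph) {
        Matcher m = LOCATION.matcher(paragraph);
        return m.find() ? Optional.of(m.group(1) + " " + m.group(2)) : Optional.empty();
    }

    // Run one parser over every paragraph and return the first successful result.
    static <T> Optional<T> firstMatch(List<String> paragraphs,
                                      Function<String, Optional<T>> parser) {
        for (String p : paragraphs) {
            Optional<T> result = parser.apply(p);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
                "Join us for an evening of music.",
                "The event is in ECSS 2.410 at 7 pm.",
                "Free food!");
        System.out.println(firstMatch(paragraphs, ParagraphScan::parseLocation).orElse("(none)"));
    }
}
```

Since `firstMatch` takes the parser as a parameter, the same loop could be reused for other kinds of data (time, description) by passing a different function.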

rwdohm83 commented 5 months ago

Moved the location parser algorithm to its own method, tryParseLocation().

Applied the location parser algorithm to every paragraph. It works pretty well, with the exception of the Japan event, which results in erroneous building and room numbers.

Pushed my branch.

rwdohm83 commented 5 months ago

It may be beneficial to create an isBuilding() method that checks a string against every building on campus.

rwdohm83 commented 5 months ago

Created an isBuilding method to compare a string to every building on campus. Updated the scraper.

Locations are now being parsed correctly.
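For reference, a check of this kind might be sketched as follows. This is an illustration and not the code on the branch: the `BuildingCheck` class is hypothetical, the three building codes are placeholder entries rather than the project's actual list, and a set lookup stands in for comparing against each building one by one.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class BuildingCheck {
    // Placeholder entries for illustration; a real list would cover every campus building.
    private static final Set<String> BUILDINGS =
            new HashSet<>(Arrays.asList("ECSS", "JSOM", "SU"));

    // True only when the candidate, ignoring case and surrounding whitespace,
    // is one of the known building codes.
    static boolean isBuilding(String candidate) {
        if (candidate == null) {
            return false;
        }
        return BUILDINGS.contains(candidate.trim().toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isBuilding("ECSS"));   // known code
        System.out.println(isBuilding("JAPAN"));  // capitalized word that is not a building
    }
}
```

Filtering parser output through a check like this rejects capitalized words that only look like building codes, which is the kind of false match the Japan event produced.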

jonathan-jlam commented 5 months ago

Testing the new implementation, and the scraper seems to miss a lot of easily accessible room numbers that it was getting before. Will continue looking for a solution.

jonathan-jlam commented 5 months ago

Not sure what the problem is. It must have to do with the way paragraphs are being selected, such that the actual location string is being missed.