sanger96 / Happenings_Team-5_UTD_Senior_Design_Project

UTD Senior Design Project; Group Members: Gaurav Sanger, Jonathan Lam, Robert Dohm, Landin Kasti, Charles Eaton

Implement PageScraperService.java addNewEvents() method #53

Closed LKASTI closed 5 months ago

LKASTI commented 5 months ago

The addNewEvents() method returns a list of Event objects created from the data given in the HTML. The Event objects should also contain Location and Appointment, and, if applicable, Club objects nested inside the Event object.

  1. Call scrapePageItems() to update pageItems field.
  2. Iterate through pageItems to retrieve the "Calendar" page URL
  3. Scrape all data needed for an Event, Location, Appointment, and Club from the page. Notes:
    • Location data might or might not be stated on the page
    • Appointment data might or might not be stated on the page
    • Club data might or might not be stated on the page
    • Not yet sure if all objects can be stored nested under the Event object
  4. Return the list of Events created
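
Since Location, Appointment, and Club data may each be missing from the page, one possible way to model the nesting from step 3 is to give the Event optional nested fields. The sketch below is only an illustration under assumptions: the class names `EventSketch`, `Location`, `Appointment`, `Club`, the method `fromScraped`, and all field names are hypothetical stand-ins, not the project's actual entities.

```java
import java.util.Optional;

final class EventSketch {

    // Hypothetical stand-ins for the project's real classes.
    static final class Location {
        final String text;
        Location(String text) { this.text = text; }
    }

    static final class Appointment {
        final String time;
        Appointment(String time) { this.time = time; }
    }

    static final class Club {
        final String name;
        Club(String name) { this.name = name; }
    }

    static final class Event {
        final String title;
        final Optional<Location> location;
        final Optional<Appointment> appointment;
        final Optional<Club> club;

        Event(String title, Optional<Location> location,
              Optional<Appointment> appointment, Optional<Club> club) {
            this.title = title;
            this.location = location;
            this.appointment = appointment;
            this.club = club;
        }
    }

    /** Builds an Event from raw scraped strings; null or blank inputs become empty Optionals. */
    static Event fromScraped(String title, String location, String time, String club) {
        return new Event(
                title,
                clean(location).map(Location::new),
                clean(time).map(Appointment::new),
                clean(club).map(Club::new));
    }

    // Treats null and whitespace-only strings as "not stated on the page".
    private static Optional<String> clean(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String trimmed = raw.trim();
        return trimmed.isEmpty() ? Optional.empty() : Optional.of(trimmed);
    }
}
```

If these classes end up being persisted entities, plain nullable fields may fit better than `Optional` fields; the point here is only that each nested object can be absent independently.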

Links: "This Week" Example "Calendar" Page

Location in "This Week" HTML of Events: <u class="lw lw_event_list">

Jsoup: https://jsoup.org/apidocs/

LKASTI commented 5 months ago

I have started implementing this method; you can view it by pulling my branch 'Landin' from the remote repository. I also made some changes to the scrapePageItems() method that were needed for addNewEvents() to execute properly.

Please be sure to check my branch before making any further changes.

LKASTI commented 5 months ago

Created helper get methods for retrieving data from each event item. Added printing the stack trace to try-catch block for debugging purposes. Created TODO comments for future work.

I still have not tested creating Event objects with nested Location and Appointment objects, and I'm not sure if we would do that in the service file or controller. I assume it would be done in the service file to separate implementation from the controller, from a design perspective. Might have to rethink design, again.

Code was pushed to my branch 'Landin'.

LKASTI commented 5 months ago

Updated with more TODO comments and clarifying comments. Changed the design so that the method now returns a string stating how many events were added, if any. The method will also post and flush new events to the database.
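
As a rough illustration of the "return a string saying how many events were added" design, here is a minimal sketch. It assumes events can be identified by some string key (for example a title or URL); the class `NewEventSummary`, the methods `newKeys` and `describe`, and the message wording are all hypothetical and not taken from the branch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NewEventSummary {

    /** Returns the scraped keys that are not already stored, keeping order and dropping repeats. */
    static List<String> newKeys(List<String> scrapedKeys, Set<String> storedKeys) {
        Set<String> fresh = new LinkedHashSet<>(scrapedKeys);
        fresh.removeAll(storedKeys);
        return new ArrayList<>(fresh);
    }

    /** Builds the status string the method could return after saving. */
    static String describe(int added) {
        if (added <= 0) {
            return "No new events were added.";
        }
        return added + (added == 1 ? " new event was added." : " new events were added.");
    }
}
```

In this sketch the database write itself is left out; the service would save the events behind `newKeys(...)` and then return `describe(newKeys.size())`.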

rwdohm83 commented 5 months ago

I was able to parse out buildings and room numbers from the location string. It works almost all the time, but it will miss the building if it is not surrounded by parentheses. The room number is pretty reliable when there is a building, with some exceptions.

Maybe I can improve the algorithm.

jonathan-jlam commented 5 months ago

I made some strides toward optimizing Robert's parser.

I saw that the parser was iterating through every character in each location, which I thought might be a bit slow. I also saw that it was building the room numbers one digit/period at a time; I think we can probably just greedily take the first string with a period that resembles a room number. These optimizations might make it a bit faster. I also have coverage for a variety of cases, including:

  1. Parentheses contain strings that are the building AND room
  2. Parentheses contain nothing useful
  3. Room and building are not in parentheses

The general methodology behind the approach:

We find something in parentheses: this might be just the name of the building (ECS), the building and room (ECS 2.204), or tokens we do not care about (Galaxy Rooms). If the string in the parentheses is long, then we know we got more than a building. We can check the second token in the parentheses to see if it contains a period, and if so, we guess that we have a building and room combo. If there is no period, we probably got junk in the parentheses.

If we have no room number after the above step, then we need to find the room number somewhere. We look through the location and take the first sufficiently long string with a period and guess that it is the room number. Once we find this, we check to see if we got a building value (we won't have one if the location has no parentheses). If we did not get a building value, then we take the string preceding the room number and guess that it is the building.

Finally, if we still have no building, chances are it might be in the first token, so guess that it is the building.

Note that whenever I say "guess", there is some condition that governs what counts as a good guess.
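
To make the three steps above concrete, here is a minimal sketch of the approach, not the code on the branch. The "good guess" conditions are assumptions of mine: a room is taken to look like digits, a period, digits (for example 2.204), and a building is taken to be a short all-caps code (for example ECS). The class `LocationParser`, the nested `ParsedLocation`, and both regexes are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocationParser {

    /** Result of a parse; either field is empty when no good guess was found. */
    static final class ParsedLocation {
        final Optional<String> building;
        final Optional<String> room;

        ParsedLocation(String building, String room) {
            this.building = Optional.ofNullable(building);
            this.room = Optional.ofNullable(room);
        }
    }

    // Assumed "good guess" conditions; the real rules may differ.
    private static final Pattern PARENS = Pattern.compile("\\(([^)]*)\\)");
    private static final Pattern ROOM = Pattern.compile("\\d+\\.\\d+[A-Za-z]?");
    private static final Pattern BUILDING = Pattern.compile("[A-Z][A-Z0-9]{1,4}");

    static ParsedLocation parse(String location) {
        String building = null;
        String room = null;

        // Step 1: look at the first parenthesized group, if there is one.
        Matcher m = PARENS.matcher(location);
        if (m.find()) {
            String[] inner = m.group(1).trim().split("\\s+");
            if (inner.length == 1 && isBuilding(inner[0])) {
                building = inner[0];                 // e.g. (ECS)
            } else if (inner.length >= 2 && isBuilding(inner[0]) && isRoom(inner[1])) {
                building = inner[0];                 // e.g. (ECS 2.204)
                room = inner[1];
            }
            // Anything else, e.g. (Galaxy Rooms), is treated as junk.
        }

        // Step 2: no room yet, so greedily take the first room-like token.
        if (room == null) {
            String[] tokens = location.replaceAll("[(),]", " ").trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                if (isRoom(tokens[i])) {
                    room = tokens[i];
                    // No building yet: guess the token just before the room.
                    if (building == null && i > 0 && isBuilding(tokens[i - 1])) {
                        building = tokens[i - 1];
                    }
                    break;
                }
            }
        }

        // Step 3: still no building, so try the first token of the location.
        if (building == null) {
            String first = location.trim().split("\\s+")[0];
            if (isBuilding(first)) {
                building = first;
            }
        }
        return new ParsedLocation(building, room);
    }

    private static boolean isRoom(String token) {
        return ROOM.matcher(token).matches();
    }

    private static boolean isBuilding(String token) {
        return BUILDING.matcher(token).matches();
    }
}
```

With these assumed conditions, a made-up location such as "Engineering (ECS 2.204)" would yield building ECS and room 2.204, while "Student Union (Galaxy Rooms)" would yield neither. Working on whitespace-separated tokens and regex matches, rather than character by character, is one way to get the "greedy" behavior described above.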

I find that this works for a lot of cases, but it is still slow. Moving forward, I also think maybe we should just leave out events that are not populated with either a building or a building-room combo: stuff like Virtual Events, or events in Chess Plaza or the Dining Hall. We can discuss.
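
If we do go with leaving those events out, the filter itself could be small. This is only a sketch of the idea: `ScrapedEvent` and `withBuilding` are hypothetical names, and a null `building` is assumed to mean the parser found no building.

```java
import java.util.List;
import java.util.stream.Collectors;

final class LocatedEventFilter {

    // Hypothetical minimal event shape; building and room are null when not found.
    static final class ScrapedEvent {
        final String title;
        final String building;
        final String room;

        ScrapedEvent(String title, String building, String room) {
            this.title = title;
            this.building = building;
            this.room = room;
        }
    }

    /** Keeps events with a building (with or without a room) and drops the rest. */
    static List<ScrapedEvent> withBuilding(List<ScrapedEvent> events) {
        return events.stream()
                .filter(e -> e.building != null)
                .collect(Collectors.toList());
    }
}
```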