sanger96 / Happenings_Team-5_UTD_Senior_Design_Project

UTD Senior Design Project; Group Members: Gaurav Sanger, Jonathan Lam, Robert Dohm, Landin Kasti, Charles Eaton

Implement PageScraperService.java scrapePageItems() method #51

Closed: LKASTI closed this issue 5 months ago

LKASTI commented 5 months ago

The scrapePageItems() method populates the pageItems field by using Jsoup to iterate over the event list items in the "this week" HTML.

  1. A title is retrieved, then checked against the database.
     1a. If the title already exists, there is no need to add it, so skip to the next iteration.
  2. Otherwise, get the URL of the calendar page containing the event.
  3. Instantiate a PageItem containing the title and URL, and append the object to pageItems (a rough sketch of this flow follows the links below).

Links: "This Week" Example "Calendar" Page

Location in "This Week" HTML of Events: `<ul class="lw lw_event_list">`

Jsoup: https://jsoup.org/apidocs/
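
A rough sketch of that flow is below. Note this is a sketch only: eventRepository.existsByTitle, the PageItem(title, url) constructor, and the CSS selector are placeholder names/assumptions, not the real implementation.

    // Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element
    public void scrapePageItems() throws IOException {
        Document doc = Jsoup.connect(thisWeekURL).get();

        // Step 1: iterate over the event list items in the "this week" HTML
        for (Element item : doc.select("ul.lw_event_list li")) {
            String title = item.text();

            // Step 1a: skip events that already exist in the database
            if (eventRepository.existsByTitle(title)) {
                continue;
            }

            // Step 2: get the URL of the calendar page containing the event
            Element link = item.selectFirst("a[href]");
            if (link == null) {
                continue;
            }

            // Step 3: build the PageItem and append it to pageItems
            pageItems.add(new PageItem(title, link.absUrl("href")));
        }
    }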

rwdohm83 commented 5 months ago

Worked on this for an hour. I found out that I am able to retrieve classes using Jsoup's getElementsByClass starting from the very first class, but once I get to the class called "wprt-container", that is as deep into the nested classes as I can get.

I suspect that this has something to do with the script that is introduced to the webpage at this point. This is what is returned in HTML from a doc.getElementsByClass("wprt-container") call:

    <div id="localist-widget-42145669" class="localist-widget">

As you can see, the JavaScript starts here. From here I think we need to figure out how to retrieve the JavaScript from this source and then scrape the data out of it.

LKASTI commented 5 months ago

@rwdohm83 I see what you're encountering. After spending a while searching through some of the network calls being made on the page, it seems like they are actually just running a script to retrieve data from another site: https://calendar.utdallas.edu/calendar. By viewing just the source code, meaning no script was run, it seems like all event data is listed there under tags of the form `<div class="item event_item vevent" id="event_instance_{some value}">`.

This is the link to view the page without any JS being used: link
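
For example, a quick way to check this with Jsoup (just a sketch, with a selector based on those tags) might be:

    // Fetch the static calendar page (no JS involved) and list the event containers
    Document doc = Jsoup.connect("https://calendar.utdallas.edu/calendar").get();
    for (Element event : doc.select("div.item.event_item.vevent")) {
        System.out.println(event.attr("id") + " -> " + event.text());
    }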

Hope that helps.

rwdohm83 commented 5 months ago

Worked another 2.5 hours on this. I am able to get the first event name and URL, but iterating through the rest has turned out to be a big hairy problem.

It seems that element traversal is done in a tree structure in Jsoup. I'm burned out on coding right now, so I'm going to stop for now.

The current state of the issue is that I can't iterate to the next event and scrape it.
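
For context, this is the kind of tree traversal I mean (just a sketch, with a selector guessed from the calendar page markup):

    // Jsoup models the page as a node tree, so getting from one event to the next
    // means stepping through siblings/parents rather than indexing a flat list
    Element firstEvent = doc.selectFirst("div.item.event_item.vevent");
    Element nextNode = firstEvent.nextElementSibling(); // adjacent node, not necessarily another event
    Element container = firstEvent.parent();            // or move back up the tree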

LKASTI commented 5 months ago

I took a look at your new implementation @rwdohm83, and it works perfectly on my machine. However, I made a few changes that simplify the code and still retrieve the necessary data from the web page. After looking at the output, I observed that many titles were cut off with an ellipsis ('...'), that the full titles were included in the links themselves, and that traversal through the links was not needed. The changes extract the title from the link already found by your code. I also utilized the thisWeekURL field, which will pull the URL from application.properties if it is specified there. Here is the snippet:

application.properties

    website.url = https://calendar.utdallas.edu/calendar

scrapePageItems()

    String testOutput = "";

    try {
        Document doc = Jsoup.connect(thisWeekURL).get();

        // Get every element with the class "summary"; each one wraps a single event link
        Elements eventList = doc.getElementsByClass("summary");

        // Store the URL and name of each event
        for (Element event : eventList)
        {
            Elements links = event.select("a[href]");
            String eventURL = links.attr("href");
            // Strip the base URL prefix (first 36 characters) and turn the slug into a readable title
            String eventName = eventURL.substring(36).replace('_', ' ');

            testOutput = testOutput + eventName + '\n';
            testOutput = testOutput + eventURL + '\n' + "---------" + '\n';
        }
    }
    catch (Exception e) {
        return "Exception thrown: " + e.getMessage();
    }

    return testOutput;
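
For reference, the thisWeekURL field can be wired to that property if the project uses Spring Boot's property injection (an assumption on my part):

    // Assumes Spring; import org.springframework.beans.factory.annotation.Value
    @Value("${website.url}")
    private String thisWeekURL;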

Please let me know if this causes any issues or if there is anything we should discuss.

rwdohm83 commented 5 months ago

Great!

LKASTI commented 5 months ago

Updated the method so that it does not add events that are duplicated on the page or that already exist in the database, keyed by name. A HashSet stores the names of events already known/seen. Also added a stack-trace printout in the catch block.
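
Roughly, the duplicate handling looks like this (a sketch only; the database lookup is shown as a hypothetical eventRepository.existsByTitle call):

    // Track names already seen on this page; HashSet.add() returns false
    // when the name is already present
    Set<String> seenNames = new HashSet<>();

    for (Element event : eventList) {
        String eventURL = event.select("a[href]").attr("href");
        String eventName = eventURL.substring(36).replace('_', ' ');

        // Skip page duplicates and events already stored in the database
        if (!seenNames.add(eventName) || eventRepository.existsByTitle(eventName)) {
            continue;
        }

        pageItems.add(new PageItem(eventName, eventURL));
    }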

LKASTI commented 5 months ago

Closing this issue for now, but optimizations could possibly be added later.

LKASTI commented 5 months ago

Reopening this issue due to the changes made to UTD websites.