Closed LKASTI closed 5 months ago
Worked on this for an hour. I found that I can retrieve classes using Jsoup's getElementsByClass starting from the very first class, but once I reach the class called "wprt-container", that is as deep into the nested classes as I can get.
I suspect this has something to do with the script that is injected into the webpage at this point. This is the HTML returned from a doc.getElementsByClass("wprt-container") call:
<div id="localist-widget-42145669" class="localist-widget">
As you can see, the JavaScript starts here. From here, I think we need to figure out how to retrieve the JavaScript from this source and then scrape the data out of it.
@rwdohm83
I see what you're encountering. After spending a while searching through the network calls being made on the page, it seems they are actually just running a script to retrieve data from another site: https://calendar.utdallas.edu/calendar. By viewing just the source code, meaning no script was run, it seems like all of the event data is listed there under these tags:
<div class="item event_item vevent" id="event_instance_{some value}">
This is the link to view the page without any JS being used: link
Hope that helps.
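To sketch what selecting those event containers might look like, here is a minimal, hypothetical example. It parses a static HTML string that only mimics the tags described above (the real page would be fetched with Jsoup.connect instead):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class EventTagDemo {
    // Hypothetical static HTML mimicking the event tags described above.
    static final String HTML =
        "<div class=\"item event_item vevent\" id=\"event_instance_1\">"
      + "<a class=\"summary\" href=\"/event/one\">Event One</a></div>"
      + "<div class=\"item event_item vevent\" id=\"event_instance_2\">"
      + "<a class=\"summary\" href=\"/event/two\">Event Two</a></div>";

    public static int countEvents(String html) {
        Document doc = Jsoup.parse(html);
        // Match divs carrying all three classes from the event container tag
        Elements events = doc.select("div.item.event_item.vevent");
        return events.size();
    }

    public static void main(String[] args) {
        System.out.println(countEvents(HTML)); // prints 2
    }
}
```

The compound class selector (div.item.event_item.vevent) keeps the match specific to the event containers rather than any element that happens to share one of the classes.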
Worked another 2.5 hours on this. I am able to get the first event's name and URL, but iterating through the rest has turned out to be a big, hairy problem.
It seems that element traversal is done in a tree structure in Jsoup. I'm burned out on coding right now, so I'm going to stop for now.
The current state of the issue is that I can't iterate to the next event and scrape it.
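For what it's worth, Jsoup's select(...) returns an Elements collection containing every match in the document, so iterating events shouldn't require manual tree traversal. A minimal sketch against hypothetical static markup (the class names are assumptions based on the tags noted earlier in this thread):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class EventIterationSketch {
    public static List<String> eventNames(String html) {
        List<String> names = new ArrayList<>();
        // select(...) finds every matching element anywhere in the tree,
        // so a simple for-each loop visits all events in document order.
        for (Element event : Jsoup.parse(html).select("div.event_item")) {
            names.add(event.select("a").first().text());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<div class=\"item event_item vevent\"><a href=\"/e/1\">First</a></div>"
                    + "<div class=\"item event_item vevent\"><a href=\"/e/2\">Second</a></div>";
        System.out.println(eventNames(html)); // prints [First, Second]
    }
}
```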
I took a look at your new implementation @rwdohm83, and it works perfectly on my machine. However, I made a few changes that simplify the code and still retrieve the necessary data from the web page. Looking at the output, I observed that many titles were cut off with an ellipsis ('...'), that the full titles were included in the links themselves, and that traversal through the links was not needed. The changes extract the title from the link already found by your code. I also utilized the thisWeekURL field, which will extract the URL if specified in application.properties. Here is the snippet:
application.properties
website.url = https://calendar.utdallas.edu/calendar
scrapePageItems()
String testOutput = "";
try {
    Document doc = Jsoup.connect(thisWeekURL).get();
    // Gets the elements with the class "summary" and all child elements
    Elements eventList = doc.getElementsByClass("summary");
    // Store the URL and name of each event
    for (Element event : eventList) {
        Elements links = event.select("a[href]");
        String eventURL = links.attr("href");
        String eventName = eventURL.substring(36).replace('_', ' ');
        testOutput = testOutput + eventName + '\n';
        testOutput = testOutput + eventURL + '\n' + "---------" + '\n';
    }
}
catch (Exception e) {
    return "Exception thrown: " + e.getMessage();
}
return testOutput;
Please let me know if this creates any issues or anything we should discuss.
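One possible follow-up: the substring(36) call assumes a fixed URL prefix length and would break if the base URL ever changes. A hedged alternative (a hypothetical helper, not part of the code above) derives the name from the URL's last path segment instead:

```java
public class SlugName {
    // Hypothetical helper: derive an event name from the URL's last
    // path segment instead of a hard-coded substring offset.
    public static String nameFromUrl(String eventURL) {
        int slash = eventURL.lastIndexOf('/');
        String slug = (slash >= 0) ? eventURL.substring(slash + 1) : eventURL;
        return slug.replace('_', ' ');
    }

    public static void main(String[] args) {
        System.out.println(nameFromUrl("https://calendar.utdallas.edu/event/Career_Fair"));
        // prints Career Fair
    }
}
```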
Great!
Updated to skip events that are duplicated on the page and events that already exist in the database, keyed by their name. A HashSet is used to store the names of events already known/seen. Also added printing the stack trace in the catch block.
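The deduplication step described above might look roughly like this (a sketch only; newEvents and the String keys are illustrative, and the real code presumably works with event objects rather than bare names):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    public static List<String> newEvents(List<String> scrapedNames, Set<String> knownNames) {
        List<String> added = new ArrayList<>();
        for (String name : scrapedNames) {
            // Set.add returns false when the name was already present,
            // which skips both page duplicates and events already stored.
            if (knownNames.add(name)) {
                added.add(name);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(List.of("Old Event"));
        System.out.println(newEvents(List.of("A", "Old Event", "A", "B"), known));
        // prints [A, B]
    }
}
```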
Closing this issue for now, but optimizations could possibly be added later.
Reopening this issue due to the changes made to UTD websites.
The scrapePageItems() method is used to populate the pageItems field by using Jsoup to iterate over the event list items in the "this week" HTML. For each item, it builds a PageItem containing the title and URL and appends the object to pageItems.
Links: "This Week" Example "Calendar" Page
Location in "This Week" HTML of Events:
<ul class="lw lw_event_list">
Jsoup: https://jsoup.org/apidocs/
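Assuming the event items sit inside that list element, a selector scoped to it might look like the sketch below. The markup is hypothetical and only stands in for the live "This Week" page:

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class ThisWeekSketch {
    public static int countEventLinks(String html) {
        // Scope the selector to the event list so unrelated links
        // elsewhere on the page are ignored.
        Elements items = Jsoup.parse(html).select("ul.lw_event_list li a[href]");
        return items.size();
    }

    public static void main(String[] args) {
        String html = "<ul class=\"lw lw_event_list\">"
                    + "<li><a href=\"/event/a\">Event A</a></li>"
                    + "<li><a href=\"/event/b\">Event B</a></li></ul>";
        System.out.println(countEventLinks(html)); // prints 2
    }
}
```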