titipata / penn-events-calendar

University of Pennsylvania events with search and recommendation engine
http://35.160.123.103/
MIT License

[Data] Scrape list of these events #34

Open titipata opened 5 years ago

titipata commented 5 years ago

The events are available at https://web.sas.upenn.edu/mindcore/events/. Not sure whether the page offers an easy-to-download format.

titipata commented 5 years ago
bluenex commented 5 years ago

For MindCORE, there is a link to download all events in iCal format. Shall we try parsing it with https://pypi.org/project/icalendar/?

titipata commented 5 years ago

@bluenex yes, using icalendar should be proper. Also, we're actually hiring people to find all sources of events now. I will forward you an email later this week.

Tanvvenk commented 5 years ago

10/26/18

titipata commented 5 years ago

Thanks so much @Tanvvenk! You can keep updating the list here!

bluenex commented 5 years ago

Oh my, that is such a big list 😮

titipata commented 5 years ago
titipata commented 5 years ago

@bluenex is there a way to extract calendar from this page https://www.lrsm.upenn.edu/calendar/?

Tanvvenk commented 5 years ago

@bluenex @titipata I tried to scrape the following website: http://go.activecalendar.com/UPennMINS/?view=list2&search=y However, the event url is not in PageSoup. Can someone check if there is a hidden link? Thank you.

Tanvvenk commented 5 years ago

@bluenex @titipata I tried to scrape the following website: https://www.law.upenn.edu/institutes/legalhistory/workshops-lectures.php

However, the href for the event url is different from the actual event_url. They link to the external javascript to get to the event-page. Could someone check this? Thank you.

titipata commented 5 years ago

@bluenex, so they use href in the following format:

<a href="/live/events/50930-legal-history-workshop-intisar-a-rabb">Legal History Workshop: Intisar A. Rabb</a>

Then, they use JavaScript to call the event in the following format: https://www.law.upenn.edu/newsevents/calendar.php#!event_id/50930/view/event. I guess we can extract the event ID and find a way to retrieve the JSON for that ID, as we did previously.
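
As a minimal sketch of that idea, the numeric ID can be pulled out of the href with a regular expression and dropped into the calendar.php pattern quoted above. The function name and the assumption that every href has the shape /live/events/<id>-<slug> are illustrative only and have not been checked against the whole site.

```python
import re

# Assumption: event hrefs look like "/live/events/<numeric id>-<slug>".
EVENT_HREF = re.compile(r"/live/events/(\d+)-")

# Event page pattern quoted in the comment above.
EVENT_PAGE = "https://www.law.upenn.edu/newsevents/calendar.php#!event_id/{}/view/event"


def event_page_url(href):
    """Return the event page URL for an href, or None if no ID is found."""
    match = EVENT_HREF.search(href)
    if match is None:
        return None
    return EVENT_PAGE.format(match.group(1))


print(event_page_url("/live/events/50930-legal-history-workshop-intisar-a-rabb"))
```

Returning None for hrefs that do not match keeps the scraper from building URLs for unrelated links on the same page.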

bluenex commented 5 years ago

@Tanvvenk so basically, you got a tag like @titipata mentioned:

<a href="/live/events/50930-legal-history-workshop-intisar-a-rabb">Legal History Workshop: Intisar A. Rabb</a>

The /live/events/50930-legal-history-workshop-intisar-a-rabb in href is used to request JSON data to fill the event page template. Because we actually need event data, it would be better if we can get JSON data directly (meaning we don't even need to scrape from the HTML).

You can use https://www.law.upenn.edu/newsevents/calendar.php#!event_id/50930/view/event as a pattern to build a link to the event page, but for the event data, check the requests the page makes, where you will see this link:

https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%20priority%3D%22high%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3ELibrary%20Hours%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3ELibrary%20Event%20Private%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3ECareers%20Calendar%20ONLY%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EDocs%20Only%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EPending%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EPrivate%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EOffcampus%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ERegistrar%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3EAdmitted%20JD%3C%2Farg%3E%3Carg%20id%3D%22placeholder%22%3ESearch%20Calendar%3C%2Farg%3E%3Carg%20id%3D%22disable_timezone%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_width%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22modular%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3Eweek%3C%2Farg%3E%3C%2Fwidget%3E

which is the URL-encoded form of a set of XML tags:

https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=<widget type="events_calendar" priority="high">
  <arg id="mini_cal_heat_map">true</arg>
  <arg id="exclude_tag">Library Hours</arg>
  <arg id="exclude_tag">Library Event Private</arg>
  <arg id="exclude_tag">Careers Calendar ONLY</arg>
  <arg id="exclude_tag">Docs Only Event</arg>
  <arg id="exclude_tag">Pending Event</arg>
  <arg id="exclude_tag">Private Event</arg>
  <arg id="exclude_tag">Offcampus</arg>
  <arg id="exclude_group">Registrar</arg>
  <arg id="exclude_group">Admitted JD</arg>
  <arg id="placeholder">Search Calendar</arg>
  <arg id="disable_timezone">true</arg>
  <arg id="thumb_width">144</arg>
  <arg id="thumb_height">144</arg>
  <arg id="modular">true</arg>
  <arg id="default_view">week</arg>
</widget>
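
The decoding step can be reproduced with the standard library alone. The sketch below is only an illustration: the helper name widget_args is made up, and it assumes the syntax query parameter always holds a single well-formed <widget> element like the one shown above.

```python
from urllib.parse import urlsplit, parse_qs
import xml.etree.ElementTree as ET


def widget_args(url):
    """Return the (id, value) pairs of the <arg> tags in the syntax parameter."""
    # parse_qs percent-decodes the query values, so "syntax" comes back as XML.
    query = parse_qs(urlsplit(url).query)
    widget = ET.fromstring(query["syntax"][0])
    return [(arg.get("id"), arg.text) for arg in widget.iter("arg")]


# Shortened version of the request URL, for illustration only.
example = (
    "https://www.law.upenn.edu/live/calendar/view/event/event_id/50930"
    "?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%20"
    "priority%3D%22high%22%3E%3Carg%20id%3D%22exclude_tag%22%3ELibrary%20"
    "Hours%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3Eweek%3C%2Farg%3E"
    "%3C%2Fwidget%3E"
)
print(widget_args(example))
```

Listing the arguments this way makes it easier to compare the widget configuration used by different calendar pages.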

So the conclusion is that we should be able to retrieve event data (for any event that shares the same XML config) from the encoded URL. In other words, you can get the event data as JSON with this Python snippet:

import requests

url = '''
  https://www.law.upenn.edu/live/calendar/view/event/event_id/50930?user_tz=IT&syntax=
  %3Cwidget%20type%3D%22events_calendar%22%20priority%3D%22high%22%3E%3
  Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%
  22exclude_tag%22%3ELibrary%20Hours%3C%2Farg%3E%3Carg%20id%3D%22exclud
  e_tag%22%3ELibrary%20Event%20Private%3C%2Farg%3E%3Carg%20id%3D%22excl
  ude_tag%22%3ECareers%20Calendar%20ONLY%3C%2Farg%3E%3Carg%20id%3D%22ex
  clude_tag%22%3EDocs%20Only%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclud
  e_tag%22%3EPending%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22
  %3EPrivate%20Event%3C%2Farg%3E%3Carg%20id%3D%22exclude_tag%22%3EOffca
  mpus%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ERegistrar%3C%2Far
  g%3E%3Carg%20id%3D%22exclude_group%22%3EAdmitted%20JD%3C%2Farg%3E%3Ca
  rg%20id%3D%22placeholder%22%3ESearch%20Calendar%3C%2Farg%3E%3Carg%20i
  d%3D%22disable_timezone%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_w
  idth%22%3E144%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E144%3C%2F
  arg%3E%3Carg%20id%3D%22modular%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22
  default_view%22%3Eweek%3C%2Farg%3E%3C%2Fwidget%3E
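'''
# The literal above is wrapped for readability, so strip the newlines
# and indentation before making the request.
url = ''.join(url.split())
resp = requests.get(url=url)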
data = resp.json()

data
Out[5]:
{'date': 'Date',
 'title': 'March 17, 2016',
 'event': {'id': 50930,
...

Thank you very much! I will try this out @bluenex

Tanvvenk commented 5 years ago

@bluenex @titipata Is information being drawn from an external or hidden link for the following: https://penntoday.upenn.edu/events

Page.soup is giving unexpected information that doesn't seem to contain event_url href

Thank you

bluenex commented 5 years ago

@Tanvvenk checking the page's XHR requests, it sends a request to this link:

https://penntoday.upenn.edu/events-feed?_format=json

You get JSON from that right away.

titipata commented 5 years ago

Nice, thanks @bluenex!

Extra questions

So, we have a few more URLs to scrape from Med events for which I couldn't get the event ID. These include https://www.gse.upenn.edu/event, https://events.med.upenn.edu/, and https://www.med.upenn.edu/inclusion-and-diversity/Upcoming-Events/. I can get an event's details using the following code, but I cannot use BeautifulSoup to get the event_id:

import requests

event_id = '701084'
text = """https://events.med.upenn.edu/live/calendar/view/event/event_id/{}?user_tz=IT&synta
x=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Fa
rg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%
2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efa
lse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%
3Eday%3C%2Farg%3E%3Carg%20id%3D%22group%22%3ECenter%20for%20Clinical%20Epidemiology%20and%20Bios
tatistics%20%28CCEB%29%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20ar
e%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20fi
le%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdm
in%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22excl
ude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Fa
rg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3
E%3C%2Fwidget%3E""".format(event_id).replace('\n', '').replace(' ', '')
event_json = requests.get(text).json()

bluenex commented 5 years ago

@titipata

https://www.gse.upenn.edu/event

This one is odd in that it doesn't make a request at all. However, all the event data seems to be packed into the HTML file itself. You should be able to get all the event data from <div class="views-row views-row-{n} views-row-odd">, where {n} is the row of the event being shown. Also note that some of the events in the HTML may not be rendered on the website. I still can't figure out what decides whether an event is rendered or not.

<div class="views-row views-row-7 views-row-odd">
  <div class="views-field views-field-nothing-1">
    <span class="field-content">
      <a href="#" title="Add to Calendar" class="addthisevent">
        <img src="http://addthisevent.com/gfx/icon-calendar-t2.png" />
        <span class="_start">2018-12-10 10:30</span>
        <span class="_end">2018-12-10 10:30</span>
        <span class="_zonecode">15</span>
        <span class="_summary"
          >Penn GSE Event: Colloquium Presented by Dr. John Papay</span
        >
        <span class="_description"
          >Dr. John Papay, Assistant Professor of Education at Brown University,
          will present a colloquium titled: Where Teachers Succeed and Improve:
          The Importance of School Context for Teacher Effectiveness and
          Development.
          https://www.gse.upenn.edu/event/colloquium-presented-dr-john-papay</span
        >
        <span class="_location">3700 Walnut St, Room 203</span>
        <span class="_organizer"></span> <span class="_organizer_email"></span>
        <span class="_date_format">DD/MM/YYYY</span>
      </a>
    </span>
  </div>
  <div class="views-field views-field-nothing">
    <span class="field-content">
      <span class="event-month-date">
        <span class="date-display-single">Dec</span>
      </span>
      <span class="event-day-date">
        <span class="date-display-single">10</span>
      </span>
    </span>
  </div>
  <div class="views-field views-field-field-date-4">
    <div class="field-content">
      <span class="date-display-single">1544455800</span>
    </div>
  </div>
  <div class="views-field views-field-title">
    <span class="field-content">
      <a href="/event/colloquium-presented-dr-john-papay">
        <h2>Colloquium Presented by Dr. John Papay</h2>
      </a>
    </span>
  </div>
  <div class="views-field views-field-field-date-5">
    <div class="field-content">
      <span class="glyphicon glyphicon-time"></span>
      <span class="date-display-single">10:30am</span>
    </div>
  </div>
  <div class="views-field views-field-field-location-1">
    <div class="field-content">
      <span class="glyphicon glyphicon-map-marker"></span> 3700 Walnut St, Room
      203
    </div>
  </div>
  <div class="views-field views-field-term-node-tid">
    <span class="field-content">
      <div class="item-list">
        <ul>
          <li class="first">On Campus</li>
          <li>Students</li>
          <li>Faculty</li>
          <li>Staff</li>
          <li class="last">Lecture/Panel/Presentation</li>
        </ul>
      </div>
    </span>
  </div>
  <div class="views-field views-field-field-event-short-description-1">
    <div class="field-content">
      Dr. John Papay, Assistant Professor of Education at Brown University, will
      present a colloquium titled: Where Teachers Succeed and Improve: The
      Importance of School Context for Teacher Effectiveness and Development.
    </div>
  </div>
</div>

The URL of a specific event can be found in the href at:

<div class="views-field views-field-title">
  <span class="field-content">
    <a href="/event/colloquium-presented-dr-john-papay">
      <h2>Colloquium Presented by Dr. John Papay</h2>
    </a>
  </span>
</div>

Still, the page doesn't make a request, so you will need to scrape the HTML anyway.
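
For illustration, here is one way the fields in the markup above could be collected. It is a sketch rather than the project's scraper: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies. The class name AddThisEventParser and the choice of fields are assumptions, as is the idea that each event's /event/... link follows its addthisevent anchor, as it does in the sample row.

```python
from html.parser import HTMLParser

# Span classes inside the "addthisevent" anchor that we want to keep.
FIELDS = {"_start", "_end", "_summary", "_description", "_location"}


class AddThisEventParser(HTMLParser):
    """Collect one dict per <a class="addthisevent"> block."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._current = None  # dict for the event being read, if any
        self._field = None    # field name of the span being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href") or ""
        if tag == "a" and "addthisevent" in classes:
            self._current = {}
        elif tag == "a" and self.events and href.startswith("/event/"):
            # The title link follows the addthisevent anchor in each row.
            self.events[-1].setdefault("href", href)
        elif tag == "span" and self._current is not None:
            matches = FIELDS.intersection(classes)
            self._field = matches.pop() if matches else None

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "a" and self._current is not None:
            # Collapse the whitespace introduced by the HTML formatting.
            cleaned = {k: " ".join(v.split()) for k, v in self._current.items()}
            self.events.append(cleaned)
            self._current = None


sample = """
<div class="views-row views-row-7 views-row-odd">
  <a href="#" title="Add to Calendar" class="addthisevent">
    <span class="_start">2018-12-10 10:30</span>
    <span class="_end">2018-12-10 10:30</span>
    <span class="_summary">Penn GSE Event: Colloquium Presented by
      Dr. John Papay</span>
    <span class="_location">3700 Walnut St, Room 203</span>
    <span class="_organizer"></span>
  </a>
  <a href="/event/colloquium-presented-dr-john-papay"><h2>Colloquium</h2></a>
</div>
"""
parser = AddThisEventParser()
parser.feed(sample)
print(parser.events)
```

The same selection could be written with BeautifulSoup by looking up the addthisevent anchors and reading the text of their child spans.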

https://events.med.upenn.edu/

For this one, I found that clicking on List loads https://events.med.upenn.edu/#!view/all, and then the following request is made (notice ...calendar/view/all?):

https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3E%3C%2Farg%3E%3Carg%20id%3D%22group%22%3E%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20are%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20file%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdmin%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Farg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3E%3C%2Fwidget%3E

The response from the link above is a JSON object containing 50 events out of event_count, starting from today. The keys you will probably care about are events and event_count.

https://www.med.upenn.edu/inclusion-and-diversity/Upcoming-Events/

When we open a link to an event on this page, we are taken to another page on events.med.upenn.edu, namely https://events.med.upenn.edu/inclusion-and-diversity/#!view/all. This page sends a request with the same format but a different value for <arg id="group">, as you can see below:

https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=%3Cwidget%20type%3D%22events_calendar%22%3E%3Carg%20id%3D%22mini_cal_heat_map%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22thumb_width%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22thumb_height%22%3E200%3C%2Farg%3E%3Carg%20id%3D%22hide_repeats%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_groups%22%3Efalse%3C%2Farg%3E%3Carg%20id%3D%22show_tags%22%3Etrue%3C%2Farg%3E%3Carg%20id%3D%22default_view%22%3Eday%3C%2Farg%3E%3Carg%20id%3D%22group%22%3EOffice%20of%20Inclusion%20and%20Diversity%3C%2Farg%3E%3C%21--start%20excluded%20groups--%3E%3C%21--%20if%20you%20are%20changing%20the%20excluded%20groups%2C%20they%20should%20also%20be%20changed%20in%20this%20file%3A%0A%09%2Fpublic_html%2Fpsom-excluded-groups.php--%3E%3Carg%20id%3D%22exclude_group%22%3EAdmin%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Import%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ELiveWhale%3C%2Farg%3E%3Carg%20id%3D%22exclude_group%22%3ETest%20Department%3C%2Farg%3E%3C%21--end%20excluded%20groups--%3E%3Carg%20id%3D%22webcal_feed_links%22%3Etrue%3C%2Farg%3E%3C%2Fwidget%3E

This may be hard to read, so here is the decoded XML:

https://events.med.upenn.edu/live/calendar/view/all?user_tz=IT&syntax=<widget type="events_calendar">
    <arg id="mini_cal_heat_map">true</arg>
    <arg id="thumb_width">200</arg>
    <arg id="thumb_height">200</arg>
    <arg id="hide_repeats">false</arg>
    <arg id="show_groups">false</arg>
    <arg id="show_tags">true</arg>
    <arg id="default_view">day</arg>
    <arg id="group">Office of Inclusion and Diversity</arg>    <!--start excluded groups-->    <!-- if you are changing the excluded groups, they should also be changed in this file:
    /public_html/psom-excluded-groups.php-->
    <arg id="exclude_group">Admin</arg>
    <arg id="exclude_group">Test Import</arg>
    <arg id="exclude_group">LiveWhale</arg>
    <arg id="exclude_group">Test Department</arg>    <!--end excluded groups-->
    <arg id="webcal_feed_links">true</arg>
</widget>

I suspect that https://events.med.upenn.edu/#!view/all doesn't actually show all the events (for instance, each request returns 50 items out of 1086, starting from today). Events from the Office of Inclusion and Diversity are also not among those 50 items; to get them you need to make a request with the argument <arg id="group">Office of Inclusion and Diversity</arg>.

One thing to notice:

Main event page for med.upenn: https://events.med.upenn.edu/#!view/all
Event page for the Office of Inclusion and Diversity: https://events.med.upenn.edu/inclusion-and-diversity/#!view/all

In general the pattern is https://events.med.upenn.edu/{office or institute}/#!view/all, where {office or institute} corresponds to the group argument in the decoded XML, e.g. <arg id="group">Office of Inclusion and Diversity</arg>.
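
To make the role of the group argument concrete, here is a hedged sketch that rebuilds the list request for an arbitrary group. The argument values are copied from the decoded XML above; leaving out the XML comments and the function name group_list_url are assumptions, and whether the server accepts the widget without those comments has not been verified.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

LIST_ENDPOINT = "https://events.med.upenn.edu/live/calendar/view/all"

# Arguments copied from the decoded XML (its comments are left out here).
COMMON_ARGS = [
    ("mini_cal_heat_map", "true"),
    ("thumb_width", "200"),
    ("thumb_height", "200"),
    ("hide_repeats", "false"),
    ("show_groups", "false"),
    ("show_tags", "true"),
    ("default_view", "day"),
]
EXCLUDED_GROUPS = ["Admin", "Test Import", "LiveWhale", "Test Department"]


def group_list_url(group, user_tz="IT"):
    """Build the list-view request URL for one office or institute."""
    args = COMMON_ARGS + [("group", group)]
    args += [("exclude_group", name) for name in EXCLUDED_GROUPS]
    args.append(("webcal_feed_links", "true"))
    xml = '<widget type="events_calendar">'
    xml += "".join(
        '<arg id="{}">{}</arg>'.format(key, escape(value)) for key, value in args
    )
    xml += "</widget>"
    # safe="" percent-encodes everything except unreserved characters,
    # which matches the encoding seen in the captured requests.
    return "{}?user_tz={}&syntax={}".format(LIST_ENDPOINT, user_tz, quote(xml, safe=""))


print(group_list_url("Office of Inclusion and Diversity"))
```

Calling the function once per group name would give one request URL per office or institute, each of which can then be fetched the same way as the single-event URL.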

titipata commented 5 years ago

Adding the scraper for MINS (http://go.activecalendar.com/upennmins)

titipata commented 4 years ago

Additional seminar to be added

titipata commented 4 years ago

help me @bluenex

titipata commented 4 years ago

@bluenex can you help on getting the JSON format from the following sites:

bluenex commented 4 years ago

@titipata fortunately, they are fairly simple. Don't forget to extract the arguments from the request URL (if there are any).

titipata commented 4 years ago

Fix and add the following events

bluenex commented 4 years ago
titipata commented 4 years ago

@bluenex a few more missing departments:

titipata commented 4 years ago