audit scraping tutorial (and audit the HTML)

hannesdatta commented 1 year ago

Background:

We've built our site so others can learn how to scrape. But we've never actually tried scraping it ourselves!

The purpose of this task is to build a "scraping tutorial" for the site, BUT ALSO revise our HTML templates to make the site "scraping-friendly".

We need to ensure that we cover a range of "identifiers" for getting data from the site. These should include:

- tags (e.g., "h1"),
- class names (e.g., class = 'artist'), and
- attribute-value pairs (e.g., id = 123).
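To make the three kinds of identifiers concrete, here is a minimal sketch with BeautifulSoup. The HTML snippet is made up for illustration (it is not taken from our templates), so the tag names, class names, and ids are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the real templates of the site may look different.
html = """
<h1>Top artists</h1>
<ul>
  <li class="artist">Artist A</li>
  <li class="artist">Artist B</li>
  <li id="123">Song with an id</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# 1) Locate an element by its tag name
heading = soup.find("h1")

# 2) Locate elements by class name
#    (the keyword is class_ because class is a reserved word in Python)
artists = soup.find_all(class_="artist")

# 3) Locate an element by an attribute-value pair
song = soup.find(attrs={"id": "123"})

print(heading.get_text())
print([a.get_text() for a in artists])
print(song.get_text())
```

If the tutorial can use each of these three lookups somewhere on the site, the templates cover the range of identifiers listed above.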

Further, we need to ensure students can extract information (1) from the text content of HTML elements, and (2) from attribute values.
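The difference between the two could be shown roughly as follows. This is only a sketch: the link, its class name, and the data-plays attribute are hypothetical and not taken from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical element; attribute names are assumptions for illustration.
html = '<a class="song" href="/song?id=123" data-plays="42">Example Song</a>'

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="song")

# (1) Information stored in the text of the element
title = link.get_text(strip=True)

# (2) Information stored in attribute values
url = link["href"]                # raises KeyError if the attribute is missing
plays = link.get("data-plays")    # returns None if the attribute is missing
missing = link.get("data-year")   # this attribute does not exist in the snippet

print(title, url, plays, missing)
```

Note that attribute values come back as strings, so a number such as the play count still has to be converted with int() before doing arithmetic on it.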

Deliverable:

- A tutorial in Python that teaches anyone how to scrape our site using BeautifulSoup. As an example, see this tutorial.
- The tutorial can initially be tried out in a Jupyter Notebook. Later on, we will add it directly to our site.
- Feedback on the HTML source code: if you run into "weird" things while scraping, or think our HTML templates are not yet good enough, let us know so we can improve them.

Next steps:

- Upon approval of the tutorial, we can put it directly on our site using the article HTML template.
- Another step would be to develop the same tutorial in R.

fleurlemire commented 1 year ago

Hi Hannes,

This is what I have included so far. I cannot add it here since it is a Jupyter notebook. I will send the right version via email (since it looks like images are not working well in the Colab), but I will add a Google Colab link here too: https://colab.research.google.com/drive/1F64Po-c3weJAm_ZrABAQRBJzXS-Qj5y4?usp=sharing

I did not include the recently played or top 10 songs for users and artists, since I can only scrape the table as a whole, but I have code for that if we want to include it later. I did not include song information yet, since that page caused an error. I can try some things and add it if we want; I guess that one is a little more difficult. I also did not add code to save the data as a pandas DataFrame yet, but I can include that if we want.
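For reference, one possible way to go from a whole table to individual rows, and then to a file, is sketched below. The table is a made-up stand-in (the column names and values are assumptions, not the site's real markup), and the sketch writes CSV with the standard library rather than pandas.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table; the real table on the site may be structured differently.
html = """
<table>
  <tr><th>Rank</th><th>Song</th><th>Artist</th></tr>
  <tr><td>1</td><td>Song A</td><td>Artist A</td></tr>
  <tr><td>2</td><td>Song B</td><td>Artist B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column names come from the header cells
header = [th.get_text(strip=True) for th in table.find_all("th")]

# One dictionary per data row; the header row has no <td> cells and is skipped
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(dict(zip(header, cells)))

# Write the rows as CSV (to an in-memory buffer here; a file works the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

The same list of dictionaries can also be passed to pandas with pd.DataFrame(rows) if a DataFrame is preferred over a CSV file.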

I can also remove parts if the tutorial is already too extensive.

hannesdatta commented 1 year ago

Hi @fleurlemire - please commit your work directly to our GitHub repository for this project. You can create a new folder (say: tutorials) in the root directory. Please let me know.

fleurlemire commented 1 year ago

Hi @hannesdatta, when I try to commit, I get an error message saying permission denied.

hannesdatta commented 1 year ago

You should now have push access. Can you try again?

fleurlemire commented 1 year ago

It is added! @hannesdatta

fleurlemire commented 1 year ago

Hi Hannes, I added some extra information, including how to save the data, and uploaded it.

tilburgsciencehub / music-to-scrape

audit scraping tutorial (and audit the HTML) #23