
Web Scraping Fisheries and Ecological Data #81

Closed timcashion closed 7 years ago

timcashion commented 7 years ago

Web Scraping Fisheries and Ecological Data

Intro to Python (very quick intro!)

Python is very similar to R in how you write code. A few methods and functions have different names, but it is mostly the same.


x = "length"
print(len(x)) #len() rather than length()
x = [] #Defining lists is the same
x = {} #Defining dictionaries is the same 
x = "6"
x = list(x)
print(type(x))

Most code below was copied from https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/ and https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Web scraping relies on a few useful packages, and there are more that can help you out depending on what you're doing. I learned to do this in Python, which is similar enough to R that you don't need to worry.

```python
from bs4 import BeautifulSoup
import pandas as pd
import requests
from urllib.request import Request, urlopen
```

Web scraping works by reading web pages in the format they were written in: HTML. By knowing a little HTML you can find the information you want and write a script that extracts those pieces and saves them for later. Web scraping is done one page at a time, but it can be combined with loops or other functions to run over many web pages.
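To make the HTML side concrete, here is a tiny made-up page (not one of the real pages used below) parsed with BeautifulSoup, pulling out the tr/th/td tags we will rely on later:

```python
from bs4 import BeautifulSoup

# A made-up snippet of HTML, just to show the tag structure we will be searching
html = """
<html>
  <body>
    <table>
      <tr><th>Name of the ship</th><td>Example Maru</td></tr>
      <tr><th>Flag</th><td>Japan</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr"):             # each <tr> is one table row
    print(row.th.text, "->", row.td.text)   # header cell and data cell
```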

First let's try with one webpage and look for what we're interested in:

url = "http://maritime-connector.com/ship/mainichi-maru-no35-8631805/"
req = Request(url, headers= {'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read()

#Your url should be a string (i.e., text within quotation marks)
#We then use the BeautifulSoup funtion that takes two arguments: a url, and a parser. 
soup = BeautifulSoup(page, "html.parser")
#soup.prettify
#The prettify method displays the html code of a website with proper indentation 
```python
soup.title
# Finds the title tag of the webpage we're looking at
soup.p
# Finds the first paragraph tag of the webpage we're looking at

soup.find_all('p')
# Returns all data within (and including) the paragraph tags
# Other html tags include td (table data), tr (table row), th (table header)
```

These examples all focus on using the 'tag' of the html. However, the person who wrote the website may also have left clues about where to find certain things.

You can also look for particular elements of the data stored under different classes or tags using BeautifulSoup. Anything with a particular 'class' or 'id' can be extracted in the same way.

#Write code here for extracting by class and id 
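A minimal sketch of what that could look like; the class and id names here are made up, and the real ones depend on the page's html (you can see them by inspecting the page in your browser):

```python
# Hypothetical examples: the class and id names are invented for illustration
soup.find_all("td", class_="vessel-name")          # every <td> with class="vessel-name"
soup.find(id="ship-details")                       # the single element with id="ship-details"
soup.find_all("div", attrs={"class": "summary"})   # the same idea, written with attrs=
```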

The first argument the BeautifulSoup function takes is the downloaded page (the HTML markup), and the second is a parser. A parser is the way BeautifulSoup tries to read the webpage and display it to you. Parsers vary in speed and in how they display the final result. For processing tables, I recommend lxml as it is fast and seems to display the data properly. Try parsing the page with the different parsers to see potential differences in how the data is displayed.


```python
soup = BeautifulSoup(page, "html.parser")
# soup = BeautifulSoup(page, "lxml")
# soup = BeautifulSoup(page, "xml")
# soup = BeautifulSoup(page, "lxml-xml")
# soup = BeautifulSoup(page, "html5lib")
print(soup.prettify())
```

If you take the steps we've just learned and add a for loop, you essentially have the whole method (I'll go through a fuller version later). But how do you know which web pages to search?

You need to find a relationship between the web pages you want and their urls. For example:

https://www.marinetraffic.com/en/ais/details/ships/shipid:5
https://www.marinetraffic.com/en/ais/details/ships/shipid:9
https://www.marinetraffic.com/en/ais/details/ships/shipid:13

I could save these as a list and loop over it, or I could write a script that builds the list for the parts I know change. I know these ship ids run from 1 to about 5,000,000, but not every number corresponds to a ship.

pages = ["http://maritime-connector.com/ship/euskadi-alai-9733480/", "http://maritime-connector.com/ship/christina-7931052/", "http://maritime-connector.com/ship/stian-andre-9677038/"]

OR

```python
pages = []
imos = pd.read_csv("IMOs_For_Insurance")
imos = list(imos["imo"])
for i in imos:
    i = str(i)
    url = "https://www.marinetraffic.com/en/ais/details/ships/shipid:" + i
    pages.append(url)
pages
```
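Alternatively, since the ship ids are just numbers, you could build the list from a numeric range. A minimal sketch, using a small made-up range (you wouldn't want to request millions of pages):

```python
# Hypothetical: try a small slice of ship ids rather than all 1-5,000,000
pages = []
for ship_id in range(1, 101):  # ids 1 through 100, purely for illustration
    pages.append("https://www.marinetraffic.com/en/ais/details/ships/shipid:" + str(ship_id))
```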

```python
full_table = []
```

So now we know which pages we want to visit; let's say we want everything written in a table on each of these pages.

Now we'll write a for loop that runs through each of the web pages we're interested in. After opening each page, it is processed by BeautifulSoup, and we save every table row ('tr' tag) to a temporary 'table'. We then append this temporary 'table' to 'full_table', which at the end will be a list of all these tables. The request is wrapped in a try/except statement: it tries to run our request to the webpage, but if it fails for some reason (bad url, etc.), it just moves on to the next one.

```python
for url in pages:
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        page = urlopen(req).read()
        soup = BeautifulSoup(page, "lxml")
        table = soup.find_all("tr")   # every table row on the page
        full_table.append(table)      # append this page's rows to the master list
    except:
        pass
```

Let's see what our data looks like

```python
full_table[1]
```

This method gets all of the data, but the result is messy. You have everything, but you might not know where in the list it is, or how to format it easily. Depending on how the data looks, you can run different code here to split the strings into the pieces you want.

```python
string = str(full_table[1])
string = string.replace("\n", "")    # strip newlines
string = string.replace("<td>", "")  # strip opening table-data tags
rows = string.split("<tr>")          # one piece per table row
print(rows)
```

You could then nest these within a for loop to run over the full list of tables in your full_table.

```python
new_table = []
for table in full_table:
    string = str(table)
    string = string.replace("\n", "")
    string = string.replace("<td>", "")
    rows = string.split("<tr>")
    new_table.append(rows)  # append the cleaned rows, not the raw table
```

So, I've started saving the data to dictionary entries as I go. Before saving the data, I create a new blank entry called 'entry'. I define all the fields I want it to have, and set their initial values to None (the Python equivalent of NULL).

```python
entry = {
    'name': None,
    'imo': None,
    'flag': None,
    'port': None,
    'year_built': None,
    'shipyard': None,
    'gross_tonnage': None,
    'company': None,
    'managing_company': None,
    'class_society': None
}
```

Then I read the html to find what signifies that this element is coming next, and the table header (th) gives an indication of the type of data. Thus, when the table header matches something, I know I want that data and what kind it is. The method below converts the table header of each element in our temporary 'table' to a string, then searches within that string (using the .find method) for a particular piece of text. The find method returns -1 if the text was not found, and the index where it was found otherwise. Writing these as if statements means they only execute when the correct text is found.

```python
print(table[1])        # one full table row
print(table[1].th)     # its header cell
print(table[2].td)     # the data cell of another row
print(str(table[1].th).find("Name of the ship"))  # -1 if not found, an index if found
if str(table[1].th).find("Name of the ship") > 0:
    # Do something, e.g. grab the matching data cell
    entry["name"] = table[1].td
```

```python
for url in pages:
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        page = urlopen(req).read()
        soup = BeautifulSoup(page, "lxml")
        table = soup.find_all("tr")
        entry = {
            'name': None,
            'imo': None,
            'flag': None,
            'port': None,
            'year_built': None,
            'shipyard': None,
            'gross_tonnage': None,
            'company': None,
            'managing_company': None,
            'class_society': None,
        }
        for b in range(len(table)):
            if str(table[b].th).find("Name of the ship") > 0:
                entry["name"] = table[b].td
            if str(table[b].th).find("IMO number") > 0:
                entry["imo"] = table[b].td
            if str(table[b].th).find("Flag:") > 0:
                entry["flag"] = table[b].td
            if str(table[b].th).find("Home port") > 0:
                entry["port"] = table[b].td
            if str(table[b].th).find("Year of build") > 0:
                entry["year_built"] = table[b].td
            if str(table[b].th).find("Builder") > 0:
                entry["shipyard"] = table[b].td
            if str(table[b].th).find("Gross tonnage") > 0:
                entry["gross_tonnage"] = table[b].td
            if str(table[b].th).find("Manager &amp; owner") > 0:
                entry["company"] = table[b].td
            if str(table[b].th).find("Manager") > 0:
                entry["managing_company"] = table[b].td
            if str(table[b].th).find("Class society") > 0:
                entry["class_society"] = table[b].td
        full_table.append(entry)
    except:
        pass
```
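Since each entry is a dictionary with the same keys, a convenient next step is to hand the whole list to pandas, which turns a list of dictionaries into a table with one column per key. A minimal sketch, assuming full_table has been filled by the loop above (the output filename is made up):

```python
# Build a DataFrame from the list of entry dictionaries
ships = pd.DataFrame(full_table)
print(ships.head())
# Note: the values are still bs4 tags; call .get_text() on them if you want plain strings
# ships.to_csv("ship_details.csv", index=False)  # hypothetical output filename
```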

However, 'for' loops like this are slow (most of the time is spent waiting on the network), and the process can run faster if you batch your request calls. I have tried this with the method below, but I have not found it to be much faster; it's included here in case you are interested. The first part defines the request steps (much like what we've done) inside a function called fetch. You then run a for loop with pool.imap, which takes the fetch function and your list of urls as arguments.


```python
import eventlet

def fetch(url):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        page = urlopen(req).read()
        soup = BeautifulSoup(page, "lxml")
        table = soup.find_all("tr")
        full_table.append(table)
    except:
        pass

pool = eventlet.GreenPool()

for url in pool.imap(fetch, pages):
    print("Got me some data!")
```

Finally, some web developers or data providers have kindly built an API for you to access their data. The Sea Around Us is one of them. APIs function slightly differently with how you request data, but the gist is the same.

You run these through the requests package and its get function. The single argument is the link you want to extract from (unless you're looping over multiple links). You store the result as a response, which tells you whether the website answered your request. You've all seen these before when a website gives you a 404; those are failed responses. You can read the body of the response with the .json() method. JSON stores your values as dictionaries and lists, and usually one of the dictionary entries holds the info you want. For our example, we want the 'data' dictionary entry, and then the 4th item of the list inside it.

```python
# SAU_api =
response = requests.get("http://api.seaaroundus.org/api/v1/eez/marine-trophic-index/?region_id=8")
json_data = response.json()
json_data
value = json_data['data']
data = value[3]
```

Below is a more advanced version that stores all the values and years separately so they can easily be written to a csv file.

```python
# Marine Trophic Index Extraction
# urls_full is assumed to be a list of API urls built like the one above, one per region_id
full_table = []
for url in urls_full:
    response = requests.get(url)
    json_data = response.json()
    value = json_data['data']
    data = value[4]['values']
    years = []
    mti_value = []
    for item in data:
        year = item[0]
        value = item[1]
        years.append(year)
        mti_value.append(value)
    number = url.split("=")
    number = number[1]
    entry = [number, years, mti_value]
    full_table.append(entry)
```
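Since the point is to get these into a csv file, here is a minimal sketch of one way to finish the job with pandas (the column names and filename are just examples):

```python
# Hypothetical output step: one row per (region, year) pair
rows = []
for number, years, mti_value in full_table:
    for year, mti in zip(years, mti_value):
        rows.append({"region_id": number, "year": year, "mti": mti})

pd.DataFrame(rows).to_csv("marine_trophic_index.csv", index=False)  # made-up filename
```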

Now, what you all probably actually want: FishBase. Here's my first attempt from earlier today at getting trophic level from FishBase. It takes a fair bit of searching and there is likely a better way. I stopped when I got to the reference link, TL, and standard error for Pacific herring, but this could be re-run by modifying the id in the link, and with more splitting and indexing into the resulting lists.

url = "http://fishbase.org/Summary/SpeciesSummary.php?ID=1520&AT=pacific+herring"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read()
soup = BeautifulSoup(page, "lxml")
soup.prettify
table = soup.find_all("div")
data = str(table)
data = data.split("Trophic Level")
string = str(data[4])
split = string.split("Start resilience")
tl = split[0]
print(tl)
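```

As a sketch of the "re-run" idea, you could loop over a small set of (id, name) pairs; the dictionary below just repeats the Pacific herring example from above, and the string-splitting is the same guesswork, not a tested recipe:

```python
# Hypothetical species ids and url slugs, for illustration only
species = {"pacific-herring": (1520, "pacific+herring")}

trophic_levels = {}
for name, (fb_id, slug) in species.items():
    url = "http://fishbase.org/Summary/SpeciesSummary.php?ID=" + str(fb_id) + "&AT=" + slug
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(req).read()
    soup = BeautifulSoup(page, "lxml")
    chunks = str(soup.find_all("div")).split("Trophic Level")
    if len(chunks) > 4:
        trophic_levels[name] = chunks[4].split("Start resilience")[0]

print(trophic_levels)
```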