sys-bio / temp-biomodels

Temporary place for coordination of updating existing Biomodels
Creative Commons Zero v1.0 Universal

Add curators notes and images #103

Closed jonrkarr closed 2 years ago

jonrkarr commented 2 years ago

I think it would be helpful to have this here so it can also be corrected

Todo

This should be done by

freiburgermsu commented 2 years ago

The <div class="ebiLayout_reducewidth"> child in the HTML of this model's page is missing from the response text that requests returns for this code:

from bs4 import BeautifulSoup
from glob import glob
import requests
import os

for model_path in glob(os.path.join('final', '*')):
    # The directory name is the BioModels ID, e.g. BIOMD0000000001
    model_id = os.path.basename(model_path)
    url = f'https://www.ebi.ac.uk/biomodels/{model_id}#Curation'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    with open(f'{model_id}_curation_page.txt', 'w', encoding = 'utf-8') as out:
        out.write(soup.prettify())

The downloaded HTML text is attached: BIOMD0000000001_curation_page.txt.

How can the div that contains the curation notes be scraped as well? Does requests limit the number of child elements that it retrieves?

jonrkarr commented 2 years ago

I'm not sure I understand the question.

The requests package can be used for as many URLs as you like.

soup can also be queried for as many HTML elements as necessary. E.g., a = soup.find(id='a'); b = soup.find(id='b'); ....

freiburgermsu commented 2 years ago

My question is how the <div class="ebiLayout_reducewidth"> can be scraped. This div is the area of the BioModels page that contains the curation notes, yet it does not appear in the content scraped with requests, per the txt file from my previous post. Why is this the case?

I will search through StackOverflow for inspiration.

jonrkarr commented 2 years ago

I don't understand. The curator's comments are at the right side of your screen shot. You need to drill down further into the HTML to figure out the element that encapsulates the note.
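
One way to check which elements the downloaded text actually contains is to list the class names of its div elements. The following is a minimal sketch that uses the standard library's html.parser rather than BeautifulSoup, so that it runs without extra packages. The helper names and the sample HTML are made up for illustration and are not real BioModels markup.

```python
from html.parser import HTMLParser


class DivClassCollector(HTMLParser):
    """Record the class names of every <div>, in document order."""

    def __init__(self):
        super().__init__()
        self.div_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # attrs is a list of (name, value) pairs; the value may be None
            class_value = dict(attrs).get("class") or ""
            self.div_classes.extend(class_value.split())


def list_div_classes(html_text):
    """Return the class names found on the <div> elements of html_text."""
    collector = DivClassCollector()
    collector.feed(html_text)
    collector.close()
    return collector.div_classes


# Illustrative stand-in for a downloaded page (hypothetical markup)
sample_html = """
<html><body>
  <div class="row"><p>Header</p></div>
  <div class="ebiLayout_reducewidth"><div class="notes">Curator's comment</div></div>
</body></html>
"""

classes = list_div_classes(sample_html)
print(classes)
print("ebiLayout_reducewidth" in classes)
```

If the class of interest is absent from the list, the element was never in the downloaded text, and the problem lies with the request rather than with the parsing.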

freiburgermsu commented 2 years ago

I found the HTML on the website that contains the curation notes (screenshot: Edelstein1996 - EPSP ACh event _ BioModels - Brave 13-Mar-22 15_24_47); however, it is not found within the scrape from requests. Only the first div at this level in the HTML tree (class="row") is in the requests scrape, not the div (class="ebiLayout_reducewidth") that contains the curation notes (screenshot: div classes).

The location in the scrape where the <div class="ebiLayout_reducewidth"> should be contains text that seems to suggest that requests is unable to access the content (screenshot: possible error notations).

freiburgermsu commented 2 years ago

This error (described here and here) seems to be resolved through a multipart argument in requests. I will investigate this workaround further.

jonrkarr commented 2 years ago

The HTML page that you attached above displays an error. If you open this in your browser, you can clearly see the error.

Unsupported Media Type

If you are using a web browser

You may be using an old browser, or a configuration setting may prevent it from asking for HTML content. For the best experience, please use a recent version of Firefox or Chrome. If you continue to encounter the issue after upgrading, please file a bug report including the page you tried to access and the web browser you are using.

If you are using our API

We only support JSON and XML response types. You can either

  • include an Accept header in your request with the mime types you prefer, for instance: Accept: application/json,application/xml
  • indicate the preferred mime type using the 'format' request parameter: /endpoint1?format=xml /endpoint2?someParam=foo&format=json

When you make the request, you need to explicitly specify that the accept type should be application/html:

requests.get(..., headers={'accept': 'application/html'})
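
As a rough sketch of how this could be wired into a scraping loop, the helpers below build the headers dict and check whether the downloaded text is the error page quoted above. The function names and the two sample strings are hypothetical; only the Accept value and the "Unsupported Media Type" heading come from this thread.

```python
# Heading of the error page quoted earlier in this thread
ERROR_MARKER = "Unsupported Media Type"


def accept_headers(media_type="application/html"):
    """Return the headers dict to pass as requests.get(url, headers=...)."""
    return {"Accept": media_type}


def looks_like_error_page(html_text):
    """True if the text is the error page rather than the model page."""
    return ERROR_MARKER in html_text


# Illustrative strings standing in for response.text (not real responses)
error_text = "<h1>Unsupported Media Type</h1><p>We only support JSON and XML response types.</p>"
model_text = "<div class='ebiLayout_reducewidth'>...</div>"

print(accept_headers())
print(looks_like_error_page(error_text), looks_like_error_page(model_text))
```

With requests, this would be used as response = requests.get(url, headers=accept_headers()), followed by looks_like_error_page(response.text) before the file is written. If the server also sets the matching HTTP status (415), response.raise_for_status() would catch the problem as well, but that is not confirmed in this thread.
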
freiburgermsu commented 2 years ago

Thank you @jonrkarr, this cracked the code.

The curation notes are saved in txt files with a naming schema of {BioModel ID}_curation_notes. The following models do not possess curation notes, and thus no file of curation notes was created for these models:

  • ..\temp-biomodels\final\BIOMD0000000014
  • ..\temp-biomodels\final\BIOMD0000000018
  • ..\temp-biomodels\final\BIOMD0000000026
  • ..\temp-biomodels\final\BIOMD0000000029
  • ..\temp-biomodels\final\BIOMD0000000030
  • ..\temp-biomodels\final\BIOMD0000000034
  • ..\temp-biomodels\final\BIOMD0000000038
  • ..\temp-biomodels\final\BIOMD0000000041
  • ..\temp-biomodels\final\BIOMD0000000049
  • ..\temp-biomodels\final\BIOMD0000000070
  • ..\temp-biomodels\final\BIOMD0000000076
  • ..\temp-biomodels\final\BIOMD0000000095
  • ..\temp-biomodels\final\BIOMD0000000103
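
The save step can be sketched roughly as follows. This is not the script from the PR: the function name, the .txt extension, and the example note text are assumptions made for illustration. Only the {BioModel ID}_curation_notes naming and the rule that models without notes get no file come from the description above.

```python
import os
import tempfile


def save_curation_notes(notes_by_model, directory):
    """Write one text file per model with notes; return the IDs without notes."""
    missing = []
    for model_id, notes in sorted(notes_by_model.items()):
        if not notes or not notes.strip():
            # No notes for this model: record it and create no file
            missing.append(model_id)
            continue
        path = os.path.join(directory, f"{model_id}_curation_notes.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write(notes)
    return missing


# Made-up input: one model with a note, one without
example = {"BIOMD0000000001": "Example note text.", "BIOMD0000000014": None}
with tempfile.TemporaryDirectory() as tmp:
    missing = save_curation_notes(example, tmp)
    written = sorted(os.listdir(tmp))

print(missing)
print(written)
```

Returning the list of models without notes makes it easy to report them, as in the list above.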

#120 contains the requests code and the scraped curation notes.

freiburgermsu commented 2 years ago

@jonrkarr I receive this error InvalidURL: Failed to parse: data:image from the requests.get... line

for img_file in div.find_all('img'):    
    image = requests.get(f'http://{img_file["src"]}')
    image.raw.decode_content = True
    with open(os.path.join(directory, f'{model_id}_curation_image.png'), 'wb') as out:
        shutil.copyfileobj(image.raw, out)

when scraping the curation images that have this URL syntax http:// .... The URL is correct, since it renders when I search the URL (minus the preceding http://) in my browser.

StackOverflow does not seem to have any pertinent posts to resolve the error (this post seems to describe the same error from the same BioModels URL syntax, but no solution is provided). Do you have suggestions?

jonrkarr commented 2 years ago

A src value of the form data:image/jpeg;base64,... means that the source of the image is provided inline, rather than at an external URL. The first part after data: indicates the internet media type. In this case, it's image/jpeg. The next part ('base64') indicates the encoding.

Rather than attempting to retrieve the source from another URL, the content simply needs to be decoded and saved to a file.

This can be done roughly like this:

import base64
import os

for img_file in div.find_all('img'):
    # src looks like data:image/jpeg;base64,<payload>
    img_format = img_file['src'].partition('/')[2].partition(';')[0]
    # Everything after the first comma is the base64-encoded image
    img_data = base64.b64decode(img_file['src'].partition(',')[2])
    img_filename = os.path.join(directory, f'{model_id}_curation_image.{img_format}')
    with open(img_filename, 'wb') as out:
        out.write(img_data)

jonrkarr commented 2 years ago

This page explains the method that BioModels is using to encode images inline into HTML: https://www.w3docs.com/snippets/html/how-to-display-base64-images-in-html.html
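
To make the structure of such a src value concrete, here is a small self-contained sketch that splits a base64 data URI into its media type and decoded bytes. The function name and the sample payload are made up for illustration; the payload is the bytes b"hello" labelled as PNG only to exercise the parsing, not a real image.

```python
import base64


def decode_data_uri(src):
    """Split a base64 data URI into (media_type, raw_bytes)."""
    if not src.startswith("data:"):
        raise ValueError("not a data URI")
    # Layout: data:<media type>[;parameters];base64,<payload>
    header, sep, payload = src[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI has no comma before the payload")
    parts = header.split(";")
    if "base64" not in parts[1:]:
        raise ValueError("only base64-encoded data URIs are handled here")
    return parts[0], base64.b64decode(payload)


# Hypothetical sample value, built here so the example is self-contained
sample_src = "data:image/png;base64," + base64.b64encode(b"hello").decode("ascii")

media_type, data = decode_data_uri(sample_src)
extension = media_type.partition("/")[2]
print(media_type, extension, data)
```

Checking for the data: prefix first also gives a natural place to fall back to a normal download when an image does point at an external URL.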

freiburgermsu commented 2 years ago

Thank you @jonrkarr; your code worked on the first try.

The #125 PR completes the remaining TODO tasks for this issue:

1) Curation notes were scraped to the original folder.
2) Curation images were scraped to both the original and final folders.
3) The curation_notes_requests script is imported to the fix-entries script.