The `<div class="ebiLayout_reducewidth">` child in the HTML of this model is not present in the text scraped by `requests` with this code:
```python
from glob import glob
import os
import re

import requests
from bs4 import BeautifulSoup

for model_path in glob(os.path.join('final', '*')):
    # Extract the model ID after the final backslash (Windows-style path)
    model_id = re.search(r'(?<=\\)(.+$)', model_path).group()
    url = f'https://www.ebi.ac.uk/biomodels/{model_id}#Curation'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    with open(f'{model_id}_curation_page.txt', 'w', encoding='utf-8') as out:
        out.write(soup.prettify())
```
The downloaded HTML text is attached: BIOMD0000000001_curation_page.txt.
How can the necessary div, which contains the curation notes, be scraped as well? Does `requests` have a limit on the number of child elements that can be scraped?
I'm not sure I understand the question. The `requests` package can be used for as many URLs as you like, and `soup` can also be queried for as many HTML elements as necessary, e.g., `a = soup.find(id='a'); b = soup.find(id='b'); ...`.
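To illustrate that there is no such limit, here is a small self-contained sketch of repeated `find` calls on one parsed document:

```python
from bs4 import BeautifulSoup

# A toy document; each find() call below is independent, so any number of
# elements can be retrieved from the same soup
soup = BeautifulSoup('<div id="a">1</div><div id="b">2</div>', 'lxml')
a = soup.find(id='a')
b = soup.find(id='b')
print(a.text, b.text)  # prints: 1 2
```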
My question is: how can the `<div class="ebiLayout_reducewidth">` be scraped? This div covers the area of the BioModels page that contains the curation notes, yet it does not appear in the content scraped with `requests`, per the txt file in my previous post. Why is this the case? I will search through StackOverflow for inspiration.
I don't understand. The curator's comments are at the right side of your screen shot. You need to drill down further into the HTML to figure out the element that encapsulates the note.
I found the HTML on the website that contains the curation notes; however, this is not found within the scrape from `requests`. Only the first div at this level of the HTML tree (`class="row"`) is in the `requests` scrape, not the div (`class="ebiLayout_reducewidth"`) that contains the curation notes. The location in the scrape where the `<div class="ebiLayout_reducewidth">` should be contains text which suggests that `requests` is unable to access the content.
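For example, the absence can be confirmed directly on the parsed response (a quick diagnostic, reusing the `soup` object from the script above):

```python
# Prints None because the div is missing from the response requests received,
# even though it is present when the page is rendered in a browser
print(soup.find('div', class_='ebiLayout_reducewidth'))
```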
This error (described here and here) seems to be resolved through a multipart argument in requests. I will investigate this workaround further.
The HTML page that you attached above displays an error. If you open this in your browser, you can clearly see the error.
> Unsupported Media Type
>
> If you are using a web browser
> You may be using an old browser, or a configuration setting may prevent it from asking for HTML content. For the best experience, please use a recent version of Firefox or Chrome. If you continue to encounter the issue after upgrading, please file a bug report including the page you tried to access and the web browser you are using.
>
> If you are using our API
> We only support JSON and XML response types. You can either
> - include an `Accept` header in your request with the mime types you prefer, for instance: `Accept: application/json,application/xml`
> - indicate the preferred mime type using the `format` request parameter: `/endpoint1?format=xml`, `/endpoint2?someParam=foo&format=json`
When you make the request, you need to explicitly specify that the accept type should be `application/html`:

```python
requests.get(..., headers={'accept': 'application/html'})
```
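Putting that together with the earlier loop, a minimal sketch of the corrected download (reusing `model_id` and the output naming from the script above):

```python
import requests
from bs4 import BeautifulSoup

url = f'https://www.ebi.ac.uk/biomodels/{model_id}#Curation'
# Explicitly ask for HTML so the server does not reply with the
# "Unsupported Media Type" error page
response = requests.get(url, headers={'accept': 'application/html'})
soup = BeautifulSoup(response.text, 'lxml')
with open(f'{model_id}_curation_page.txt', 'w', encoding='utf-8') as out:
    out.write(soup.prettify())
```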
Thank you @jonrkarr, this cracked the code.
The curation notes are saved in txt files with a naming schema of `{BioModel ID}_curation_notes`. The following models do not possess curation notes, so no curation-notes file was created for them (a sketch of this check follows the list below):
```
..\temp-biomodels\final\BIOMD0000000014
..\temp-biomodels\final\BIOMD0000000018
..\temp-biomodels\final\BIOMD0000000026
..\temp-biomodels\final\BIOMD0000000029
..\temp-biomodels\final\BIOMD0000000030
..\temp-biomodels\final\BIOMD0000000034
..\temp-biomodels\final\BIOMD0000000038
..\temp-biomodels\final\BIOMD0000000041
..\temp-biomodels\final\BIOMD0000000049
..\temp-biomodels\final\BIOMD0000000070
..\temp-biomodels\final\BIOMD0000000076
..\temp-biomodels\final\BIOMD0000000095
..\temp-biomodels\final\BIOMD0000000103
```
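For illustration, the presence check could look roughly like this (a sketch; the selector and file naming here are assumptions inferred from this thread, not necessarily the exact script used):

```python
notes_div = soup.find('div', class_='ebiLayout_reducewidth')
# Only write a notes file when the notes div exists and has non-empty text
if notes_div is not None and notes_div.get_text(strip=True):
    with open(f'{model_id}_curation_notes.txt', 'w', encoding='utf-8') as out:
        out.write(notes_div.get_text('\n', strip=True))
```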
@jonrkarr I receive this error `InvalidURL: Failed to parse: data:image` from the `requests.get` line
```python
for img_file in div.find_all('img'):
    image = requests.get(f'http://{img_file["src"]}')
    image.raw.decode_content = True
    with open(os.path.join(directory, f'{model_id}_curation_image.png'), 'wb') as out:
        shutil.copyfileobj(image.raw, out)
```
when scraping the curation images that have this URL syntax: `http://data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAAAukAAAQdCAYAAADjF9Y+...`. The URL is correct, since it renders when I enter it (minus the preceding `http://`) in my browser.
StackOverflow does not seem to have any pertinent posts to resolve the error (this post seems to describe the same error from the same BioModels URL syntax, but no solution is provided). Do you have suggestions?
`http://data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAAAukAAAQdCAYAAADjF9Y+...` means that the source of the image is provided inline, rather than at an external URL. The first part after `data:` indicates the internet media type; in this case it's `image/jpeg`. The next part (`base64`) indicates the encoding. Rather than attempting to retrieve the source from another URL, the content simply needs to be decoded and saved to a file.
This can be done roughly like this:
```python
import base64
import os

for img_file in div.find_all('img'):
    # 'data:image/jpeg;base64,...' -> 'jpeg'
    img_format = img_file['src'].partition('/')[2].partition(';')[0]
    # Decode the base64 payload after the comma into raw image bytes
    img_data = base64.b64decode(img_file['src'].partition(',')[2])
    img_filename = os.path.join(directory, f'{model_id}_curation_image.{img_format}')
    with open(img_filename, 'wb') as out:
        out.write(img_data)
```
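To make the string slicing concrete, here is what those `partition` calls yield on a short, made-up `src` value:

```python
# Hypothetical inline-image src, truncated for brevity
src = 'data:image/jpeg;base64,/9j/4AAQSkZJRg=='

src.partition('/')[2].partition(';')[0]  # 'jpeg' (the image format)
src.partition(',')[2]                    # '/9j/4AAQSkZJRg==' (the base64 payload)
```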
This page explains the method that BioModels is using to encode images inline into HTML: https://www.w3docs.com/snippets/html/how-to-display-base64-images-in-html.html
Thank you @jonrkarr; your code worked on the first try.
The #125 PR completes the remaining TODO tasks for this issue:
1) Curation notes were scraped to the `original` folder.
2) Curation images were scraped to both the `original` and `final` folders.
3) The `curation_notes_requests` script is imported into the `fix-entries` script.
I think it would be helpful to have this here so it can also be corrected.

Todo
- This should be done by a script, using the `fix_*.py` scripts in the root of this repository as a template
- `original` subdirectory