Open sumanengbd opened 2 months ago
I am no longer working on this. If you have updated data, please submit it as a PR.
@lsiddiqsunny
Some changes have been made to Medex's servers, and I was able to scrape the updated data. Please find the updated code below:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
# Initialize lists to store data
medicine_names = []
drug_types = []
strengths = []
companies = []
generics = []
# Number of listing pages to scrape (hardcoded)
total_pages = 793
# Base URL for scraping
base_url = 'https://medex.com.bd/brands?page={}'
# Loop through each listing page
for page_num in range(1, total_pages + 1):
    print(f"Scraping page {page_num} of {total_pages}...")
    url = base_url.format(page_num)
    # Make a request to the page
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        # Find all medicine listings
        medicines = soup.find_all('a', class_='hoverable-block')
        for medicine in medicines:
            # Get the medicine name
            name = medicine.find('div', class_='col-xs-12 data-row-top').text.strip()
            medicine_names.append(name)
            # Get the drug type from the title attribute of .dosage-icon
            drug_type = medicine.find('img', class_='dosage-icon')['title']
            drug_types.append(drug_type)
            # Get the strength (inside span with class grey-ligten)
            strength = medicine.find('span', class_='grey-ligten').text.strip()
            strengths.append(strength)
            # Get the company name
            company = medicine.find('span', class_='data-row-company').text.strip()
            companies.append(company)
            # Get the generic value (inside .col-xs-12:nth-child(3))
            generic = medicine.select_one('.col-xs-12:nth-child(3)').text.strip()
            generics.append(generic)
    else:
        print(f"Failed to retrieve page {page_num}")
    # Delay to avoid overloading the server
    time.sleep(1)
# Create a dictionary for storing the scraped data
cdata = {
    "Name": medicine_names,
    "Drug Type": drug_types,
    "Strength": strengths,
    "Company": companies,
    "Generic": generics,  # Generic info from .col-xs-12:nth-child(3)
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data=cdata)
# Save the DataFrame to a CSV file
# Raw string so the backslashes in the Windows path are not read as escape sequences
df.to_csv(r"D:\sumanengbd.github.io\discharge\images\medicine.csv", header=True, index=False)
print(f"Data from {total_pages} pages saved to 'medicine.csv'.")
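The script above gives up on a page after a single non-200 response, so a transient server error leaves a gap in the CSV. Below is a minimal sketch of a retry wrapper with a growing delay. `fetch_with_retry`, its parameters, and the injected `fetch` callable are hypothetical names and are not part of the script above. Passing `requests.get` as `fetch` is one way it might be wired in.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a 200 response or retries run out.

    `fetch` is any callable returning an object with a `status_code`
    attribute (for example `requests.get`). Returns the response on
    success, or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            response = fetch(url)
        except Exception as exc:  # network errors count as a failed attempt
            print(f"Attempt {attempt + 1} for {url} raised {exc!r}")
            response = None
        if response is not None and response.status_code == 200:
            return response
        # Wait longer after each failure: base_delay, then 2x, then 4x, ...
        if attempt < retries - 1:
            sleep(base_delay * (2 ** attempt))
    return None
```

In the scraping loop, `response = fetch_with_retry(requests.get, url)` followed by an `if response is not None:` check could replace the single `requests.get(url)` call. The retry count and delays here are arbitrary defaults, not values tuned for Medex.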
Your work is beautiful. Could you please update the medex/medicine.csv file? It would also help to add a new column to the CSV file for the drug type, such as tablet, injection, or syrup.
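One way the requested drug type column might be added to an existing CSV is to match its rows against freshly scraped rows. This is only a sketch: the column layout of medex/medicine.csv is not shown in this thread, so the `Name` and `Strength` key columns, the `add_drug_type` helper, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def add_drug_type(existing_rows, scraped_rows, key_fields=("Name", "Strength")):
    """Return copies of existing_rows with a "Drug Type" column filled in.

    Rows are matched on key_fields; rows with no match get an empty string.
    """
    lookup = {
        tuple(row[f] for f in key_fields): row["Drug Type"]
        for row in scraped_rows
    }
    merged = []
    for row in existing_rows:
        key = tuple(row[f] for f in key_fields)
        merged.append({**row, "Drug Type": lookup.get(key, "")})
    return merged


# Tiny in-memory stand-ins for the two CSV files (made-up rows).
existing_csv = "Name,Strength\nExampleA,500 mg\nExampleB,5 ml\n"
scraped_csv = "Name,Drug Type,Strength\nExampleA,Tablet,500 mg\n"

existing = list(csv.DictReader(io.StringIO(existing_csv)))
scraped = list(csv.DictReader(io.StringIO(scraped_csv)))
result = add_drug_type(existing, scraped)

# Write the merged rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Name", "Strength", "Drug Type"])
writer.writeheader()
writer.writerows(result)
print(out.getvalue())
```

Matching on name and strength together is a guess at a usable key, since one brand name can appear in several strengths and dosage forms. Unmatched rows are kept with an empty drug type rather than dropped.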