lsiddiqsunny / API-for-Bangladeshi-Medicine

Drug Type is not available in the Medex/medicine.csv file #1

Open sumanengbd opened 2 months ago

sumanengbd commented 2 months ago

I saw that your work is beautiful. Could you please update the medex/medicine.csv file? It would be better to add a new column to the CSV file for the drug type, such as tablet, injection, syrup, etc.

lsiddiqsunny commented 2 months ago

I am not working on it anymore. If you have any data, please submit a PR with the update.

sumanengbd commented 1 month ago

@lsiddiqsunny

Some changes have been made to Medex's servers, and I was able to scrape the updated data. Please find the updated code below:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Initialize lists to store data
medicine_names = []
drug_types = []
strengths = []
companies = []
generics = []

# Number of listing pages to scrape (hard-coded; adjust if the site's page count changes)
total_pages = 793

# Base URL for scraping
base_url = 'https://medex.com.bd/brands?page={}'

# Loop through each listing page, from 1 to total_pages
for page_num in range(1, total_pages + 1):
    print(f"Scraping page {page_num} of {total_pages}...")
    url = base_url.format(page_num)

    # Make a request to the page
    response = requests.get(url, timeout=30)  # time out instead of hanging on a stalled connection

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all medicine listings
        medicines = soup.find_all('a', class_='hoverable-block')

        for medicine in medicines:
            # Get the medicine name
            name = medicine.find('div', class_='col-xs-12 data-row-top').text.strip()
            medicine_names.append(name)

            # Get the drug type from the title attribute of .dosage-icon
            drug_type = medicine.find('img', class_='dosage-icon')['title']
            drug_types.append(drug_type)

            # Get the strength (inside span with class grey-ligten)
            strength = medicine.find('span', class_='grey-ligten').text.strip()
            strengths.append(strength)

            # Get the company name
            company = medicine.find('span', class_='data-row-company').text.strip()
            companies.append(company)

            # Get the generic value (inside .col-xs-12:nth-child(3))
            generic = medicine.select_one('.col-xs-12:nth-child(3)').text.strip()
            generics.append(generic)

    else:
        print(f"Failed to retrieve page {page_num}")

    # Delay to avoid overloading the server
    time.sleep(1)

# Create a dictionary for storing the scraped data
cdata = {
    "Name": medicine_names,
    "Drug Type": drug_types,
    "Strength": strengths,
    "Company": companies,
    "Generic": generics  # Generic info from .col-xs-12:nth-child(3)
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data=cdata)

# Save the DataFrame to a CSV file
# Raw string so the backslashes in the Windows path are not read as escape sequences
df.to_csv(r"D:\sumanengbd.github.io\discharge\images\medicine.csv", header=True, index=False)
print(f"Data from {total_pages} pages saved to 'medicine.csv'.")
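
One fragile point in the script above: each `medicine.find(...)` call returns `None` when a listing lacks that element, so the following `.text` (or `['title']`) lookup raises and the run stops partway through. A minimal sketch of one way around this is below. It builds one record per medicine and substitutes an empty string for a missing field, so the five columns always stay aligned. The helper names `text_or_empty`, `make_row`, and `rows_to_csv` are hypothetical, the sample values are made up for illustration, and `SimpleNamespace` only stands in for a parsed element so the snippet runs without network access.

```python
import csv
import io
from types import SimpleNamespace

# Column order matches the dictionary keys used in the script above.
FIELDNAMES = ["Name", "Drug Type", "Strength", "Company", "Generic"]


def text_or_empty(node):
    """Return the stripped text of a parsed element, or "" if the lookup gave None."""
    return node.text.strip() if node is not None else ""


def make_row(name, drug_type, strength, company, generic):
    """Build one record per medicine so every column gets exactly one value."""
    values = [name, drug_type, strength, company, generic]
    return dict(zip(FIELDNAMES, values))


def rows_to_csv(rows):
    """Serialize the records to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-ins for parsed elements (made-up sample data). With BeautifulSoup these
# would be the results of medicine.find(...), which is None when nothing matches.
name_node = SimpleNamespace(text="  Example Brand  ")
missing_strength_node = None

rows = [
    make_row(
        text_or_empty(name_node),
        "Tablet",
        text_or_empty(missing_strength_node),  # missing field becomes ""
        "Example Pharma Ltd.",
        "Example Generic",
    )
]
csv_text = rows_to_csv(rows)
print(csv_text)
```

In the scraper itself, the same idea would mean wrapping each `find` result in `text_or_empty(...)` and appending a single row per medicine instead of appending to five separate lists. A list of such dictionaries can be passed directly to `pd.DataFrame`.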