k1m0ch1 / BiBiT

the yogyaonline.co.id, alfacart and klikindomaret scrapper
12 stars 2 forks source link

Improve the data without store the duplicate data, only store the "changed" data by price and date #17

Open k1m0ch1 opened 5 months ago

k1m0ch1 commented 5 months ago

Problem: The data is duplicated and makes the size of database is HUGE, with current DB size is 2.4GB

you can see it from here every data always have like 6 times more every day with query select * from prices where items_id='1' AND created_at LIKE '2024-01-15%' image

the prices have the 7 million record

now lets break up by select the data from around 2024-01-15, the query tooks 700millisecond with result around 77290 record

if I do group by the data with items_id the query tooks 50second with record around 16438

this mean the data is grow 470% from original size, this is why the DB is HUGE

This is the common script to store the data

checkIdItem = db.select(TABLE='items', SELECT='id', WHERE=(db['items']['sku'] == item['sku']) | (db['items']['name'] == item['productName']))
            now = datetime.now(pytz.timezone("Asia/Jakarta"))

            if len(checkIdItem) > 0:
                idItem = checkIdItem[0][0]
                db["prices"].insert(shortuuid.uuid(), idItem, item['finalPrice'], "", now.strftime("%Y-%m-%d %H:%M:%S"))
                db["discounts"].insert(shortuuid.uuid(), idItem, item['finalPrice'], item['basePrice'], item['discountPercent'], "", now.strftime("%Y-%m-%d %H:%M:%S"))
            else:
                db["items"].insert(item['productId'], item['sku'], item['productName'], categoryData['currentCategoryName'], item['image'], f"https://alfagift.id/p/{item['productId']}", 'alfagift', now.strftime("%Y-%m-%d %H:%M:%S"))

the problem with this script it will ALWAYS save a new record,

the challenge is if I select and join the quer is it going to be longer ?

if checkIdItem is O1 to resolve the pro

checklist

k1m0ch1 commented 5 months ago

trying this query in local

SELECT
    items.id, 
    items.sku, 
    items.name, 
    prices.price, 
    prices.created_at
FROM
    items
    INNER JOIN
    prices
    ON 
        items.id = prices.items_id
WHERE
    items.id = '1' AND
    prices.created_at LIKE '2024-03-10%'

it tooks around 6 millisecond

same with this query

SELECT
    items.id, 
    items.sku, 
    items.name, 
    discounts.discount_price, 
    discounts.original_price, 
    discounts.created_at
FROM
    items
    INNER JOIN
    discounts
    ON 
        items.id = discounts.items_id
WHERE
    items.id = '1' AND
    discounts.created_at LIKE '2024-03-10%'

it tooks around 5 millisecond, while this query also tooks 3 millisecond

select id from items where items.sku = 'A09130001976' and items.name = 'Sasa Gourmet Powder MSG 250 g'

so the process will added around 11 millisecond, this mean every item will access sqlite with 14 or 15 millisecond, now if the record of items is 33828 this mean 33828 x 15 = 507420 millisecond, 507420 millisecond = 507 second

if usually the process tooks 20 minutes for 3 millisecond, the process might tooks longer twice, around 40 minutes each crawler.

k1m0ch1 commented 5 months ago

image yogya online working like this

here is for alfagift image

the scrapper is longer 10 minutes from normal image

k1m0ch1 commented 5 months ago

dope 🔥 image

k1m0ch1 commented 5 months ago

The actual

alfagift tooks about an hour 4 minutes image

k1m0ch1 commented 5 months ago

image yogya online around less than an hour, wtf awkawk

k1m0ch1 commented 5 months ago

24

k1m0ch1 commented 5 months ago

image anjir 2 jam