lorey / mlscraper

馃 Scrape data from HTML websites automatically by just providing examples
https://pypi.org/project/mlscraper/
1.31k stars, 89 forks

Extreme RAM usage #39

Closed BitGeek29 closed 1 year ago

BitGeek29 commented 1 year ago

Training uses unbounded RAM; the program is killed by the system once memory runs out. It even crashed my Google Colab instance with 12 GB of RAM.

mlscraper==1.0.0rc3

lorey commented 1 year ago

I cannot possibly guess without the code.

BitGeek29 commented 1 year ago

Hi lorey, here is the code:

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
# einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
# resp = requests.get(einstein_url)
# assert resp.status_code == 200
with open('/content/linkedin_exp_2.html', 'r') as f:
    html_content = f.read()

# create a sample from the saved LinkedIn page
training_set = TrainingSet()
page = Page(html_content)
sample = Sample(page, {
    'Experience_title': "SDA",
    'Experience_type': "Saama · Full-time",
    'Experience_time_period': "Jan 2021 - Sep 2021 · 9 mos",
    'Experience_location': "Maharashtra, India",
    'Experience_description': "working as senior data analyst for Data ingestion into snowflake, help engineer in understanding data and connectivity",
    'Experience_skills': "Agile Methodologies · Linux · Amazon S3 · Snowflake",
})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape the same page with the trained scraper
result = scraper.get(Page(html_content))
print(result)

linkedin_exp_2.html.txt

lorey commented 1 year ago

Thanks, that helps.

Generating statistically valid rules is virtually impossible with one sample. Try adding several samples.

BitGeek29 commented 1 year ago

sample = Sample(page, {
    'Experience_title': "SDA",
    'Experience_type': "Saama · Full-time",
    'Experience_time_period': "Jan 2021 - Sep 2021 · 9 mos",
    'Experience_location': "Maharashtra, India",
    'Experience_description': "working as senior data analyst for Data ingestion into snowflake ,help engineer in understanding data and connectivity",
    'Experience_skills': "Agile Methodologies · Linux · Amazon S3 · Snowflake",
    'Experience_title': "Senior consultant,RDC",
    'Experience_type': "PwC India · Full-time",
    'Experience_time_period': "Dec 2019 - Oct 2020 · 11 mos",
    'Experience_location': "Kolkata Area, India",
    'Experience_description': "worked as Senior Consultant",
    'Experience_skills': "Linux · Microsoft Azure",
    'Experience_title': "Senior Analyst",
    'Experience_type': "Barclays",
    'Experience_time_period': "Jun 2017 - Nov 2019 · 2 yrs 6 mos",
    'Experience_location': "Pune Area, India",
    'Experience_description': "working as an Senior analyst",
    'Experience_skills': "Linux",
})
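Note that the dict literal above repeats each key three times. In Python, duplicate keys in a dict literal do not accumulate: each repeated key silently overwrites the previous value, so this sample ends up holding only the last (Barclays) entry. A quick sketch of the effect:

```python
# Duplicate keys in a Python dict literal overwrite earlier values,
# so only the last assignment per key survives.
sample_dict = {
    'Experience_title': "SDA",
    'Experience_title': "Senior consultant,RDC",
    'Experience_title': "Senior Analyst",
}
print(sample_dict)       # {'Experience_title': 'Senior Analyst'}
print(len(sample_dict))  # 1
```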

(Attached: Screenshot from 2023-04-30 23-11-39, Screenshot from 2023-04-30 23-12-51)

entrptaher commented 1 year ago

I got the same issue with my sample data. I told it to train on a GitHub issues page, and it used almost 20 GB of RAM before crashing without producing any result. Not sure how to debug this.


The same file worked with autoscraper.

from autoscraper import AutoScraper

scraper = AutoScraper()
scraper.load('github')

wanted_dict = {
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ['/vzeazy'],
}

with open('sample/train.html', 'r', encoding='utf-8') as f:
    source_code = f.read()
result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')

lorey commented 1 year ago

I'm just replying to you @BitGeek29, not sure if you're connected with this other guy.

You did not add the second example as I asked you to. Please do that first, as it is required to derive non-trivial CSS rules. You have to call training_set.add_sample() for two different pages.

lorey commented 1 year ago

@entrptaher I told you what to do in your issue. You're highly welcome to comment there and get help, but please do not post here, as this just adds confusion and might be completely unrelated.

entrptaher commented 1 year ago

:/ How is it unrelated? I actually commented on this issue first because I had the same issue with high memory usage, but then realised it was closed.

But I understand, and will continue there.