BitGeek29 closed this issue 1 year ago
I cannot possibly guess without the code.
Hi lorey, here is the code:
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper
# fetch the page to train
# einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
# resp = requests.get(einstein_url)
# assert resp.status_code == 200
with open('/content/linkedin_exp_2.html', 'r') as f:
    html_content = f.read()
# create a sample for Albert Einstein
training_set = TrainingSet()
page = Page(html_content)
sample = Sample(page, {
    'Experience_title': "SDA",
    'Experience_type': "Saama · Full-time",
    'Experience_time_period': "Jan 2021 - Sep 2021 · 9 mos",
    'Experience_location': "Maharashtra, India",
    'Experience_description': "working as senior data analyst for Data ingestion into snowflake ,help engineer in understanding data and connectivity",
    'Experience_skills': "Agile Methodologies · Linux · Amazon S3 · Snowflake"
})
training_set.add_sample(sample)
# train the scraper with the created training set
scraper = train_scraper(training_set)
# scrape another page
# resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(html_content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}
Thanks, that helps.
Generating statistically valid rules is virtually impossible with one sample. Try adding several samples.
sample = Sample(page, {
    'Experience_title': "SDA",
    'Experience_type': "Saama · Full-time",
    'Experience_time_period': "Jan 2021 - Sep 2021 · 9 mos",
    'Experience_location': "Maharashtra, India",
    'Experience_description': "working as senior data analyst for Data ingestion into snowflake ,help engineer in understanding data and connectivity",
    'Experience_skills': "Agile Methodologies · Linux · Amazon S3 · Snowflake",
    'Experience_title': "Senior consultant,RDC",
    'Experience_type': "PwC India · Full-time",
    'Experience_time_period': "Dec 2019 - Oct 2020 · 11 mos",
    'Experience_location': "Kolkata Area, India",
    'Experience_description': "worked as Senior Consultant",
    'Experience_skills': "Linux · Microsoft Azure",
    'Experience_title': "Senior Analyst",
    'Experience_type': "Barclays",
    'Experience_time_period': "Jun 2017 - Nov 2019 · 2 yrs 6 mos",
    'Experience_location': "Pune Area, India",
    'Experience_description': "working as an Senior analyst",
    'Experience_skills': "Linux",
})
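Note that a Python dict literal silently discards duplicate keys: each repeated key overwrites the previous value, so the `Sample` above ends up holding only the last (Barclays) entry rather than three experiences. A quick check:

```python
# Duplicate keys in a dict literal: later values overwrite earlier ones,
# so only one 'Experience_title' survives.
attrs = {
    'Experience_title': "SDA",
    'Experience_title': "Senior consultant,RDC",
    'Experience_title': "Senior Analyst",
}
print(attrs)       # {'Experience_title': 'Senior Analyst'}
print(len(attrs))  # 1
```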
I got the same issue with sample data. I just told it to train on a GitHub issues page, and it used almost 20 GB of RAM before crashing without any result. Not sure how to debug this.
The same file worked with autoscraper:
scraper = AutoScraper()
scraper.load('github')

wanted_dict = {
    "title": ["Possible to to try to extract main article from a page?"],
    "meta": ['/vzeazy'],
}

html_file = open('sample/train.html', 'r', encoding='utf-8')
source_code = html_file.read()

result = scraper.build(html=source_code, wanted_dict=wanted_dict)
scraper.save('github')
I'm just replying to you @BitGeek29, not sure if you're connected with this other guy.
You did not add the second sample as I asked you to. Please do that first, as it is required to derive non-trivial CSS rules. You have to call training_set.add_sample()
for two different pages.
@entrptaher I told you what to do in your issue. You're very welcome to comment there and get help, but please do not post here, as this just adds confusion and might be completely unrelated.
:/ How is it unrelated? I actually commented on this issue first because I had the same issue with high memory usage, but then realised this was closed.
But I understand, and will continue there.
During training, RAM usage grows without bound: the program is killed by the system, and it even crashed my Google Colab instance with 12 GB of RAM.
mlscraper==1.0.0rc3