lorey / mlscraper

🤖 Scrape data from HTML websites automatically by just providing examples
https://pypi.org/project/mlscraper/

Scraper not found error #38

Closed soumyabroto closed 1 year ago

soumyabroto commented 1 year ago

I am trying to scrape the Case Number from the following HTML File.

Versions:

- mlscraper: `pip install --pre mlscraper`
- python: 3.9

Code:

```python
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# Fetch the page to train
HTMLFile = open("/content/PUS06037-BC2116122017-06-12 14_07_29.976088.txt", "r")

# Reading the file
index = HTMLFile.read()

# create a training sample
training_set = TrainingSet()
index = index.replace(u'\xa0', u' ')
page = Page(index)
sample = Sample(page, {'Filing Date:': '06/08/1999', 'Case Number:': 'BC211612'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)
```

I am getting the following message:

```
NoScraperFoundException                   Traceback (most recent call last)
<ipython-input> in <module>
     18
     19 # train the scraper with the created training set
---> 20 scraper = train_scraper(training_set)
     21
     22 # scrape another page

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in train_scraper(training_set, complexity)
     72                 f"({complexity=}, {match_combination=})"
     73             )
---> 74     raise NoScraperFoundException("did not find scraper")
     75
     76

NoScraperFoundException: did not find scraper
```

[PUS06037-BC2116122017-06-12 14_07_29.txt](https://github.com/lorey/mlscraper/files/11271236/PUS06037-BC2116122017-06-12.14_07_29.txt)
lorey commented 1 year ago

Can you try with two samples instead of only one? That usually helps a lot.
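To see why a second sample helps, here is a toy illustration (not mlscraper's internals; the selector names, the second page, and the value `BC999999` are made up for the example): with only one example value, several candidate selectors are consistent with the training data, and a second sample page eliminates the ambiguous ones.

```python
# Toy pages: each maps candidate selectors to the text they would extract.
# The second page and the value "BC999999" are hypothetical.
pages = [
    {"h1": "BC211612", ".case": "BC211612", ".title": "PACIFICA GARDEN ..."},
    {"h1": "Case detail", ".case": "BC999999", ".title": "ANOTHER CASE ..."},
]
wanted = ["BC211612", "BC999999"]

def consistent_selectors(pages, wanted):
    # keep only selectors whose extracted value matches *every* sample
    return [
        sel for sel in pages[0]
        if all(page.get(sel) == value for page, value in zip(pages, wanted))
    ]

print(consistent_selectors(pages[:1], wanted[:1]))  # ['h1', '.case'] - ambiguous
print(consistent_selectors(pages, wanted))          # ['.case'] - unique
```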

soumyabroto commented 1 year ago

Using just the Case Number, I get the below error:

```python
sample = Sample(page, {'Case Number': 'BC211612'})
```

```
ValueError:
Case Number: BC211612
PACIFICA GARDEN TOWNHOMES VS 15936 HUNSAKER INC
 is not in list
```

lorey commented 1 year ago

I need the full stack trace, please; otherwise it's only guessing.

soumyabroto commented 1 year ago

Sorry, here it is:

```
/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in train_scraper(training_set, complexity)
     42         )
     43
---> 44     sample_matches = [
     45         sorted(s.get_matches(), key=lambda m: m.span)[:100]
     46         for s in training_set.item.samples

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in <listcomp>(.0)
     43
     44     sample_matches = [
---> 45         sorted(s.get_matches(), key=lambda m: m.span)[:100]
     46         for s in training_set.item.samples
     47     ]

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in <lambda>(m)
     43
     44     sample_matches = [
---> 45         sorted(s.get_matches(), key=lambda m: m.span)[:100]
     46         for s in training_set.item.samples
     47     ]

/usr/lib/python3.9/functools.py in __get__(self, instance, owner)
    991         val = cache.get(self.attrname, _NOT_FOUND)
    992         if val is _NOT_FOUND:
--> 993             val = self.func(instance)
    994             try:
    995                 cache[self.attrname] = val

/usr/local/lib/python3.9/dist-packages/mlscraper/matches.py in span(self)
    129     def span(self):
    130         # add span from this root to match root
--> 131         return sum(
    132             m.span + get_relative_depth(m.root, self.root)
    133             for m in self.match_by_key.values()

/usr/local/lib/python3.9/dist-packages/mlscraper/matches.py in <genexpr>(.0)
    130         # add span from this root to match root
    131         return sum(
--> 132             m.span + get_relative_depth(m.root, self.root)
    133             for m in self.match_by_key.values()

/usr/local/lib/python3.9/dist-packages/mlscraper/html.py in get_relative_depth(node, root)
    179
    180     # depth of root
--> 181     i = node_parents.index(root.soup)
    182
    183     # depth of element

ValueError:
Case Number: BC211612
PACIFICA GARDEN TOWNHOMES VS 15936 HUNSAKER INC
 is not in list
```

lorey commented 1 year ago

This is a non-obvious one. Have you tried the latest version? Just install directly from git. Otherwise, start debugging and see what `node_parents` contains and why the sample's root is not in there.
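For context, the last frame of the trace fails at `node_parents.index(root.soup)`. A minimal stdlib reproduction of that failure mode (with plain strings standing in for the actual BeautifulSoup nodes, which is an assumption for illustration only) shows where the `... is not in list` message comes from:

```python
# list.index raises ValueError when the proposed root is not among the
# node's ancestors, e.g. when the matches for the sample's values sit in
# disjoint subtrees. Strings stand in for BeautifulSoup nodes here.
node_parents = ["td", "tr", "table", "body", "html"]  # ancestors of the match
root = "div"  # candidate root that is NOT an ancestor of the match

try:
    depth = node_parents.index(root)
except ValueError as exc:
    print(exc)  # 'div' is not in list
```

The repr of the missing element ends up in the exception text, which is why the garbled-looking "Case Number: BC211612 ... is not in list" message is really just `ValueError` echoing the node it could not find.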

I will also check, but I'm quite busy this week.

lorey commented 1 year ago

Just checked the HTML: this cannot be extracted with CSS selectors alone, because the case number is only a substring of the text matched by a (complex) CSS selector. You could try to match the full content of the element instead, or just use ChatGPT.
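Since CSS selectors address whole elements rather than substrings of their text, one workaround (a generic sketch, not mlscraper's API; the `text` variable below is a hypothetical snippet of the element's content taken from the error message in this thread) is to grab the element's full text and carve the case number out with a regular expression:

```python
import re

# Hypothetical full text content of the matched element; the selector can
# only give us the whole thing, so the case number is extracted afterwards.
text = "Case Number: BC211612\nPACIFICA GARDEN TOWNHOMES VS 15936 HUNSAKER INC"

match = re.search(r"Case Number:\s*([A-Z]{2}\d+)", text)
case_number = match.group(1) if match else None
print(case_number)  # BC211612
```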