soumyabroto closed this issue 1 year ago
Can you try with two samples instead of only one? That usually helps a lot.
sample = Sample(page, {'Case Number': 'BC211612'})
ValueError:
Case Number: BC211612
PACIFICA GARDEN TOWNHOMES VS 15936 HUNSAKER INC
I need the full stack trace, please; otherwise I'm only guessing.
/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in train_scraper(training_set, complexity)
     42     )
     43
---> 44     sample_matches = [
     45         sorted(s.get_matches(), key=lambda m: m.span)[:100]
     46         for s in training_set.item.samples

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in

/usr/local/lib/python3.9/dist-packages/mlscraper/training.py in

/usr/lib/python3.9/functools.py in __get__(self, instance, owner)
    991         val = cache.get(self.attrname, _NOT_FOUND)
    992         if val is _NOT_FOUND:
--> 993             val = self.func(instance)
    994             try:
    995                 cache[self.attrname] = val

/usr/local/lib/python3.9/dist-packages/mlscraper/matches.py in span(self)
    129     def span(self):
    130         # add span from this root to match root
--> 131         return sum(
    132             m.span + get_relative_depth(m.root, self.root)
    133             for m in self.match_by_key.values()

/usr/local/lib/python3.9/dist-packages/mlscraper/matches.py in

/usr/local/lib/python3.9/dist-packages/mlscraper/html.py in get_relative_depth(node, root)
    179
    180     # depth of root
--> 181     i = node_parents.index(root.soup)
    182
    183     # depth of element
ValueError:
Case Number: BC211612
PACIFICA GARDEN TOWNHOMES VS 15936 HUNSAKER INC
This is a non-obvious one. Have you tried with the latest version? Just install from git directly. Otherwise maybe start debugging and see what node_parents contains and why the sample is not in there.
I will also check, but I'm quite busy this week.
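For anyone following the debugging suggestion: the failing line is `node_parents.index(root.soup)`, and `list.index` raises `ValueError` whenever the item is not in the list. Below is a minimal sketch of that failure mode. It uses a hypothetical stand-in `Node` class and helper names of my own, not mlscraper's or BeautifulSoup's real objects, so treat it as an illustration of what to look for rather than the library's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # eq=False: nodes compare by identity, not by field values
class Node:
    """Hypothetical stand-in for a parsed HTML node (not mlscraper's class)."""
    name: str
    parent: Optional["Node"] = None


def parents_of(node: Node) -> List[Node]:
    """Return the ancestors of node, nearest first."""
    result = []
    current = node.parent
    while current is not None:
        result.append(current)
        current = current.parent
    return result


def relative_depth(node: Node, root: Node) -> int:
    """Count the steps from node up to root, with a readable error if root is not an ancestor."""
    node_parents = parents_of(node)
    try:
        return node_parents.index(root) + 1
    except ValueError:
        # Same exception type as in the trace, but listing what was searched.
        names = [p.name for p in node_parents]
        raise ValueError(
            f"{root.name!r} is not an ancestor of {node.name!r}; ancestors: {names}"
        ) from None


html = Node("html")
body = Node("body", parent=html)
td = Node("td", parent=body)
other = Node("div")  # detached node, not in td's ancestor chain
```

Calling `relative_depth(td, other)` reproduces the kind of `ValueError` seen in the trace; printing the ancestor list at that point is the quickest way to see why the expected root is missing.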
Just checked the HTML, this cannot be extracted with only CSS selectors because it's a substring of a (complex) css selector. You could try to match the full content of the
or just use ChatGPT.
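One way to do the post-processing by hand, as a rough sketch: grab the whole element's text by whatever means, then pull the case number out with a regular expression. The function name `extract_case_number` and the pattern are my assumptions, based only on the text shown in the error message; neither is part of mlscraper.

```python
import re
from typing import Optional

# Assumed label format, taken from the element text in the error message.
CASE_NUMBER_RE = re.compile(r"Case Number:\s*([A-Z0-9]+)")


def extract_case_number(element_text: str) -> Optional[str]:
    """Return the value following the 'Case Number:' label, or None if absent."""
    # The page contains non-breaking spaces, so normalise them first.
    normalised = element_text.replace("\xa0", " ")
    match = CASE_NUMBER_RE.search(normalised)
    return match.group(1) if match else None


text = "Case Number:\xa0BC211612\nPACIFICA GARDEN TOWNHOMES VS 15936 HUNSAKER INC"
print(extract_case_number(text))
```

The scraper would then only need to locate the element containing the full text, and the regular expression handles the substring.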
I am trying to scrape the Case Number from the following HTML file.
Versions:
mlscraper: pip install --pre mlscraper
python: 3.9
Code:
import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# Fetch the page to train
HTMLFile = open("/content/PUS06037-BC2116122017-06-12 14_07_29.976088.txt", "r")

# Reading the file
index = HTMLFile.read()

# create a training sample
training_set = TrainingSet()
index = index.replace(u'\xa0', u' ')
page = Page(index)
sample = Sample(page, {'Filing Date:': '06/08/1999','Case Number:': 'BC211612'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)
I am getting the following message:
NoScraperFoundException Traceback (most recent call last)