Closed ShadowKyogre closed 9 years ago
Thank you for the report, I will look into it.
Interesting, it looks like CSSSelector instances using :contains() are not reusable. We create selector instances when the worker process is initialized, and this seems to work fine for other selectors, but not for ones with :contains(). With the following change, yugioh-wikia-com compiles without errors using your filters:
diff --git a/mwscrape2slob/__init__.py b/mwscrape2slob/__init__.py
index 1a39a92..7332540 100644
--- a/mwscrape2slob/__init__.py
+++ b/mwscrape2slob/__init__.py
@@ -82,7 +82,7 @@ NAMESPACES = {}
 def process_initializer(css_selectors, interwikimap, namespaces):
     logging.basicConfig()
     for css_selector in css_selectors:
-        SELECTORS.append(CSSSelector(css_selector))
+        SELECTORS.append(css_selector)
     for item in interwikimap:
         prefix = item.get('prefix')
         url = item.get('url')
@@ -429,7 +429,7 @@ def convert(title, text, rtl, server, articlepath, args):
     convert_get_microformat(doc)
     for selector in SELECTORS:
-        for item in selector(doc):
+        for item in CSSSelector(selector)(doc):
             item.drop_tree()
     for item in SEL_A_IPA(doc):
Fixed in 56d3d45aeabece9beca2a41404849060e919a9cf
Pastebin of the error: https://ptpb.pw/QQle.txt (I hit Ctrl+C early so you wouldn't have to see miles of the same thing). The run was in a virtualenv dedicated to mwscrape2slob and the new format.
Possibly related issues: https://github.com/aarddict/tools/pull/39#issuecomment-74524441
How to reproduce:
mwscrape yugioh.wikia.com --site-path=/ --titles=@articles.txt
This will take a while because card images.
mwscrape2slob http://127.0.0.1:5984/yugioh-wikia-com -f ${yourfilterfilenamehere}
Expected results: No errors and a *.slob file filled with the cards.
Actual results: See the pastebin.