itkach / mw2slob

A tool to convert MediaWiki content to dictionaries in slob format
GNU General Public License v3.0

XPath lower-case not defined #4

Closed: ShadowKyogre closed this issue 9 years ago

ShadowKyogre commented 9 years ago

Pastebin of the error: https://ptpb.pw/QQle.txt (I hit Ctrl+C early so you wouldn't have to see miles of the same thing). The command was run in a virtualenv dedicated to mwscrape2slob and the new format.

Possibly related issues: https://github.com/aarddict/tools/pull/39#issuecomment-74524441

How to reproduce:

  1. Grab the current list of OCG cards through the MediaWiki API with this Python script I whipped up: https://gist.github.com/ShadowKyogre/86881fe3e2b9ff00492b
  2. Run the following command to grab the test data: mwscrape yugioh.wikia.com --site-path=/ --titles=@articles.txt (this will take a while because of the card images).
  3. Use the following filter set, either saved in a new file or fed manually to the command line:
.cardtablespanrow:contains("sets")
.cardtablespanrow:contains("Card appearances")
tr:contains("Anime") + tr
tr:contains("Anime")
tr:contains("Video game statuses") + tr
tr:contains("Video game statuses")
table.collapsible.hslit:contains("pages")
  4. Run the following command to convert the test data to the *.slob format: mwscrape2slob http://127.0.0.1:5984/yugioh-wikia-com -f ${yourfilterfilenamehere}
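
For anyone reproducing this, the filter set from step 3 could be turned into a filter file with a rough sketch like the one below. The helper name render_filter_file is hypothetical, and the one-selector-per-line layout is my assumption about what -f expects; the sketch also strips stray whitespace around each selector.

```python
# Selectors copied from step 3 of the report.
FILTERS = [
    '.cardtablespanrow:contains("sets")',
    '.cardtablespanrow:contains("Card appearances")',
    'tr:contains("Anime") + tr',
    'tr:contains("Anime")',
    'tr:contains("Video game statuses") + tr',
    'tr:contains("Video game statuses")',
    'table.collapsible.hslit:contains("pages")',
]


def render_filter_file(selectors):
    """Return filter-file text: one selector per line (assumed layout).

    Surrounding whitespace is stripped and empty entries are skipped.
    """
    return "".join(s.strip() + "\n" for s in selectors if s.strip())
```

The returned text can then be saved under any name (for example filters.txt, a made-up name) and passed to -f in step 4.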

Expected results: No errors and a *.slob file filled with the cards.

Actual results: See the pastebin.

itkach commented 9 years ago

Thank you for the report, I will look into it.

itkach commented 9 years ago

Interesting, it looks like CSSSelector instances using :contains() are not reusable. We create the selector instances when a worker process is initialized, and this seems to work fine for other selectors, but not for ones with :contains(). With the following change, yugioh-wikia-com compiles without errors using your filters:

diff --git a/mwscrape2slob/__init__.py b/mwscrape2slob/__init__.py
index 1a39a92..7332540 100644
--- a/mwscrape2slob/__init__.py
+++ b/mwscrape2slob/__init__.py
@@ -82,7 +82,7 @@ NAMESPACES = {}
 def process_initializer(css_selectors, interwikimap, namespaces):
     logging.basicConfig()
     for css_selector in css_selectors:
-        SELECTORS.append(CSSSelector(css_selector))
+        SELECTORS.append(css_selector)
     for item in interwikimap:
         prefix = item.get('prefix')
         url = item.get('url')
@@ -429,7 +429,7 @@ def convert(title, text, rtl, server, articlepath, args):
     convert_get_microformat(doc)

     for selector in SELECTORS:
-        for item in selector(doc):
+        for item in CSSSelector(selector)(doc):
             item.drop_tree()

     for item in SEL_A_IPA(doc):

itkach commented 9 years ago

Fixed in 56d3d45aeabece9beca2a41404849060e919a9cf