mediawiki-utilities / python-mwcites

MIT License

Error while extracting citations from the dumps #17

Open kodchi opened 5 years ago

kodchi commented 5 years ago

While extracting citations from the hewiki dumps of 2019/05/01, the following error occurs:

$ mwcites extract /mnt/data/xmldatadumps/public/hewiki/20190501/hewiki-20190501-pages-meta-history*.xml*.bz2 > hewiki-20190501-citations.tsv
Traceback (most recent call last):
  File "/srv/home/bmansurov/venv/mwcites/bin/mwcites", line 11, in <module>
    sys.exit(main())
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/mwcites.py", line 49, in main
    module.main(sys.argv[2:])
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 58, in main
    run(dump_files, extractors)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 65, in run
    for page_id, title, rev_id, timestamp, type, id in cites:
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/map.py", line 87, in map
Failed while processing dump '/mnt/data/xmldatadumps/public/hewiki/20190501/hewiki-20190501-pages-meta-history1.xml-p13702p18009.bz2':
Traceback (most recent call last):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/processor.py", line 35, in run
    for out in self.process_dump(dump, path):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 94, in process_dump
    for cite in extract_cite_history(page, extractors):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 116, in extract_cite_history
    for revision in page:
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/page.py", line 72, in load_revisions
    yield Revision.from_element(sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 99, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 20, in <lambda>
    'contributor': lambda e: Contributor.from_element(e),
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 40, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 14, in <lambda>
    'id': lambda e: int(e.text),
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
    re_raise(error, path)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/map.py", line 12, in re_raise
    raise error
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Failed while processing dump '/mnt/data/xmldatadumps/public/hewiki/20190501/hewiki-20190501-pages-meta-history1.xml-p6536p13701.bz2':
Traceback (most recent call last):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/processor.py", line 35, in run
    for out in self.process_dump(dump, path):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 94, in process_dump
    for cite in extract_cite_history(page, extractors):
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mwcites/utilities/extract.py", line 116, in extract_cite_history
    for revision in page:
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/page.py", line 72, in load_revisions
    yield Revision.from_element(sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 99, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/revision.py", line 20, in <lambda>
    'contributor': lambda e: Contributor.from_element(e),
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 40, in from_element
    values = consume_tags(cls.TAG_MAP, element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/util.py", line 7, in consume_tags
    value_map[tag_name] = tag_map[tag_name](sub_element)
  File "/srv/home/bmansurov/venv/mwcites/lib/python3.5/site-packages/mw/xml_dump/iteration/contributor.py", line 14, in <lambda>
    'id': lambda e: int(e.text),
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
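
Both tracebacks end at the same place: the contributor `id` handler calls `int(e.text)` while `e.text` is `None`, which means a contributor's `<id>` element in the dump had no text. The sketch below reproduces that condition with the standard library and shows one possible way to tolerate it. It is only an illustration, not the library's actual code or fix: `parse_optional_int` is a hypothetical helper name, and the XML snippet is made up for the example rather than taken from the hewiki dump.

```python
import xml.etree.ElementTree as ET


def parse_optional_int(text):
    """Return int(text), or None when the tag carried no usable text.

    Hypothetical helper for illustration; not part of mw.xml_dump.
    """
    if text is None or not text.strip():
        return None
    return int(text)


# Made-up contributor element whose <id> tag is empty.
element = ET.fromstring(
    "<contributor><username>Example</username><id /></contributor>"
)
id_element = element.find("id")

# ElementTree reports the text of an empty element as None,
# so int(id_element.text) would raise the TypeError seen above.
contributor_id = parse_optional_int(id_element.text)
print(contributor_id)
```

Returning `None` for a missing id is an assumption about what callers can handle; skipping the revision or logging the page id would be other reasonable choices.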