lschmelzeisen / wikidated


Error in build_wikidated_v1_0.py #1

Open · Woffee opened this issue 1 year ago

Woffee commented 1 year ago

Dear author, I tried running the build_wikidated_v1_0.py script, but I encountered the following error. Could you help me check what's going wrong?

$ python build_wikidated_v1_0.py 
2023-09-04 11:08:23,462 E Exception occurred.
Traceback (most recent call last):
  File "build_wikidated_v1_0.py", line 44, in <module>
    _main()
  File "build_wikidated_v1_0.py", line 36, in _main
    wikidata_dump = wikidated_manager.wikidata_dump(date(year=2023, month=9, day=1))
  File "/data/wikidated/src/wikidated/wikidated_manager.py", line 42, in wikidata_dump
    return WikidataDump(self.dump_dir, version=version, mirror=mirror)
  File "/data/wikidated/src/wikidated/wikidata/wikidata_dump.py", line 61, in __init__
    self._dump_dir, self.version, self.mirror
  File "/data/wikidated/src/wikidated/wikidata/wikidata_dump.py", line 160, in load
    dump_status = _WikidataDumpStatus.parse_file(path)
  File "pydantic/main.py", line 569, in pydantic.main.BaseModel.parse_file
  File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 988 validation errors for _WikidataDumpStatus
jobs -> xmlpagelogsdumprecombine -> updated
  time data '' does not match format '%Y-%m-%d %H:%M:%S' (type=value_error)
jobs -> xmlpagelogsdumprecombine -> files -> wikidatawiki-20230901-pages-logging.xml.gz -> size
  field required (type=value_error.missing)
jobs -> xmlpagelogsdumprecombine -> files -> wikidatawiki-20230901-pages-logging.xml.gz -> url
  field required (type=value_error.missing)
jobs -> xmlpagelogsdumprecombine -> files -> wikidatawiki-20230901-pages-logging.xml.gz -> md5
  field required (type=value_error.missing)
jobs -> xmlpagelogsdumprecombine -> files -> wikidatawiki-20230901-pages-logging.xml.gz -> sha1
  field required (type=value_error.missing)
...
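
Judging from these messages, the affected entries in the downloaded dumpstatus.json presumably look roughly like this (an illustrative reconstruction from the errors above, not an actual excerpt; other fields omitted):

{
  "jobs": {
    "xmlpagelogsdumprecombine": {
      "updated": "",
      "files": {
        "wikidatawiki-20230901-pages-logging.xml.gz": {}
      }
    }
  }
}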

By the way, I made the following two modifications:

  1. Since Apache no longer provides Maven 3.8.4, I updated the Maven version to 3.8.8.
  2. Since Wikimedia no longer provides dumps for older dates, I set the dump date to 2023-09-01, as shown below.
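
For reference, the second change is just the date passed to wikidata_dump in build_wikidated_v1_0.py, matching the call visible at line 36 of the traceback above:

from datetime import date

# Modified call in build_wikidated_v1_0.py (line 36 in the traceback):
wikidata_dump = wikidated_manager.wikidata_dump(date(year=2023, month=9, day=1))
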
sarda-devesh commented 7 months ago

@Woffee I was running into a similar issue, and through some debugging I was able to find the cause of the error: some fields are missing in the downloaded dumpstatus.json, but the PydanticModel classes assume that all fields are present. I was able to fix it by updating the following sections of code in wikidata_dump.py:

class _WikidataDumpStatusFile(PydanticModel):
    # Defaults, so that file entries whose size/url/md5/sha1 are missing from
    # dumpstatus.json (see the "field required" errors above) still validate.
    size: int = 0
    url: str = ""
    md5: str = ""
    sha1: str = ""

class _WikidataDumpStatusJob(PydanticModel):
    status: str
    updated: datetime
    # Allow the "files" mapping to be absent entirely.
    files: Mapping[str, _WikidataDumpStatusFile] = None

    @validator("updated", pre=True)
    def _parse_datetime(cls, value: str) -> datetime:  # noqa: N805
        # An empty "updated" string (see the "time data ''" error above) falls
        # back to datetime.min instead of failing to parse.
        value = value.strip()
        if len(value) == 0:
            return datetime.min

        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")

class _WikidataDumpStatus(PydanticModel):
    jobs: Mapping[str, _WikidataDumpStatusJob]
    version: str

    @classmethod
    def load(cls, dump_dir: Path, version: date, mirror: str) -> _WikidataDumpStatus:
        path = dump_dir / f"wikidatawiki-{version:%4Y%2m%2d}-dumpstatus.json"
        print("Loading data from path", path)
        if not path.exists():
            url = f"{mirror}/wikidatawiki/{version:%4Y%2m%2d}/dumpstatus.json"
            _LOGGER.debug(f"Downloading Wikidata dump status from '{url}'.")

            response = requests.get(url)
            response.raise_for_status()
            path.parent.mkdir(exist_ok=True, parents=True)
            with path.open("w", encoding="UTF-8") as fd:
                fd.write(json.dumps(response.json(), indent=2) + "\n")

            _LOGGER.debug("Done downloading Wikidata dump status.")

        dump_status = _WikidataDumpStatus.parse_file(path)
        for job_name, job in dump_status.jobs.items():
            if job.status != "done":
                path.unlink()
                raise Exception(f"Job '{job_name}' is not 'done', but '{job.status}'.")

        return dump_status
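
With these defaults and the validator in place, the previously failing entries parse instead of aborting. A quick, hypothetical spot check (assuming PydanticModel is a pydantic v1 BaseModel subclass, as the pydantic/main.py frames in the traceback suggest):

# Hypothetical spot check: an empty "updated" string and a missing "files"
# mapping now parse instead of raising a ValidationError.
job = _WikidataDumpStatusJob.parse_obj({"status": "done", "updated": ""})
print(job.updated)  # datetime.min (from the _parse_datetime validator)
print(job.files)    # None (the new field default)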

Note that you might still get an error that a job's status is not 'done', because Wikimedia hasn't finished creating that dump yet. Thus I would recommend using an older dump that has already completed.
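
To avoid that, one quick way to pick a usable date is to fetch the dump's dumpstatus.json and check that every job reports "done" before running the build script. A rough sketch (it assumes the default dumps.wikimedia.org mirror and the jobs/status layout used by the code above; the dump_is_complete helper is hypothetical):

from datetime import date

import requests


def dump_is_complete(version: date, mirror: str = "https://dumps.wikimedia.org") -> bool:
    # Same URL pattern as _WikidataDumpStatus.load() above, with a portable
    # strftime format.
    url = f"{mirror}/wikidatawiki/{version:%Y%m%d}/dumpstatus.json"
    response = requests.get(url)
    response.raise_for_status()
    jobs = response.json()["jobs"]
    # The build is only safe once every job of the dump has finished.
    return all(job["status"] == "done" for job in jobs.values())


# For example, only build against 2023-09-01 if that dump actually completed.
print(dump_is_complete(date(2023, 9, 1)))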