idio / json-wikipedia

Json Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump. Questions? https://gitter.im/idio-opensource/Lobby

List items being picked up as independent paragraphs #41

Open keynmol opened 7 years ago

keynmol commented 7 years ago

Example: https://simple.wikipedia.org/wiki/Human_evolution ("Species list" section)

In the XML dump, the section looks like this:

== Species list ==
This list is in chronological order by [[genus]].

* ''[[Sahelanthropus]]''
** ''[[Sahelanthropus tchadensis]]''
* ''[[Orrorin]]''
** ''[[Orrorin tugenensis]]''
* ''[[Ardipithecus]]''
** ''[[Ardipithecus kadabba]]''
** ''[[Ardipithecus ramidus]]''
* ''[[Australopithecus]]''
** ''[[Australopithecus anamensis]]''
** ''[[Australopithecus afarensis]]''
** ''[[Australopithecus bahrelghazali]]''
** ''[[Australopithecus africanus]]''
** ''[[Australopithecus garhi]]''
...

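Jsonpedia splits this section oddly: the nested list items are jammed together into a single paragraph, and the link annotations carry wrong offsets (every link starts at 0 and ends at its own anchor length, rather than at its position in the merged paragraph):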

    {
      "paragraph": "Australopithecus anamensis Australopithecus afarensis Australopithecus bahrelghazali Australopithecus africanus Australopithecus garhi",
      "links": [
        {
          "id": "Australopithecus_anamensis",
          "anchor": "Australopithecus anamensis",
          "start": 0,
          "end": 26
        },
        {
          "id": "Australopithecus_afarensis",
          "anchor": "Australopithecus afarensis",
          "start": 0,
          "end": 26
        },
        {
          "id": "Australopithecus_bahrelghazali",
          "anchor": "Australopithecus bahrelghazali",
          "start": 0,
          "end": 30
        },
        {
          "id": "Australopithecus_africanus",
          "anchor": "Australopithecus africanus",
          "start": 0,
          "end": 26
        },
        {
          "id": "Australopithecus_garhi",
          "anchor": "Australopithecus garhi",
          "start": 0,
          "end": 22
        }
      ]
    }
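
For comparison, here is a rough sketch of what consistent offsets could look like if the list items are merged into one paragraph on purpose. It is not the json-wikipedia code: the function `merge_items`, the single-space separator, and the per-item input shape are all assumptions made for illustration. The idea is just to shift each link's item-relative `start`/`end` by the position where its item begins in the merged text.

```python
from typing import Dict, List


def merge_items(items: List[Dict], sep: str = " ") -> Dict:
    """Join list items into one paragraph and rebase link offsets.

    Hypothetical helper: each item is assumed to look like
    {"paragraph": str, "links": [{"id", "anchor", "start", "end"}]}
    with offsets relative to the item's own text.
    """
    parts: List[str] = []
    links: List[Dict] = []
    cursor = 0  # offset in the merged paragraph where the next item starts
    for item in items:
        if parts:
            cursor += len(sep)  # account for the separator before this item
        for link in item["links"]:
            links.append({
                **link,
                "start": link["start"] + cursor,
                "end": link["end"] + cursor,
            })
        parts.append(item["paragraph"])
        cursor += len(item["paragraph"])
    return {"paragraph": sep.join(parts), "links": links}


# Rebuild the five sub-items from the example above, one item per list entry.
names = [
    "Australopithecus anamensis",
    "Australopithecus afarensis",
    "Australopithecus bahrelghazali",
    "Australopithecus africanus",
    "Australopithecus garhi",
]
items = [
    {
        "paragraph": name,
        "links": [{
            "id": name.replace(" ", "_"),
            "anchor": name,
            "start": 0,
            "end": len(name),
        }],
    }
    for name in names
]

merged = merge_items(items)

# Every link should now slice back to its own anchor text.
for link in merged["links"]:
    assert merged["paragraph"][link["start"]:link["end"]] == link["anchor"]
```

With this sketch the merged paragraph is identical to the one in the output above, but the links get distinct, non-overlapping ranges instead of all starting at 0. Whether merging sibling items is the desired behaviour at all, or each item should stay its own paragraph, is a separate question.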