suyashb95 / WiktionaryParser

A Python Wiktionary Parser
MIT License

[Norwegian] Some pages are not being scraped properly #33

Open C0rn3j opened 6 years ago

C0rn3j commented 6 years ago

Out of all the issues I opened here, this one is the most important to me, as I've used this project to create a Kindle-compatible dictionary, and incomplete/missing entries are the bane of every dictionary project ^^


https://en.wiktionary.org/wiki/seg#Norwegian_Bokm%C3%A5l Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": [], "audio": []}}]

https://en.wiktionary.org/wiki/ham#Norwegian_Bokm%C3%A5l Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": ["IPA: /h\u0251m/"], "audio": []}}]

https://en.wiktionary.org/wiki/by#Norwegian_Bokm%C3%A5l Missing the verb definition

[
    {
        "etymology": "From Old Norse býr (“place (to camp or settle), land, property, lot; and later settlement”).\n",
        "definitions": [
            {
                "partOfSpeech": "noun",
                "text": "by m (definite singular byen, indefinite plural byer, definite plural byene)\n\ntown, city (regardless of population size or land area)\n",
                "relatedWords": [
                    {
                        "relationshipType": "derived terms",
                        "words": [
                            "bydel",
                            "byfornyelse, byfornying",
                            "bygdeby",
                            "bymessig",
                            "bystat",
                            "bystatus",
                            "drabantby",
                            "ferieby",
                            "gamleby",
                            "havneby",
                            "hjemby",
                            "landsby",
                            "Mexico by",
                            "naboby",
                            "spøkelsesby",
                            "storby"
                        ]
                    }
                ],
                "examples": []
            }
        ],
        "pronunciations": {
            "text": [],
            "audio": []
        }
    },
    {
        "etymology": "From byde, from Old Norse bjóða, from Proto-Germanic *beudaną (“to offer”), from Proto-Indo-European *bʰewdʰ- (“to wake, rise up”).\n",
        "definitions": [],
        "pronunciations": {
            "text": [],
            "audio": []
        }
    }
]
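A minimal sketch of how entries like the ones above could be flagged automatically, assuming only the JSON shape shown in this issue. The helper name `find_empty_entries` is hypothetical and not part of WiktionaryParser; the sample data is a trimmed copy of the `by` output.

```python
import json


def find_empty_entries(entries):
    """Return the indexes of entries whose "definitions" list is empty or missing."""
    return [i for i, entry in enumerate(entries) if not entry.get("definitions")]


# Trimmed copy of the parser output for "by" quoted above.
raw = """
[
  {"etymology": "From Old Norse býr", "definitions": [{"partOfSpeech": "noun"}]},
  {"etymology": "From byde", "definitions": []}
]
"""
entries = json.loads(raw)
print(find_empty_entries(entries))
```

Running a check like this over a whole word list gives a quick count of entries that came back without any definitions.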

Here's a list of errors from my project for words in Norwegian Bokmål. It is totally possible that some errors are due to a mistake in my own scripts, but all I checked were thrown due to WiktionaryParser not parsing them properly or at all.

https://haste.rys.pw/raw/vevafamiwo

Another half-broken entry -

https://en.wiktionary.org/wiki/for#Norwegian_Bokm%C3%A5l

suyashb95 commented 6 years ago

Seems fixed in 39ba27422ab33e104d0f034df1e62848a5229c48

C0rn3j commented 6 years ago

Seems fixed indeed. Thank you a LOT.

Is there anywhere I can send you a few bucks to? Paypal?

suyashb95 commented 6 years ago

Appreciate it, but it's a hobby project so that's not necessary :D

C0rn3j commented 6 years ago

And your hobby project is incredibly helpful to me, so if you change your mind and I ever see a donation page/button on the main page, I'll use it ^^


Actually found one more under løsrive, it's missing the inflection part - https://en.wiktionary.org/wiki/l%C3%B8srive#Norwegian_Bokm%C3%A5l

[
  {
    "etymology": "From løs +‎ rive",
    "definitions": [
      {
        "partOfSpeech": "verb",
        "text": "(often reflexive, with seg / oneself)\nto break away\nto detach (oneself)\nto tear oneself away (fra / from)\nto secede (fra / from)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

EDIT: And one more in Bokmål - øl - it strips the first inflection line

https://en.wiktionary.org/wiki/%C3%B8l#Norwegian_Bokm%C3%A5l

[
  {
    "etymology": "From Old Norse ǫl, from Proto-Germanic *alu, from Proto-Indo-European *h₂elut- (“beer”).\n",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "øl m (definite singular ølen, indefinite plural øl, definite plural ølene) (a glass, bottle or can of beer)\n\nbeer (alcoholic drink)\na beer (in a glass, bottle or can)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [
        "IPA: /œl/",
        "Rhymes: -œl"
      ],
      "audio": []
    }
  }
]

suyashb95 commented 6 years ago

Inflections seem to be turning up properly now, although they're a part of the definition text itself

C0rn3j commented 6 years ago

Amazing, looking forward to a new release ^^

C0rn3j commented 6 years ago

That seems to have broken more than it fixed.

konkurs in Norwegian Bokmål in 0.0.8:

[
  {
    "etymology": "From Latin concursus",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": "konkurs (indeclinable)\n\nbankrupt\n",
        "relatedWords": [],
        "examples": [
          "gå konkurs - go bankrupt"
        ]
      },
      {
        "partOfSpeech": "noun",
        "text": "konkurs (indeclinable)\n\nbankrupt\nkonkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

and after in 0.0.91:

[
  {
    "etymology": "From Latin concursus",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": "konkurs (indeclinable)\n\nbankrupt\n",
        "relatedWords": [],
        "examples": [
          "gå konkurs - go bankrupt"
        ]
      },
      {
        "partOfSpeech": "noun",
        "text": "konkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

heis in 0.0.91 has a duped entry

[
  {
    "etymology": "From the verb heise",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  },
  {
    "etymology": "From the verb heise",
    "definitions": [
      {
        "partOfSpeech": "verb",
        "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\nheis\nimperative of heise\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]
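Both the `konkurs` output from 0.0.8 and the `heis` output from 0.0.91 show the same pattern: a later definition's text begins with the full text of an earlier one. A rough sketch of a detector for that pattern follows, assuming `text` is still a single string as in these releases. The function name `find_leaked_definitions` is hypothetical.

```python
def find_leaked_definitions(entries):
    """Return ((entry, definition), (entry, definition)) index pairs where the
    later definition's text starts with the complete text of the earlier one."""
    # Flatten all definitions, remembering where each one came from.
    texts = []
    for e_idx, entry in enumerate(entries):
        for d_idx, definition in enumerate(entry.get("definitions", [])):
            texts.append(((e_idx, d_idx), definition.get("text", "")))

    leaks = []
    for i, (pos_a, text_a) in enumerate(texts):
        for pos_b, text_b in texts[i + 1:]:
            # Identical texts are not counted; only a longer text with a leaked prefix.
            if text_a and text_b != text_a and text_b.startswith(text_a):
                leaks.append((pos_a, pos_b))
    return leaks


# Trimmed copy of the "heis" output from 0.0.91 quoted above.
noun_text = (
    "heis m (definite singular heisen, indefinite plural heiser, "
    "definite plural heisene)\n\nelevator (US), lift (UK)\n"
)
heis = [
    {"definitions": [{"partOfSpeech": "noun", "text": noun_text}]},
    {"definitions": [{"partOfSpeech": "verb",
                      "text": noun_text + "heis\nimperative of heise\n"}]},
]
print(find_leaked_definitions(heis))
```

This only catches exact prefix leaks, so it is a regression check for this specific bug rather than a general validator.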

Here's more that broke for testing (first word of every line is what the entry is for, this is a diff)-

image

suyashb95 commented 6 years ago

Whoops, added a fix in another release

C0rn3j commented 6 years ago

Okay, that looks much better, just a few things.

My scripts operate on the assumption that the inflections are before the first line break. I'm unsure if that was true for every word in 0.0.8, but it certainly was for 99.9%+ of them.

In 0.0.92 this is no longer the case for bor and a handful of other entries, like faksimile. They otherwise seem to be scraped correctly, but there are now line breaks between the two inflection lines. Is this by design, and should I write a different kind of detection? It didn't use to be that way; I think it was just a space in the other words.

image

image
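The "inflections are before the first line break" assumption described above can be sketched as follows. This is an illustration rather than the actual script from this thread; `split_inflection` is a hypothetical name, and the sample string is the `konkurs` noun text quoted earlier.

```python
def split_inflection(text):
    """Split a definition string into (inflection_line, glosses),
    treating everything before the first line break as the inflection."""
    head, _, rest = text.partition("\n")
    glosses = [line for line in rest.split("\n") if line.strip()]
    return head, glosses


text = (
    "konkurs m (definite singular konkursen, indefinite plural konkurser, "
    "definite plural konkursene)\n\na bankruptcy\n"
)
inflection, glosses = split_inflection(text)
print(inflection)
print(glosses)
```

When an inflection is spread over several lines, everything after the first line break lands in `glosses`, which is why entries like bor and faksimile trip up this kind of detection.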

Other than that it seems to have broken a single word - pantergaupe, which is now missing the inflection part.

[
  {
    "etymology": "panter +‎ gaupe",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "Iberian lynx; Lynx pardinus\n",
        "relatedWords": [
          {
            "relationshipType": "synonyms",
            "words": [
              "iberisk gaupe",
              "spansk gaupe"
            ]
          }
        ],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [
        "IPA: /pan.ter.ɡæʉ.pe/, [ˈpɑn.təɾ.ˌɡæʉ̯ː.pə]"
      ],
      "audio": []
    }
  }
]

suyashb95 commented 6 years ago

Some of the inflections are in multiple lines so they'll be parsed that way. I'm going to fix inflection parsing for other words like pantergaupe in the dev branch for now. I'm experimenting with having definitions in a list of sentences instead of one long string, let's see if that works.

C0rn3j commented 6 years ago

Ohhhh you're totally right! Never noticed nor realized this would be the problem.

image

I skimmed my definition list and apparently this was already an issue I was not handling. Your fix just made it more visible.

C0rn3j commented 6 years ago

Not sure if it's the same problem as pantergaupe, but maldivisk is missing the inflection line in the second definition (0.0.92).

https://en.wiktionary.org/wiki/maldivisk#Norwegian_Bokm%C3%A5l

image

BTW: I rewrote the detection part of my script, it seems to be working great, thanks for the fixes!

suyashb95 commented 6 years ago

Added some changes in 2ba2eea7d34d8e2ae57633210e648f6054d600ab to fix this. Also, the definition text is now a list so you may have to change your script
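For scripts that need to handle output from both before and after this change, one possible approach is to normalise `text` into a list of lines. This is a sketch assuming only the two shapes seen in this thread (one newline-separated string, or a list of strings); `definition_lines` is a hypothetical helper name.

```python
def definition_lines(definition):
    """Return the definition text as a list of non-empty lines,
    whether the parser produced one string or a list of strings."""
    text = definition.get("text", [])
    if isinstance(text, str):
        # Older shape: a single string with embedded line breaks.
        text = text.split("\n")
    return [line for line in text if line.strip()]


old_style = {"text": "konkurs (indeclinable)\n\nbankrupt\n"}
new_style = {"text": ["konkurs (indeclinable)", "bankrupt"]}
print(definition_lines(old_style))
print(definition_lines(new_style))
```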

C0rn3j commented 6 years ago

Finally kicked myself to work on my script again, changes look awesome, thanks!

C0rn3j commented 6 years ago

Okay I only looked at my inflections output, premature celebration.

Your changes at some point seem to have added garbage, in the form of the word itself, to the definitions of some words.

https://en.wiktionary.org/wiki/forrevet forrevet has a definition 'forrevet' which really shouldn't be there for example.

[
  {
    "etymology": "",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": [
          "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
          "alternative form of forreven",
          "forrevet",
          "neuter singular of forreven"
        ],
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

https://en.wiktionary.org/wiki/foreskrevet has the exact same issue and I'm sure there's a bunch of others

suyashb95 commented 6 years ago

I haven't encountered multiple subheadings under a definition yet. The subheadings usually contain inflections so the parser adds that to the list of definitions. I guess it should either not include them or separate them out from the definition list, probably in a field called word/inflections in the JSON

C0rn3j commented 6 years ago

Yeah, it should either separate them out or not include them at all. I can't simply filter out a definition X from word X, because some words really are defined that way (best in Bokmål means best).

If you need more examples where this happens - støvete, uomskåret
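To make the problem with client-side filtering concrete, here is a small sketch of the naive filter. The `forrevet` lines are copied from the output above; the `best` lines are made up for illustration (the real parser output for that word is not shown in this thread), and `drop_headword_lines` is a hypothetical helper.

```python
def drop_headword_lines(word, lines):
    """Naive filter: drop every line that consists only of the headword."""
    return [line for line in lines if line.strip() != word]


forrevet_lines = [
    "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
    "alternative form of forreven",
    "forrevet",
    "neuter singular of forreven",
]
# Hypothetical entry: inflection line followed by a gloss identical to the headword.
best_lines = ["best (hypothetical inflection line)", "best"]

print(drop_headword_lines("forrevet", forrevet_lines))  # stray subheading removed
print(drop_headword_lines("best", best_lines))          # the real gloss is lost too
```

The filter cannot tell a stray subheading from a genuine gloss, so the distinction has to be made where the page structure is still available, which is inside the parser.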

C0rn3j commented 5 years ago

It looks like one of the updates also broke nested definitions

https://en.wiktionary.org/wiki/v%C3%A6re_glad_i

image

They weren't exactly scraped perfectly in the first place it seems, but now they're not scraped at all.

image

suyashb95 commented 5 years ago

Nested definitions and examples have ambiguous formatting so figuring that out is going to take some time

C0rn3j commented 5 years ago

I've had luck with the Wiktionary contributors willing to redo old formatting and use a newer template for some snowflake definitions I ran into.

Not sure if these nested words are the case, I could ask about them, but that'd require me to go through the diff and pick them out, which right now has a lot of "garbage" I mentioned above, and it'd be a pain to go through it in this state.

image