Open C0rn3j opened 6 years ago
Seems fixed in 39ba27422ab33e104d0f034df1e62848a5229c48
Seems fixed indeed. Thank you a LOT.
Is there anywhere I can send you a few bucks to? Paypal?
Appreciate it but, it's a hobby project so that's not necessary :D
And your hobby project is incredibly helpful to me, so if you change your mind and I ever see a donation page/button on the main page, I'll use it ^^
Actually found one more under løsrive, it's missing the inflection part - https://en.wiktionary.org/wiki/l%C3%B8srive#Norwegian_Bokm%C3%A5l
[
{
"etymology": "From løs + rive",
"definitions": [
{
"partOfSpeech": "verb",
"text": "(often reflexive, with seg / oneself)\nto break away\nto detach (oneself)\nto tear oneself away (fra / from)\nto secede (fra / from)\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [],
"audio": []
}
}
]
EDIT: And one more in Bokmål - øl - it strips the first inflection line
https://en.wiktionary.org/wiki/%C3%B8l#Norwegian_Bokm%C3%A5l
[
{
"etymology": "From Old Norse ǫl, from Proto-Germanic *alu, from Proto-Indo-European *h₂elut- (“beer”).\n",
"definitions": [
{
"partOfSpeech": "noun",
"text": "øl m (definite singular ølen, indefinite plural øl, definite plural ølene) (a glass, bottle or can of beer)\n\nbeer (alcoholic drink)\na beer (in a glass, bottle or can)\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [
"IPA: /œl/",
"Rhymes: -œl"
],
"audio": []
}
}
]
Inflections seem to be turning up properly now, although they're a part of the definition text itself
Amazing, looking forward to a new release ^^
That seems to have broken more than it fixed.
konkurs in Norwegian Bokmål in 0.0.8:
[
{
"etymology": "From Latin concursus",
"definitions": [
{
"partOfSpeech": "adjective",
"text": "konkurs (indeclinable)\n\nbankrupt\n",
"relatedWords": [],
"examples": [
"gå konkurs - go bankrupt"
]
},
{
"partOfSpeech": "noun",
"text": "konkurs (indeclinable)\n\nbankrupt\nkonkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [],
"audio": []
}
}
]
and after in 0.0.91:
[
{
"etymology": "From Latin concursus",
"definitions": [
{
"partOfSpeech": "adjective",
"text": "konkurs (indeclinable)\n\nbankrupt\n",
"relatedWords": [],
"examples": [
"gå konkurs - go bankrupt"
]
},
{
"partOfSpeech": "noun",
"text": "konkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [],
"audio": []
}
}
]
heis in 0.0.91 has a duped entry:
[
{
"etymology": "From the verb heise",
"definitions": [
{
"partOfSpeech": "noun",
"text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [],
"audio": []
}
},
{
"etymology": "From the verb heise",
"definitions": [
{
"partOfSpeech": "verb",
"text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\nheis\nimperative of heise\n",
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [],
"audio": []
}
}
]
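The konkurs (0.0.8) and heis (0.0.91) dumps above share one pattern: a later definition's text begins with the complete text of an earlier one. Here is a minimal sketch of how a script could flag that carry-over, assuming the string-valued "text" field shown in these dumps. `find_carried_over_text` is a hypothetical helper name, not part of WiktionaryParser.

```python
def find_carried_over_text(entries):
    """Return (earlier, later) index pairs where a later definition's text
    starts with the full text of an earlier definition.

    Indices count definitions in order across all entries.
    Assumes each definition's "text" is a single string.
    """
    texts = [
        definition["text"]
        for entry in entries
        for definition in entry.get("definitions", [])
    ]
    hits = []
    for later in range(len(texts)):
        for earlier in range(later):
            prefix = texts[earlier]
            # Identical texts are not counted; only a strict carry-over is.
            if prefix and texts[later] != prefix and texts[later].startswith(prefix):
                hits.append((earlier, later))
    return hits


# The heis output from 0.0.91, reduced to the fields this check needs.
heis = [
    {"definitions": [{"partOfSpeech": "noun",
                      "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\n"}]},
    {"definitions": [{"partOfSpeech": "verb",
                      "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\nheis\nimperative of heise\n"}]},
]
print(find_carried_over_text(heis))
```

This only catches the case where the earlier text is repeated verbatim at the start, which is what both dumps show; other kinds of duplication would need a different check.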
Here's more that broke for testing (first word of every line is what the entry is for, this is a diff)-
Whoops, added a fix in another release
Okay, that looks much better, just a few things.
My scripts operate on the assumption that the inflections are before the first line break. I'm unsure if that was true for every word in 0.0.8, but it certainly was for 99.9%+ of them.
In 0.0.92 this is no longer the case with bor and a handful of other entries, like faksimile. While they otherwise seem to get scraped correctly, the parser adds line breaks between the two inflection lines. Is this by design, and should I write a different kind of detection? It didn't use to be that way; I think it was just a space in the other words.
Other than that, it seems to have broken a single word, pantergaupe, which is now missing the inflection part.
[
{
"etymology": "panter + gaupe",
"definitions": [
{
"partOfSpeech": "noun",
"text": "Iberian lynx; Lynx pardinus\n",
"relatedWords": [
{
"relationshipType": "synonyms",
"words": [
"iberisk gaupe",
"spansk gaupe"
]
}
],
"examples": []
}
],
"pronunciations": {
"text": [
"IPA: /pan.ter.ɡæʉ.pe/, [ˈpɑn.təɾ.ˌɡæʉ̯ː.pə]"
],
"audio": []
}
}
]
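For reference, the "inflections are before the first line break" assumption mentioned above can be sketched roughly like this. It assumes the string-valued "text" field from the dumps in this thread. Both function names are hypothetical, and the headword check is only a heuristic based on how the inflection lines in these dumps begin.

```python
def split_inflection(text):
    """Split a string-valued definition text at the first line break.

    Returns (head, senses): head is everything before the first newline,
    senses are the remaining non-empty lines.
    """
    head, _, rest = text.partition("\n")
    senses = [line for line in rest.split("\n") if line.strip()]
    return head, senses


def looks_like_inflection(word, head):
    """Heuristic: the inflection lines in the dumps start with the headword."""
    return head.startswith(word)


# heis (noun) from the 0.0.91 dump: the first line is the inflection line.
head, senses = split_inflection(
    "heis m (definite singular heisen, indefinite plural heiser, "
    "definite plural heisene)\n\nelevator (US), lift (UK)\n"
)
print(looks_like_inflection("heis", head), senses)

# pantergaupe from the 0.0.92 dump: the first line is actually a sense.
head, senses = split_inflection("Iberian lynx; Lynx pardinus\n")
print(looks_like_inflection("pantergaupe", head), senses)
```

With the pantergaupe text the split silently treats the only sense as the inflection line, which is why the headword check is worth running before trusting the first line.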
Some of the inflections are in multiple lines so they'll be parsed that way. I'm going to fix inflection parsing for other words like pantergaupe in the dev branch for now. I'm experimenting with having definitions in a list of sentences instead of one long string, let's see if that works.
Ohhhh you're totally right! Never noticed nor realized this would be the problem.
I skimmed my definition list and apparently this was already an issue I was not handling. Your fix just made it more visible.
Not sure if it's the same problem as pantergaupe, but maldivisk is missing the inflection line in the second definition (0.0.92).
https://en.wiktionary.org/wiki/maldivisk#Norwegian_Bokm%C3%A5l
BTW: I rewrote the detection part of my script, it seems to be working great, thanks for the fixes!
Added some changes in 2ba2eea7d34d8e2ae57633210e648f6054d600ab to fix this. Also, the definition text is now a list so you may have to change your script
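For scripts that need to read both the older output (one string with line breaks) and the newer list form, a small normalizing step may be enough. This is a sketch under that assumption; `definition_lines` is a hypothetical helper, not something the parser provides.

```python
def definition_lines(definition):
    """Return the definition text as a list of non-empty lines.

    Accepts both shapes seen in this thread: a single string with
    embedded line breaks, or a list of strings.
    """
    text = definition["text"]
    if isinstance(text, str):
        text = text.split("\n")
    return [line for line in text if line.strip()]


old_style = {"text": "konkurs (indeclinable)\n\nbankrupt\n"}
new_style = {"text": ["konkurs (indeclinable)", "bankrupt"]}
print(definition_lines(old_style) == definition_lines(new_style))
```

Downstream code can then work on a list regardless of which release produced the data.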
Finally kicked myself to work on my script again, changes look awesome, thanks!
Okay I only looked at my inflections output, premature celebration.
Your changes at some point seem to have added garbage, in the form of the word name, to some words.
For example, https://en.wiktionary.org/wiki/forrevet (forrevet) has a definition 'forrevet', which really shouldn't be there.
[
{
"etymology": "",
"definitions": [
{
"partOfSpeech": "adjective",
"text": [
"forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
"alternative form of forreven",
"forrevet",
"neuter singular of forreven"
],
"relatedWords": [],
"examples": []
}
],
"pronunciations": {
"text": [],
"audio": []
}
}
]
https://en.wiktionary.org/wiki/foreskrevet has the exact same issue and I'm sure there's a bunch of others
I haven't encountered multiple subheadings under a definition yet. The subheadings usually contain inflections so the parser adds that to the list of definitions. I guess it should either not include them or separate them out from the definition list, probably in a field called word/inflections in the JSON
Yeah, it should either separate them out or leave them out, as I can't simply filter out cases where word X contains definition X, because some words really are that way (best in Bokmål means best).
If you need more examples where this happens: støvete, uomskåret.
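To make the ambiguity concrete, here is a sketch of the naive client-side filter and why it is not safe. `drop_headword_lines` is a hypothetical helper. The forrevet list is the one from the dump above, while the "best" list is made up for illustration and is not real parser output.

```python
def drop_headword_lines(word, lines):
    """Naive filter: drop every line that is exactly the headword."""
    return [line for line in lines if line.strip() != word]


# From the forrevet dump above: the bare "forrevet" line is unwanted.
forrevet = [
    "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
    "alternative form of forreven",
    "forrevet",
    "neuter singular of forreven",
]
print(drop_headword_lines("forrevet", forrevet))

# Made-up data: a word whose English definition equals the headword.
best = ["best (hypothetical inflection line)", "best"]
print(drop_headword_lines("best", best))
```

The first call removes the stray line as intended, but the second also removes the only real definition, so the distinction has to come from the parser rather than from a filter on its output.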
It looks like one of the updates also broke nested definitions
https://en.wiktionary.org/wiki/v%C3%A6re_glad_i
They weren't exactly scraped perfectly in the first place it seems, but now they're not scraped at all.
Nested definitions and examples have ambiguous formatting so figuring that out is going to take some time
I've had luck with the Wiktionary contributors willing to redo old formatting and use a newer template for some snowflake definitions I ran into.
Not sure if these nested words are the case, I could ask about them, but that'd require me to go through the diff and pick them out, which right now has a lot of "garbage" I mentioned above, and it'd be a pain to go through it in this state.
Out of all the issues I opened here this one is the most important to me as I've used this project for creation of a Kindle-compatible dictionary, and incomplete/missing entries are the bane of every dictionary project ^^
https://en.wiktionary.org/wiki/seg#Norwegian_Bokm%C3%A5l Missing completely
https://en.wiktionary.org/wiki/ham#Norwegian_Bokm%C3%A5l Missing completely
https://en.wiktionary.org/wiki/by#Norwegian_Bokm%C3%A5l Missing the verb definition
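Cases like the three above can be caught mechanically once the expected parts of speech are known. Below is a minimal sketch that works on the list-of-entries shape shown in the JSON dumps in this thread. `check_entries` is a hypothetical helper, and the sample data is made up to mirror the "missing the verb definition" case, since what the parser actually returns for a completely missing word is not shown here.

```python
def check_entries(entries, expected_parts_of_speech=()):
    """Return a list of human-readable problems for one parsed word."""
    if not entries:
        return ["no entries parsed"]
    # Collect every part of speech that has a definition.
    found = {
        definition.get("partOfSpeech")
        for entry in entries
        for definition in entry.get("definitions", [])
    }
    if not found:
        return ["no definitions parsed"]
    return [
        f"missing {pos} definition"
        for pos in expected_parts_of_speech
        if pos not in found
    ]


# Made-up result with only a noun definition, checked against noun + verb.
sample = [{"definitions": [{"partOfSpeech": "noun", "text": ["(placeholder)"]}]}]
print(check_entries(sample, expected_parts_of_speech=("noun", "verb")))
print(check_entries([]))
```

The expected parts of speech would have to come from somewhere else, for example a hand-maintained list for the words that matter most.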
Here's a list of errors from my project for words in Norwegian Bokmål. It is entirely possible that some errors are due to a mistake in my own scripts, but all the ones I checked were thrown due to WiktionaryParser not parsing them properly or at all.
https://haste.rys.pw/raw/vevafamiwo
Another half-broken entry -
https://en.wiktionary.org/wiki/for#Norwegian_Bokm%C3%A5l