Closed darthnithin closed 4 years ago
It seems I got around this by manually downloading the dump from https://dumps.wikimedia.org/enwiktionary/20200720/enwiktionary-20200720-pages-articles.xml.bz2, which was given in the error message, then extracting it and moving the .xml file into the /dumps directory.
From what I've been reading on Stack Overflow, and with my limited knowledge of Java, I think the problem is with the handling of URLs, perhaps somewhere in here. The error message says this:
at edu.self.w2k.util.DumpUtil.downloadFile(DumpUtil.java:61)
at edu.self.w2k.util.DumpUtil.download(DumpUtil.java:50)
at edu.self.w2k.CLI.main(CLI.java:35)
so:
DumpUtil.java:61: String fileName = Paths.get(url).getFileName().toString();
Relevant Stack Overflow threads:
openjdk bug report
stackoverflow
Just me googling the error
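For anyone landing here with the same trace: `Paths.get(url)` treats the whole string as a file system path, and on Windows `:` is not allowed there, which matches "Illegal char <:> at index 5" (the colon in `https:`). Below is a minimal sketch of one way to get the file name from the URL instead. The class and method names are hypothetical, and this is not necessarily how the project fixed it.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of wiktionary-to-kindle.
final class DumpFileName {

    // Returns the last path segment of a URL without going through
    // java.nio.file.Paths, so the "https:" scheme is never parsed as a path.
    static String fromUrl(String url) {
        String path = URI.create(url).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String url = "https://dumps.wikimedia.org/enwiktionary/20200720/"
                + "enwiktionary-20200720-pages-articles.xml.bz2";
        // prints "enwiktionary-20200720-pages-articles.xml.bz2"
        System.out.println(fromUrl(url));
    }
}
```

Parsing the string as a URI first keeps the behaviour the same on Windows, Linux and macOS.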
Hi @Nithindanday
Thanks for reporting the issue! I will take a look at it this weekend. It looks like it may be a Windows-only problem.
nyg
I got past that issue by manually downloading the dumps, but I got stuck in the Python script because (I think) of an encoding problem.
I have the very same issue as OP, but I get another error once the dump is downloaded manually, extracted and placed in /dumps. Software: JDK 15, Apache Maven 3.6.3. OS: Windows 10.
INFO: Executing: parse en 20200920
sept. 26, 2020 6:37:45 P. M. edu.self.w2k.util.DumpUtil getDumpFile
SEVERE: No dumps found.
sept. 26, 2020 6:37:45 P. M. de.tudarmstadt.ukp.jwktl.api.entry.BerkeleyDBWiktionaryEdition deleteParsedWiktionary
INFO: Removing parsed Wiktionary from db
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.io.File.getName()" because "dumpFile" is null
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.openDumpFile(XMLDumpParser.java:159)
at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:121)
at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:78)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:140)
at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:114)
at edu.self.w2k.util.DumpUtil.parse(DumpUtil.java:134)
at edu.self.w2k.CLI.main(CLI.java:40)
My knowledge of Java is also limited, so I'd appreciate any layman help. Thanks for this project; it's just what I was looking for.
@swaree @Nithindanday
Sorry for the delay… I've just pushed two minor corrections, which should solve the following error (and a second one):
java.nio.file.InvalidPathException: Illegal char <:> at index 5
To get these two corrections you need to do a git pull, both at the root of the project and in scripts/tab2opf.
@swaree
The problem is that it can't find the file in the dumps directory. What's the filename you have used? It should be something like enwiktionary-20200920-pages-articles.xml.
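To make the naming rule concrete, here is a rough sketch of a lookup that builds the expected file name and reports a readable error instead of passing `null` on to the parser. `DumpLocator` and its methods are hypothetical names for illustration, the name pattern is taken only from the example above, and this is not the project's actual `getDumpFile` code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for illustration; not the project's DumpUtil.
final class DumpLocator {

    // Assumed pattern: <lang>wiktionary-<date>-pages-articles.xml
    static String expectedName(String lang, String date) {
        return lang + "wiktionary-" + date + "-pages-articles.xml";
    }

    // Returns the dump file if it exists under dumpsDir, otherwise empty.
    static Optional<Path> find(Path dumpsDir, String lang, String date) {
        Path candidate = dumpsDir.resolve(expectedName(lang, date));
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        Path dumps = Paths.get("dumps");
        Optional<Path> dump = find(dumps, "en", "20200920");
        if (dump.isPresent()) {
            System.out.println("Using " + dump.get());
        } else {
            // Say which file was expected rather than failing later with an NPE.
            System.out.println("No dump found, expected "
                    + dumps.resolve(expectedName("en", "20200920")));
        }
    }
}
```

Returning an `Optional` forces the caller to handle the missing-file case before the parser is ever invoked.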
That solved the problem; my filename had pages-articles-mainstream instead. Thanks!
But I have encountered another error, which, as far as I know, implies that my language code gem-pro is invalid:
INFO: Executing: generate gem-pro latest
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.lang.Comparable.compareTo(Object)" because "k1" is null
at java.base/java.util.TreeMap.compare(TreeMap.java:1563)
at java.base/java.util.TreeMap.addEntryToEmptyMap(TreeMap.java:768)
at java.base/java.util.TreeMap.put(TreeMap.java:777)
at java.base/java.util.TreeMap.put(TreeMap.java:534)
at java.base/java.util.TreeSet.add(TreeSet.java:255)
at java.base/java.util.Collections.addAll(Collections.java:5593)
at de.tudarmstadt.ukp.jwktl.api.filter.WiktionaryEntryFilter.setAllowedWordLanguages(WiktionaryEntryFilter.java:86)
at edu.self.w2k.util.WiktionaryUtil.generateDictionary(WiktionaryUtil.java:28)
at edu.self.w2k.CLI.main(CLI.java:45)
If I'm correct, which codes are valid? Does it have something to do with Proto-Germanic belonging to the Reconstructed namespace?
This appears to be an issue with dkpro-jwktl (the library which I use to parse Wiktionary pages). The accepted language codes are in the following file: https://github.com/dkpro/dkpro-jwktl/blob/master/src/main/resources/de/tudarmstadt/ukp/jwktl/api/util/language_codes.txt
It appears that the code for Proto-Germanic is "gem", but I think, though I'm not sure, that it also groups other Germanic words. So you should try with "gem" and see which words are included.
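The stack trace is consistent with an unknown code being looked up, coming back as `null`, and then being added to a `TreeSet`, which rejects `null` keys. A minimal sketch of a guard that fails earlier with a clearer message follows. The `KNOWN` table is a tiny made-up subset for illustration (the real list is the `language_codes.txt` file linked above), and `LanguageCodes` is a hypothetical class, not JWKTL's API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical lookup for illustration; not JWKTL's Language class.
final class LanguageCodes {

    // Illustrative subset only; the real codes live in JWKTL's language_codes.txt.
    private static final Map<String, String> KNOWN =
            Map.of("en", "English", "de", "German", "fr", "French");

    // Resolves a code or throws with a readable message, so that no null
    // ever reaches a TreeSet (where it would cause a NullPointerException).
    static String resolve(String code) {
        String name = KNOWN.get(code);
        if (name == null) {
            throw new IllegalArgumentException("Unknown language code: " + code);
        }
        return name;
    }

    public static void main(String[] args) {
        Set<String> allowed = new TreeSet<>();
        allowed.add(resolve("en"));
        try {
            allowed.add(resolve("gem-pro"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
        System.out.println(allowed);
    }
}
```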
Again, thanks for your answer. After trying gem, the resulting lexicon.txt file is completely empty, no words or any notes whatsoever.
Indeed, I think the language list of JWKTL is not in sync with the languages on Wiktionary… (gem doesn't exist on Wiktionary but gem-pro does)
I've pushed a change adding the gem-pro language, but I haven't been able to test it yet because I need to reparse the dump… You can pull the change yourself if you want to test it now; otherwise I'll keep you up to date when my parse finishes.
Okay so it's still not working. I've just noticed that reconstructed terms are special pages on Wiktionary and I believe they are not parsed by JWKTL (e.g. https://en.wiktionary.org/wiki/Reconstruction:Proto-Germanic/auk vs https://en.wiktionary.org/wiki/auk).
I've looked into it a bit more and, indeed, JWKTL does not parse pages of reconstructed terms. If I change the JWKTL code to do so, then I get errors because some page titles are conflicting and the parse fails. I'll try to open an issue at JWKTL but the project doesn't appear to be very active anymore…
Oh, what a shame then. Thanks for looking into it. No need to bother any further; I don't want to be a burden, and asking you to fix or update someone else's work is an unreasonable request. Unless the OP wants to add anything, I'll be OK considering this issue closed. Again, the help was appreciated. Cheers, stay safe.
@swaree
Actually the conflict issue was in Wiktionary's dump. Three reconstructed terms were not named properly. More info here: https://en.wiktionary.org/wiki/Wiktionary:Tea_room/2020/October#Reconstructed_terms:_ālas,_*menoᐧtayi,_menoᐧtayi
Now that these three terms have been deleted from Wiktionary, I was able (after modifying the dump) to properly parse the dump and generate the lexicon.txt file.
Proto-Germanic/frijaz <ol><li><span>free</span></li></ol>
Proto-Germanic/hrōþiz <ol><li><span>praise, fame, glory, renown.</span></li></ol>
Proto-Germanic/gudą <ol><li><span>invoked one</span></li><li><span>{{topics|gem-pro|Religion}} god, deity</span></li></ol>
Proto-Germanic/haglaz <ol><li><span>hail (the precipitation)</span></li><li><span>(Runic alphabet) name of the H-rune (ᚺ, ᚻ)</span></li></ol>
Proto-Germanic/dagaz <ol><li><span>day</span><ul><li>{{syn|gem-pro|*tīnaz}}</li></ul></li><li><span>(Runic alphabet) name of the D-rune (ᛞ)</span></li></ol>
Proto-Germanic/wulfaz <ol><li><span>{{topics|gem-pro|Canids}} wolf</span></li></ol>
…
I may need to remove the "Proto-Germanic/" prefix for the dictionary to be usable on the Kindle, but this can be done easily. Let's wait for the next Wiktionary dump to be published (October 20th, if I'm not mistaken).
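A rough sketch of what removing the prefix could look like, applied line by line to the generated lexicon. `HeadwordPrefix` is a hypothetical name, and the sketch assumes the headword starts each line exactly as in the excerpt above.

```java
// Hypothetical helper for illustration; not part of the project.
final class HeadwordPrefix {

    // Drops the given prefix from the start of a lexicon line, if present.
    static String strip(String line, String prefix) {
        return line.startsWith(prefix) ? line.substring(prefix.length()) : line;
    }

    public static void main(String[] args) {
        String line = "Proto-Germanic/frijaz <ol><li><span>free</span></li></ol>";
        // prints "frijaz <ol><li><span>free</span></li></ol>"
        System.out.println(strip(line, "Proto-Germanic/"));
    }
}
```

Lines without the prefix are returned unchanged, so the same pass can run over a lexicon that mixes regular and reconstructed entries.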
Cool, thanks! Glad to hear it wasn't an issue with your program. And in exchange, we got three Wiktionary entries improved ^^. I'll try and share the result with you on that date.
Great, I will try it too. In the meantime I'm closing this issue, but if you have another problem then just create a new issue :).
When I try to run "java -jar target/wiktionary-to-kindle-1.0.0.jar download en latest" it errors and says:
I am running in Git Bash, but neither cmd nor PowerShell works either. I'm on Apache Maven 3.6.3 and java version "14.0.1" on Windows 10. I don't really see why it is doing that, and I would like some help with the issue. Thanks,