nyg / wiktionary-to-kindle

Converts Wiktionary entries into a MOBI dictionary for Kindle. It works, but it does not render templates, which makes it useless. Ideally, use Wiktionary HTML dumps.

Not able to download the dump... #4

Closed darthnithin closed 4 years ago

darthnithin commented 4 years ago

When I try to run "java -jar target/wiktionary-to-kindle-1.0.0.jar download en latest" it errors and says:

$ java -jar target/wiktionary-to-kindle-1.0.0.jar download en latest
Jul 22, 2020 4:40:22 PM edu.self.w2k.CLI main
INFO: Executing: download en latest
Exception in thread "main" java.nio.file.InvalidPathException: Illegal char <:> at index 5: https://dumps.wikimedia.org/enwiktionary/20200720/enwiktionary-20200720-pages-articles.xml.bz2
        at java.base/sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
        at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
        at java.base/sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
        at java.base/sun.nio.fs.WindowsPath.parse(WindowsPath.java:92)
        at java.base/sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:229)
        at java.base/java.nio.file.Path.of(Path.java:147)
        at java.base/java.nio.file.Paths.get(Paths.java:69)
        at edu.self.w2k.util.DumpUtil.downloadFile(DumpUtil.java:61)
        at edu.self.w2k.util.DumpUtil.download(DumpUtil.java:50)
        at edu.self.w2k.CLI.main(CLI.java:35)

I am running this in Git Bash, but neither cmd nor PowerShell works either. I'm on Apache Maven 3.6.3 and java version "14.0.1" on Windows 10. I don't really see why this is happening and would appreciate some help with the issue. Thanks.

darthnithin commented 4 years ago

It seems I got around this by manually downloading the dump from https://dumps.wikimedia.org/enwiktionary/20200720/enwiktionary-20200720-pages-articles.xml.bz2 (the URL given in the error message), then extracting it and moving the .xml file into the /dumps directory.

darthnithin commented 4 years ago

From what I've been reading on Stack Overflow, and with my limited knowledge of Java, I think the problem is with the handling of URLs, perhaps somewhere in the frames the error message points to:

        at edu.self.w2k.util.DumpUtil.downloadFile(DumpUtil.java:61)
        at edu.self.w2k.util.DumpUtil.download(DumpUtil.java:50)
        at edu.self.w2k.CLI.main(CLI.java:35)

So DumpUtil.java:61 is: String fileName = Paths.get(url).getFileName().toString(); Relevant links: OpenJDK bug report, Stack Overflow. (Just me googling the error.)
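
The failure can be reproduced outside the project: on Windows, Paths.get treats the whole URL as a file-system path and rejects the ":" after "https". Below is a minimal sketch of one way to get the file name without going through Paths, assuming the URL's path ends in the file name. fileNameFromUrl is a hypothetical helper, not the project's code and not necessarily the fix that was later pushed.

```java
import java.net.URI;

public class FileNameFromUrl {

    // Hypothetical helper: returns the last segment of the URL's path component.
    static String fileNameFromUrl(String url) {
        String path = URI.create(url).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String url = "https://dumps.wikimedia.org/enwiktionary/20200720/"
                + "enwiktionary-20200720-pages-articles.xml.bz2";
        // prints enwiktionary-20200720-pages-articles.xml.bz2
        System.out.println(fileNameFromUrl(url));
    }
}
```

Parsing the string as a URI keeps the scheme and host out of the file-system layer, so the result does not depend on the operating system.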

nyg commented 4 years ago

Hi @Nithindanday

Thanks for reporting the issue! I will take a look at it this weekend. It looks like it may be a Windows-only problem.

nyg

darthnithin commented 4 years ago

I got past that issue by manually downloading the dumps, but I got stuck in the Python script because (I think) of an encoding problem.

swaree commented 4 years ago

I have the very same issue as the OP, but I get another error once the dump is downloaded manually, extracted and placed in /dumps. Software: JDK 15, Apache Maven 3.6.3. OS: Windows 10.

INFO: Executing: parse en 20200920
sept. 26, 2020 6:37:45 P. M. edu.self.w2k.util.DumpUtil getDumpFile
SEVERE: No dumps found.
sept. 26, 2020 6:37:45 P. M. de.tudarmstadt.ukp.jwktl.api.entry.BerkeleyDBWiktionaryEdition deleteParsedWiktionary
INFO: Removing parsed Wiktionary from db
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.io.File.getName()" because "dumpFile" is null
        at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.openDumpFile(XMLDumpParser.java:159)
        at de.tudarmstadt.ukp.jwktl.parser.XMLDumpParser.parse(XMLDumpParser.java:121)
        at de.tudarmstadt.ukp.jwktl.parser.WiktionaryDumpParser.parse(WiktionaryDumpParser.java:78)
        at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:140)
        at de.tudarmstadt.ukp.jwktl.JWKTL.parseWiktionaryDump(JWKTL.java:114)
        at edu.self.w2k.util.DumpUtil.parse(DumpUtil.java:134)
        at edu.self.w2k.CLI.main(CLI.java:40)

My knowledge of Java is also limited, so I'd appreciate any layman help. Thanks for this project; it's just what I was looking for.

nyg commented 4 years ago

@swaree @Nithindanday

Sorry for the delay… I've just pushed two minor corrections which should solve:

  1. the download issue on Windows (java.nio.file.InvalidPathException: Illegal char <:> at index 5) and
  2. the encoding issue with the Python script.

To get these two corrections you need to do a git pull both at the root of the project and in scripts/tab2opf.

@swaree

The problem is that it can't find the file in the dumps directory. What's the filename you have used? It should be something like enwiktionary-20200920-pages-articles.xml.
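
To make the naming requirement concrete, here is a minimal sketch of a lookup that reports a missing dump instead of passing null on to the parser. It assumes the file name follows the pattern of the example above (language code, "wiktionary-", date, "-pages-articles.xml"). DumpLookup, expectedName and findDump are hypothetical names; the project's own DumpUtil.getDumpFile is not shown in this thread and may work differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

public class DumpLookup {

    // Assumed naming scheme, taken from the example file name in this thread.
    static String expectedName(String lang, String date) {
        return lang + "wiktionary-" + date + "-pages-articles.xml";
    }

    // Hypothetical lookup: returns the dump only if a file with the exact expected name exists.
    static Optional<Path> findDump(Path dumpsDir, String lang, String date) throws IOException {
        String wanted = expectedName(lang, date);
        if (!Files.isDirectory(dumpsDir)) {
            return Optional.empty();
        }
        try (Stream<Path> files = Files.list(dumpsDir)) {
            return files
                    .filter(p -> p.getFileName().toString().equals(wanted))
                    .findFirst();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dumpsDir = Path.of("dumps");
        Optional<Path> dump = findDump(dumpsDir, "en", "20200920");
        if (dump.isEmpty()) {
            // Fail with a readable message rather than a NullPointerException later on.
            System.err.println("No dump named " + expectedName("en", "20200920")
                    + " in " + dumpsDir.toAbsolutePath());
            return;
        }
        System.out.println("Using " + dump.get());
    }
}
```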

swaree commented 4 years ago

The problem is that it can't find the file in the dumps directory. What's the filename you have used? It should be something like enwiktionary-20200920-pages-articles.xml.

That solved the problem; my filename had pages-articles-multistream instead. Thanks! But I have encountered another error which, as far as I can tell, implies that my language code gem-pro is invalid:

INFO: Executing: generate gem-pro latest
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "java.lang.Comparable.compareTo(Object)" because "k1" is null
        at java.base/java.util.TreeMap.compare(TreeMap.java:1563)
        at java.base/java.util.TreeMap.addEntryToEmptyMap(TreeMap.java:768)
        at java.base/java.util.TreeMap.put(TreeMap.java:777)
        at java.base/java.util.TreeMap.put(TreeMap.java:534)
        at java.base/java.util.TreeSet.add(TreeSet.java:255)
        at java.base/java.util.Collections.addAll(Collections.java:5593)
        at de.tudarmstadt.ukp.jwktl.api.filter.WiktionaryEntryFilter.setAllowedWordLanguages(WiktionaryEntryFilter.java:86)
        at edu.self.w2k.util.WiktionaryUtil.generateDictionary(WiktionaryUtil.java:28)
        at edu.self.w2k.CLI.main(CLI.java:45)

If I'm correct, which codes are valid? Does it have something to do with Proto-Germanic belonging to the Reconstructed namespace?

nyg commented 4 years ago

This appears to be an issue with dkpro-jwktl (the library I use to parse Wiktionary pages). The accepted language codes are in the following file: https://github.com/dkpro/dkpro-jwktl/blob/master/src/main/resources/de/tudarmstadt/ukp/jwktl/api/util/language_codes.txt

It appears that the code for Proto-Germanic is "gem", but I think, though I'm not sure, that it also groups together other Germanic words. So you should try with "gem" and see which words are included.
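
The stack trace above ends in TreeSet.add, which throws a NullPointerException when given a null element. Here is a minimal sketch of that mechanism, and of validating a code before using it. The KNOWN table and the resolve method are hypothetical stand-ins with placeholder entries; they are not copied from language_codes.txt and are not JWKTL's API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class LanguageCodeCheck {

    // Illustrative table only; the real list lives in JWKTL's language_codes.txt.
    static final Map<String, String> KNOWN = Map.of("en", "English", "gem", "Germanic");

    // Returns the language name, or throws with a readable message for unknown codes.
    static String resolve(String code) {
        String name = KNOWN.get(code);
        if (name == null) {
            throw new IllegalArgumentException("Unknown language code: " + code);
        }
        return name;
    }

    public static void main(String[] args) {
        Set<String> allowed = new TreeSet<>();
        try {
            // The lookup yields null for an unknown code, and TreeSet rejects null.
            allowed.add(KNOWN.get("gem-pro"));
        } catch (NullPointerException e) {
            System.out.println("TreeSet refused the null element");
        }
        allowed.add(resolve("en"));
        System.out.println(allowed); // prints [English]
    }
}
```

Checking the lookup result up front turns an opaque NullPointerException deep inside TreeMap into an error that names the offending code.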

swaree commented 4 years ago

Again, thanks for your answer. After trying gem, the resulting lexicon.txt file is completely empty, no words or any notes whatsoever.

nyg commented 4 years ago

Indeed, I think the language list of JWKTL is not in sync with the languages on Wiktionary… (gem doesn't exist on Wiktionary but gem-pro does)

I've pushed a change adding the gem-pro language, but I haven't been able to test it yet because I need to reparse the dump… You can pull the change yourself if you want to test it now; otherwise I'll keep you up to date when my parse finishes.

nyg commented 4 years ago

Okay so it's still not working. I've just noticed that reconstructed terms are special pages on Wiktionary and I believe they are not parsed by JWKTL (e.g. https://en.wiktionary.org/wiki/Reconstruction:Proto-Germanic/auk vs https://en.wiktionary.org/wiki/auk).
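
The difference between the two page titles can be sketched as follows, assuming reconstructed-term titles always have the form "Reconstruction:", then the language name, a slash, and the term, as in the example URL above. ReconstructionTitle and parse are hypothetical names and have nothing to do with how JWKTL handles titles.

```java
public class ReconstructionTitle {

    static final String NAMESPACE = "Reconstruction:";

    // Returns {language, term} for a reconstructed-term title, or null for any other title.
    static String[] parse(String title) {
        if (!title.startsWith(NAMESPACE)) {
            return null;
        }
        String rest = title.substring(NAMESPACE.length());
        int slash = rest.indexOf('/');
        if (slash < 0) {
            return null;
        }
        return new String[] { rest.substring(0, slash), rest.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("Reconstruction:Proto-Germanic/auk");
        System.out.println(parts[0] + " | " + parts[1]); // prints Proto-Germanic | auk
        System.out.println(parse("auk") == null);        // prints true
    }
}
```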

nyg commented 4 years ago

I've looked into it a bit more and, indeed, JWKTL does not parse pages of reconstructed terms. If I change the JWKTL code to do so, then I get errors because some page titles are conflicting and the parse fails. I'll try to open an issue at JWKTL but the project doesn't appear to be very active anymore…

swaree commented 4 years ago

Oh, what a shame then. Thanks for looking into it. No need to bother any further; I don't want to be a burden, and asking you to fix or update someone else's work is an unreasonable request. Unless the OP wants to add anything, I'll be OK considering this issue closed. Again, the help was appreciated. Cheers, stay safe.

nyg commented 4 years ago

@swaree

Actually the conflict issue was in Wiktionary's dump. Three reconstructed terms were not named properly. More info here: https://en.wiktionary.org/wiki/Wiktionary:Tea_room/2020/October#Reconstructed_terms:_ālas,_*menoᐧtayi,_menoᐧtayi

Now that these three terms have been deleted from Wiktionary, I was able (after modifying the dump) to properly parse the dump and generate the lexicon.txt file.

Proto-Germanic/frijaz   <ol><li><span>free</span></li></ol>
Proto-Germanic/hrōþiz   <ol><li><span>praise, fame, glory, renown.</span></li></ol>
Proto-Germanic/gudą <ol><li><span>invoked one</span></li><li><span>{{topics|gem-pro|Religion}} god, deity</span></li></ol>
Proto-Germanic/haglaz   <ol><li><span>hail (the precipitation)</span></li><li><span>(Runic alphabet) name of the H-rune (ᚺ, ᚻ)</span></li></ol>
Proto-Germanic/dagaz    <ol><li><span>day</span><ul><li>{{syn|gem-pro|*tīnaz}}</li></ul></li><li><span>(Runic alphabet) name of the D-rune (ᛞ)</span></li></ol>
Proto-Germanic/wulfaz   <ol><li><span>{{topics|gem-pro|Canids}} wolf</span></li></ol>
…

I may need to remove the "Proto-Germanic/" prefix for the dictionary to be usable on the Kindle, but this can be done easily. Let's wait for the next Wiktionary dump to be published (October 20th, if I'm not mistaken).
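
Removing the prefix could look like the sketch below. It assumes each lexicon.txt line starts with the headword and that the headword and definition are tab-separated, which the sample above suggests but does not confirm. StripPrefix and stripPrefix are hypothetical names, not part of the project.

```java
public class StripPrefix {

    // Removes the prefix only when the line starts with it; other lines are returned unchanged.
    static String stripPrefix(String line, String prefix) {
        return line.startsWith(prefix) ? line.substring(prefix.length()) : line;
    }

    public static void main(String[] args) {
        String line = "Proto-Germanic/frijaz\t<ol><li><span>free</span></li></ol>";
        System.out.println(stripPrefix(line, "Proto-Germanic/"));
    }
}
```

Matching only at the start of the line leaves any "Proto-Germanic/" that appears inside a definition untouched.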

swaree commented 4 years ago

Cool, thanks! Glad to hear it wasn't an issue with your program. And in exchange, we got three Wiktionary entries improved ^^. I'll try and share the result with you on that date.

nyg commented 4 years ago

Great, I will try it too. In the meantime I'm closing this issue, but if you have another problem then just create a new issue :).