pjheslin / diogenes

Diogenes: an environment for reading Latin and Greek
https://d.iogen.es/d

TLL pdf names changed #51

Closed mingshey closed 4 years ago

mingshey commented 4 years ago

"Download TLL PDFs" didn't work, maybe because the TLL PDFs have been renamed. I downloaded the PDFs from TLL Open Access one by one and edited dependencies/data/tll-pdf-list.txt to match the downloaded file names.

To avoid manual downloads in future, I also tried editing the last statement of server/tll-pdf-download.pl, namely sub tll_files { return (...) };. But it did not work: only (almost) empty files of a few kilobytes were downloaded.
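Files of only a few kilobytes may be error or redirect pages saved under a .pdf name rather than real PDFs; that is a guess, not something confirmed in this thread. A minimal sketch for checking it, relying only on the fact that PDF files begin with the bytes `%PDF-` (the function name `looks_like_pdf` is hypothetical and not part of Diogenes):

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as fh:
        # Every PDF begins with "%PDF-" followed by a version number.
        return fh.read(5) == b"%PDF-"
```

Running this over the download folder would show which files need to be fetched again.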

pjheslin commented 4 years ago

Yes -- thanks for letting me know -- the BAdW have changed their website. The paths and filenames of the TLL files have all changed. That's very annoying. When they first released the TLL PDFs, they had a note on their website asking that people only download the files from their site. I respected that request, even though it flagrantly contradicts the terms of the CC license they chose. But if they are going to change the location of the files, I'll need to host them elsewhere.

They now seem to have removed that strange request from their website. So what I think I will do is to host them somewhere like github (in keeping with the terms of the license) and have Diogenes download them from there.

This will take some time to fix, so please be patient.

mingshey commented 4 years ago

Thank you for your tremendous work. For now, I have managed to get the TLL working and am enjoying Diogenes with the new features introduced in version 4.x. The OLD shows page mismatches for a few words, but that doesn't bother me much: the mismatch is small, scrolling a few pages is no big deal, and I come across related words along the way, as I often do with paper dictionaries. Or I could use my spare time tweaking the lookup table.

Best Regards, Mingshey.

pjheslin commented 4 years ago

I'm glad you got it working. With the OLD, I think there are differing PDFs in circulation which vary in how many pages of prefatory material they have. The page numbers are correct for the PDF I have, but if they are offset for yours, an easy way to fix it would be to add or delete the required number of pages from the beginning of your PDF.

mingshey commented 4 years ago

Thank you for your attention and advice. In my case the OLD landed accurately on atque, oscillum, and zythos, for example, but oryza led to page 1233, the head entry for O, instead of 1295. This was a very exceptional case: I found that the entry "oryza" was missing from old-bookmarks.txt, but adding "oryza1295" after "orsa1294" did not fix it. Some persistent cache, I assume?

mingshey commented 4 years ago

I had the wrong idea about the entries in old-bookmarks.txt. Now I see that it is the list of the last entry word on each page. So I suspect that the L-S entry for 'oryza' contains some character that sorts before 'b', so that the algorithm thinks it comes after 'nysigena' and before 'obaerari'.

mingshey commented 4 years ago

My speculation is that the L-S entry for 'oryza' in lat.ls.perseus-eng1.xml has "o^ry_za" as its key, and that the caret (^) precedes the lowercase letters in the ASCII table. So if the algorithm uses ASCII-based ordering for the search, o^ry_za would be placed before obaerari, which is on page 1233. On this assumption, my suggestion is to remove all diacritic characters from the key before looking up the entry's page in the OLD. Thank you for paying attention to my humble opinion.
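The hypothesis can be reproduced in a few lines. This is only a sketch, assuming the lookup is a binary search over the last headword of each page compared by code point; it is not taken from the Diogenes source. The names `BOOKMARKS`, `strip_marks` and `page_for` are hypothetical, and apart from "orsa" on 1294 and "obaerari" on 1233 the rows are made up for illustration.

```python
import bisect

# Last headword on each page, with its page number (illustrative excerpt only;
# the "nysigena" and "oscen" page numbers are assumptions).
BOOKMARKS = [
    ("nysigena", 1232),
    ("obaerari", 1233),
    ("orsa", 1294),
    ("oscen", 1295),
]

def strip_marks(key):
    """Drop the quantity marks '^' and '_' used in L-S keys."""
    return key.replace("^", "").replace("_", "").lower()

def page_for(word, bookmarks):
    """Return the first page whose last headword sorts at or after `word`."""
    keys = [k for k, _ in bookmarks]
    i = bisect.bisect_left(keys, word)   # plain code-point comparison
    i = min(i, len(bookmarks) - 1)       # clamp words past the final page
    return bookmarks[i][1]

raw = "o^ry_za"
print(page_for(raw, BOOKMARKS))               # '^' (0x5E) sorts before 'b' (0x62)
print(page_for(strip_marks(raw), BOOKMARKS))  # compared as plain "oryza"
```

With the raw key the search stops at the page ending in "obaerari"; with the marks stripped it reaches the page after "orsa", which matches the behaviour described above.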

mingshey commented 4 years ago

I think the OLD PDF search problem is a separate matter, so I'm going to open a new issue for it.

pjheslin commented 4 years ago

Thanks! I'll leave this issue open until I fix the problem with downloading the TLL PDFs.

pjheslin commented 4 years ago

Commit d5f6bf1e1b65d58534d422ad0d988e6d4d65f56b fixes the TLL download bug. It still downloads them from the BAdW website, from the new locations, but uses the old filenames for backward compatibility. I'll release a new version with this fix soon. Thanks again for reporting the problem!

mingshey commented 4 years ago

Glad it was worth reporting. I'll be waiting for the new build. Thanks!

mingshey commented 4 years ago

I tried out version 4.5 overnight. TLL PDF download works perfectly on Linux (Ubuntu 18.04 LTS) and the references worked all right, but on Windows 10 the download stopped almost as soon as it started, without downloading anything. I suspect a difference in path name handling between the two OSes.

Edit: Even after I copied the downloaded tll-pdfs folder from Linux to Windows, the TLL link fails to find the file and shows this error message:

404 Not Found Requested pdf file (D:\Diogenes-Data\tll-pdfs\000914819{ThLL vol. 09.2 col. 0625?1214 (omnividentia?ozynosus)}[CC BY-NC-ND].pdf) was not found.

The question marks in the error message suggest a failure to interpret the Unicode en dash (–), if that helps.

pjheslin commented 4 years ago

Apologies for not following up on this. This does look like an issue with Unicode filenames on Windows. Downloading the TLL files does work on Windows 10 for me. The difference, I suspect, is that I have a Western default encoding (codepage) for my Windows machine, but you may not.
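The symptom can be illustrated outside Diogenes. This is a sketch of the general mechanism only, not of what the Perl code actually does: the en dash (U+2013) exists in the Western code page cp1252, so it survives a round trip there, but in an encoding that lacks the character it is replaced by "?". Here `ascii` is a stand-in for such an encoding, and `roundtrip` is a hypothetical helper.

```python
# Part of a TLL file name, with the en dash written as an escape.
NAME = "ThLL vol. 09.2 col. 0625\u20131214"

def roundtrip(name, codec):
    """Encode `name` in `codec`, replacing unencodable characters, then decode."""
    return name.encode(codec, errors="replace").decode(codec)

print(roundtrip(NAME, "cp1252") == NAME)  # the dash survives in cp1252
print(roundtrip(NAME, "ascii"))           # the dash is replaced by '?'
```

A name altered in this way no longer matches the file on disk, which would explain a 404 even when the file is present.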

This is also a problem when people put the PHI/TLG databases in a folder path with Unicode accents under Windows. I was never able to fix that problem.

The best solution for the TLL files is that in the future I should probably rename them when downloading them to TLL01.pdf and so on.

mingshey commented 4 years ago

Thank you for your advice. I could write a simple batch file to handle the renaming.
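A rough sketch of such a renaming step, written in Python rather than as a batch file. It assumes that sorting the existing names gives the volume order and that TLL01.pdf, TLL02.pdf and so on is the target scheme; neither is confirmed here, and the entries in tll-pdf-list.txt would presumably need to be changed to match. The function names are hypothetical.

```python
from pathlib import Path

def plan_renames(folder):
    """Pair each PDF in `folder` (sorted by name) with an ASCII-only name."""
    pdfs = sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")
    return [(p, p.with_name(f"TLL{i:02d}.pdf")) for i, p in enumerate(pdfs, start=1)]

def apply_renames(plan, dry_run=True):
    """Print the planned renames, or carry them out when dry_run is False."""
    for old, new in plan:
        if dry_run:
            print(f"{old.name} -> {new.name}")
        elif not new.exists():  # never overwrite an existing file
            old.rename(new)
```

Calling `apply_renames(plan_renames(folder))` first, with the default `dry_run=True`, shows the mapping before anything on disk changes.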