Documentation of format

daphniz commented 4 years ago

This isn't really your job of course and may not interest you, but I'd value any knowledge you took the time to share on the .txt and .idt formats, beyond your source-code comments, as documentation or here. (Unfortunately my Perl reading skills aren't very good.) I found out so far that signed bytes set or increment counters, but not yet exactly how. For example, in an author indexed by work, page, line, what would set and increment these indices? What indicates how an author is indexed? I guess from your comments that there isn't much help to find anywhere else? Any hint appreciated. Thanks.

pjheslin commented 4 years ago

It was 20 years ago that I reverse-engineered the format of the databases, so my memory of much of it is quite hazy. I had no documentation at all when I did it, but after I released the first version of Diogenes, people sent me bits and pieces of PHI documentation. So I now have some of that. I could put the documents in a repo and upload it to Github, if you would find that interesting.

The format is very ingenious, but it was designed for a time when a computer might have had 8KB of free RAM. That's why each file is split into 8KB blocks, and each block is independent: it has all the information you need. The counters are manipulated as binary numbers. I believe that in the original Ibycus microcomputer they were decoded in hardware.
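To make the block structure concrete, here is a minimal Python sketch of walking a data file block by block. It rests on assumptions that are not confirmed in this thread: that "8KB" means exactly 8192 bytes, and that the "signed bytes" mentioned above are the bytes with the high bit set, while everything else is 7-bit text. The names `iter_blocks` and `split_runs` are made up for illustration, and the sketch does not try to decode what the counter bytes mean.

```python
from typing import Iterator, List, Tuple

BLOCK_SIZE = 8 * 1024  # assumption: "8KB" means 8192 bytes


def iter_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size blocks; the last one may be shorter."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def split_runs(block: bytes) -> List[Tuple[str, bytes]]:
    """Split a block into alternating runs of 'control' and 'text' bytes.

    Assumption: a byte with the high bit set (negative when read as a
    signed char) belongs to the counter data, anything else is 7-bit text.
    """
    runs: List[Tuple[str, bytearray]] = []
    for b in block:
        kind = "control" if b & 0x80 else "text"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(b)  # extend the current run
        else:
            runs.append((kind, bytearray([b])))  # start a new run
    return [(kind, bytes(buf)) for kind, buf in runs]
```

Because each block is described as independent, a reader built this way could process one block at a time without keeping the rest of the file in memory.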

One of the reasons I wanted to implement XML output was to spare humanity the necessity of ever looking at the old database format again!

techvslife commented 4 years ago

From experience, I would strongly advise working with the Diogenes converted unicode and xml format files, rather than the original 7-bit betacode 8KB-blocked tlg data files.

You're not fooling about the original memory limitations: from http://stephanus.tlg.uci.edu/history.php

Migration out of Ibycus was a Herculean effort that lasted several months. Thousands of texts had to be downloaded from Ibycus using a 2200 baud modem. For two months, project staff worked 8-hour shifts downloading texts, one at a time. Large files were particularly difficult to download because Ibycus froze whenever a text exceeded its memory capabilities. Some texts were corrupted and had to be manually reconstructed. The Canon file containing thousands of bibliographical records was too large to download and had to be broken into pieces. When it was finally extracted, all formatting was lost and had to be re-entered manually. In September 1999, the project said farewell to Ibycus. The HP-1000 was disconnected and replaced by the new in-house system.

p.s. This will also convert tlg but (unlike Diogenes) fails with the largest tlg files (it fails with Plutarch and Aristotle, but does convert shorter files): https://cental.uclouvain.be/beta2uni/

Haven't tried this but it may work very well: http://tlgu.carmen.gr/

daphniz commented 4 years ago

@pjheslin, I wasn't aware of xml-export.pl, but that's very helpful, thanks! What I mainly wanted was just to have whole texts without having to go through a browser. Nevertheless, if you would like to share what documentation you have of the PHI format, I'd certainly find that interesting.

pjheslin commented 4 years ago

If you run xml-export.pl from the command-line, you get more options than if you run it from the GUI. The tools mentioned by @techvslife are, I believe, only for converting the bare Greek text from Beta code to Unicode, which is a fairly trivial operation. Diogenes tries to preserve the formatting information present in the texts, which in many cases is essential for the semantics.
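As a rough illustration of why the bare text conversion is simple, here is a minimal Python sketch of Beta code to Unicode for plain Greek letters and the common diacritics. It is not how Diogenes or the tools linked above do it; the function name `beta_to_unicode` is made up, and the sketch ignores the formatting escapes and special symbols found in the real texts, which is exactly the part that takes work.

```python
import unicodedata

# Beta code letters mapped to lowercase Greek.
LETTERS = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta code diacritics mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta: str) -> str:
    """Convert bare Beta code to precomposed (NFC) Unicode Greek."""
    text = beta.upper()
    out = []
    pending = []     # diacritics written between '*' and its capital letter
    capital = False
    for i, ch in enumerate(text):
        if ch == "*":
            capital = True
        elif ch in DIACRITICS:
            # After '*' the diacritics precede the letter, so hold them back.
            (pending if capital else out).append(DIACRITICS[ch])
        elif ch in LETTERS:
            letter = LETTERS[ch]
            if ch == "S" and not capital:
                nxt = text[i + 1] if i + 1 < len(text) else " "
                if nxt not in LETTERS:
                    letter = "ς"  # word-final sigma
            out.append(letter.upper() if capital else letter)
            out.extend(pending)
            pending, capital = [], False
        else:
            out.append(ch)  # spaces, punctuation, anything unrecognised
    return unicodedata.normalize("NFC", "".join(out))
```

For example, `beta_to_unicode("MH=NIN A)/EIDE QEA\\")` gives the opening words of the Iliad in Unicode. The combining characters are emitted in the order Beta code writes them and NFC normalization then composes them into precomposed letters.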

I'll try to gather the PHI documentation I have.

techvslife commented 4 years ago

Thanks. According to its docs, the tlgu.carmen.gr utility at least offers useful options for how page/line numbers (Bekker, Stephanus) appear when exporting to Unicode. I haven't used it yet, since I need to set up WSL2 (v2 of the Windows Subsystem for Linux) on my Windows box to use it easily.

I should mention here and in the other thread that all open source Perseus texts are available here in Unicode: https://github.com/PerseusDL/canonical-greekLit The json file "canonical-greekLit.tracking.json" contains the index.

daphniz commented 4 years ago

@techvslife, do you know how much this overlaps with the TLG?

techvslife commented 4 years ago

Good question. As far as I can tell, Perseus/Tufts has released for download only a very limited selection of what they actually have, presumably due to copyright restrictions. For example, look at Aristotle: only a few works are downloadable, and the Physics, for instance, is not among them! In tlg, of course, it's there. It's also in the full Perseus library (seen from the scaife browser): https://scaife.perseus.org/library/

So I think the tlg cd is still by far the easiest solution (if you can get access through a library or institution). (Though theoretically I suppose one could download chunk by chunk (page by page) from perseus and reassemble the file, it would likely be a copyright violation.)

techvslife commented 4 years ago

Actually, you can find the Physics and some other texts for download here: https://opengreekandlatin.github.io/First1KGreek/

pjheslin commented 4 years ago

It is incorrect to say that Perseus only permits downloads of a subset of their texts; they are all open access. The Scaife viewer includes texts from the First1KGreek project, which is separate from Perseus.

Perseus was not designed to be as comprehensive as the TLG. It has less coverage not due to copyright restrictions but due to having less money than the TLG. First1KGreek is a new project from the creators of Perseus which is designed to have similar coverage to the TLG, but using new OCR technology.

I have made an interface to all these corpora and others here: https://d.iogen.es/web

techvslife commented 4 years ago

Thank you for explaining the Perseus situation—I had not known the reason (whether copyright issues or other), only that many of the texts are not directly downloadable from that (main) Perseus website. I know the scholars there work tirelessly to make as many texts freely and widely accessible as they can, given their resources. And thank you for your one-stop web interface.

As far as downloading the unicode greek texts in their entirety (assuming one does not have tlg cd access), I believe it’s best to go directly to these two download sites:

https://github.com/PerseusDL/canonical-greekLit

and (for the others that are viewable on Perseus/scaife but not directly downloadable there):

https://opengreekandlatin.github.io/First1KGreek/

There may be others but these are the ones I know.

pjheslin commented 4 years ago

I've collected all the PHI/TLG documentation that I have and I have put it up here:

https://github.com/pjheslin/phi-tlg-docs

pjheslin / diogenes

Documentation of format #56