pishoyg opened this issue 1 month ago
Note: The nice thing about TSVs is that chunking them and gluing them back together is quite straightforward! And TSV is the dominant format in your repo.
Besides your TSVs, you have:
- Bible: `output/html/`, `output/epub/`, `output/txt/`
- KELLIA:
- Crum: `output/mobi` (XHTML)
- copticsite.com:
- Flashcards:
- Kindle:
Here is where you use TSVs:
- The `dictionary/*` pipelines (`coptwrd.tsv`, `coptdrv.tsv`, `notes.tsv`, and `marcion_dawoud.tsv`).
- The `flashcards` pipeline. (You recently disabled that to save space. We will likely implement a similar timestamp tracking mechanism for the HTML files.)

Note: The most pressing files are the ones under `dictionary/marcion.sourceforge.net/data/output/tsv/`. These change quite often, and their size is considerable.
TSVs are too large; HTMLs are too many. We should find something in between.
Idea: Define a `tsvs` format, which is simply a directory containing many TSVs that need to get concatenated.
Here is a nice way to implement `write_tsvs`:
```python
import os

import numpy as np
import pandas as pd

CHUNK_SIZE = 100


def write_tsvs(df: pd.DataFrame, tsvs: str, zfill: int = 0) -> None:
    # Row positions at which to split the frame into CHUNK_SIZE-row chunks.
    boundaries = list(range(CHUNK_SIZE, len(df.index), CHUNK_SIZE))
    # Chunk indices run from 0 to len(boundaries), so pad to that width.
    zfill = zfill or len(str(len(boundaries)))
    for idx, chunk in enumerate(np.array_split(df, boundaries)):
        chunk.to_csv(
            os.path.join(tsvs, str(idx).zfill(zfill) + ".tsv"),
            sep="\t",
            index=False,
        )
```
TODO:
- Run `du --apparent-size --human-readable` against them, and put the results here.
- Implement the `tsvs` format in a new `utils` package that lives right under your root directory.
- … `tsvs`.

The `tsvs` format is a success!
```
$ duash dictionary/marcion.sourceforge.net/data/marcion-input/*
1014K dictionary/marcion.sourceforge.net/data/marcion-input/coptdrv.tsv
746K dictionary/marcion.sourceforge.net/data/marcion-input/coptwrd.tsv
```
I think it's OK to keep this in TSV format in order to be able to `diff` it against the raw data and list the typo fixes; the size is not that large. (Though you could probably use `difftool` to `diff` two directories. Look into this!)
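On the directory-diffing point: plain `diff -r` already recurses into directories, and `git diff --no-index` can compare two trees that aren't in the index. A quick demonstration with throwaway paths (the directory names are made up):

```shell
# Build two throwaway directories that differ in one chunk.
tmp=$(mktemp -d)
mkdir "$tmp/old-tsvs" "$tmp/new-tsvs"
printf 'a\t1\n' > "$tmp/old-tsvs/0.tsv"
printf 'a\t2\n' > "$tmp/new-tsvs/0.tsv"

# Plain diff recurses with -r and exits non-zero when the trees differ.
diff -r "$tmp/old-tsvs" "$tmp/new-tsvs" || echo "trees differ"

# Git can diff directories outside its index, with its usual diff output.
git diff --no-index "$tmp/old-tsvs" "$tmp/new-tsvs" || true
```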
```
$ du --apparent-size --human-readable dictionary/kellia.uni-goettingen.de/data/v1.2/*.xml
7.2M dictionary/kellia.uni-goettingen.de/data/v1.2/BBAW_Lexicon_of_Coptic_Egyptian-v4-2020.xml
5.5M dictionary/kellia.uni-goettingen.de/data/v1.2/Comprehensive_Coptic_Lexicon-v1.2-2020.xml
4.8M dictionary/kellia.uni-goettingen.de/data/v1.2/DDGLC_Lexicon_of_Greek_Loanwords_in_Coptic-v2-2020.xml
```
```
$ ls -Sl bible/stshenouda.org/data/input/*.json | head -n 5
-rw-r--r-- 1 pgirgis 3.5M Aug 7 14:24 bible/stshenouda.org/data/input/Psalms.json
-rw-r--r-- 1 pgirgis 2.0M Aug 7 14:24 bible/stshenouda.org/data/input/Genesis.json
-rw-r--r-- 1 pgirgis 1.9M Aug 7 14:24 bible/stshenouda.org/data/input/Jeremiah.json
-rw-r--r-- 1 pgirgis 1.8M Aug 7 14:24 bible/stshenouda.org/data/input/Ezekiel.json
-rw-r--r-- 1 pgirgis 1.7M Aug 7 14:24 bible/stshenouda.org/data/input/Isaiah.json
```
Pipelines currently writing `tsvs`:
Pipelines that will write `tsvs` in the future:
TODO: Your `zfill` calculation is buggy! `len(str(len(boundaries) + 2))` over-pads: chunk indices only run up to `len(boundaries)`.
TODO: Use uniform chunk naming conventions for your `tsvs` format and your Kindle.
Status:
- `html` output. (This should be picked up soon.)
- `mobi` and `epub`. (This is seeing no active development at the moment, so it can wait.)

Status:
```
$ find . -type d -name data | grep --invert '\./archive' | xargs ls
./bible/stshenouda.org/data:
img input output raw

./dictionary/copticsite.com/data:
output raw

./dictionary/kellia.uni-goettingen.de/data:
output v1 v1.2

./dictionary/marcion.sourceforge.net/data:
crum img img-300 img-sources marcion-dawoud marcion-input marcion-raw notes obsolete output snd-pishoy

./flashcards/data:
img output
```
Right now, we display some information about file sizes to the users. This should tell developers whether their commits are undesirably large.
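For reference, one simple way to surface the largest tracked files (an illustrative command run in a throwaway repo; the repo's actual reporting may differ):

```shell
# Set up a throwaway repo purely for demonstration.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
head -c 1000 /dev/zero > big.tsv
echo small > small.tsv
git add .
git -c user.email=dev@example.com -c user.name=dev commit -qm init

# -l adds the blob size as the 4th column; sort by it, largest first.
git ls-tree -r -l HEAD | sort -k4 -n -r | head -n 5
```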
FWIW: This hasn't been a problem for a while.
Here are a few thoughts:
All of that being said, growth actually comes from the commits that write a large number of files, rather than the commits that modify large files. For example, generating the flashcards in a new format writes 80MB worth of data! This should therefore be done sparingly.
One option is to stop writing the flashcards HTMLs. That would save space, but it would deprive us of the ability to `git diff` the output of our changes. We would like to retain them, but write them sparingly, and perhaps do the bulk of the magic in ~JavaScript~ TypeScript instead (#202).
I think the current status is acceptable.
Here are some huge files that we can get rid of by having Git ignore them (if it's about obtaining a `diff`, we can do so through the Site repo and just keep this repo clean):
Blocker: You won't be able to run the pre-commits against these files if Git is ignoring them.
Besides the above, we have some huge files belonging to projects that are not in active development, so we don't need to worry about them. Namely:
Once again, Git storage is optimized for storing text data, so all of this might actually be unnecessary.
One hook to consider is a formatter (`tidy`) for HTML. That's probably the only hook that matters for HTML output.
The size of your repo is growing exponentially. This is undesirable.
Consider chunking large files, such as the TSVs used for the dictionaries. The current pipeline writes a new version of a tens-of-megabytes file even if a single character has changed! (Chunking could also speed up your pipelines.)