pishoyg / coptic

This is a project that aims to make the Coptic language more learnable.
https://remnqymi.com/
GNU General Public License v3.0

Stop Working with so Many Huge Files #139

Open pishoyg opened 1 month ago

pishoyg commented 1 month ago

The size of your repo is growing exponentially. This is undesirable.

Consider chunking large files, such as the TSVs used for the dictionaries. The current pipeline writes a new version of a tens-of-megabytes file even if a single character has changed! (Chunking could also speed up your pipelines.)

pishoyg commented 1 month ago

Note: The nice thing about TSVs is that chunking them and gluing them back together is quite straightforward! And this is the dominant format in your repo.

Besides your TSVs, you have:

Bible:

KELLIA:

Crum:

copticsite.com:

Flashcards:

Kindle:

pishoyg commented 1 month ago

Here is where you use TSVs:

pishoyg commented 1 month ago

Note: The most pressing files are the ones under dictionary/marcion.sourceforge.net/data/output/tsv/. These change quite often, and their size is considerable.

pishoyg commented 1 month ago

TSVs are too large; HTMLs are too many. We should find something in between.

Idea: Define a tsvs format, which is simply a directory containing many TSVs that need to be concatenated.

Here is a nice way to implement write_tsvs:

import os
import numpy as np
import pandas as pd

CHUNK_SIZE = 100

def write_tsvs(df: pd.DataFrame, tsvs: str, zfill: int = 0) -> None:
    # Cut the frame at every CHUNK_SIZE-th row.
    boundaries = list(range(CHUNK_SIZE, len(df.index), CHUNK_SIZE))
    zfill = zfill or len(str(len(boundaries) + 2))  # width of the filename index
    for idx, chunk in enumerate(np.array_split(df, boundaries)):
        chunk.to_csv(os.path.join(tsvs, str(idx).zfill(zfill) + ".tsv"), sep="\t", index=False)
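
For completeness, the gluing side is just as short. A sketch, assuming the zero-padded filenames produced by write_tsvs above (read_tsvs is a hypothetical counterpart, not existing code; it reuses os and pd from the snippet above, plus glob):

import glob

def read_tsvs(tsvs: str) -> pd.DataFrame:
    # Zero-padded names make lexicographic order match chunk order.
    files = sorted(glob.glob(os.path.join(tsvs, "*.tsv")))
    return pd.concat((pd.read_csv(f, sep="\t") for f in files), ignore_index=True)
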
pishoyg commented 1 month ago

TODO:

  1. Go over the large files listed above, run du --apparent-size --human-readable against them, and put the results here.
  2. Highlight the ones that are actively being developed.
  3. Define the logic for the new tsvs format in a new utils package that lives right under your root directory.
  4. Write and read the problematic files in tsvs.
  5. Move to the backlog, and revisit when another problematically large file starts receiving active attention.
pishoyg commented 1 month ago

The tsvs format is a success!

pishoyg commented 1 month ago
$ du --apparent-size --human-readable dictionary/marcion.sourceforge.net/data/marcion-input/*
1014K   dictionary/marcion.sourceforge.net/data/marcion-input/coptdrv.tsv
746K    dictionary/marcion.sourceforge.net/data/marcion-input/coptwrd.tsv

I think it's OK to keep this in TSV format in order to be able to diff it against the raw data and list the typo fixes; the size is not that large. (Though you could probably use a difftool, such as git difftool --dir-diff, to diff two directories; look into this!)
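
If the Git-level diff doesn't pan out, Python's standard library can compare two directories directly. A sketch using filecmp, assuming the raw copy lives in marcion-raw and the typo-fixed copy in marcion-input, per the data layout shown later in this thread:

import filecmp

# Print which files differ between the raw dump and the fixed input.
filecmp.dircmp(
    "dictionary/marcion.sourceforge.net/data/marcion-raw",
    "dictionary/marcion.sourceforge.net/data/marcion-input",
).report()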

pishoyg commented 1 month ago
$ du --apparent-size --human-readable dictionary/kellia.uni-goettingen.de/data/v1.2/*.xml
7.2M    dictionary/kellia.uni-goettingen.de/data/v1.2/BBAW_Lexicon_of_Coptic_Egyptian-v4-2020.xml
5.5M    dictionary/kellia.uni-goettingen.de/data/v1.2/Comprehensive_Coptic_Lexicon-v1.2-2020.xml
4.8M    dictionary/kellia.uni-goettingen.de/data/v1.2/DDGLC_Lexicon_of_Greek_Loanwords_in_Coptic-v2-2020.xml
pishoyg commented 1 month ago
$ ls -Sl bible/stshenouda.org/data/input/*.json | head -n 5
-rw-r--r-- 1 pgirgis  3.5M Aug  7 14:24 bible/stshenouda.org/data/input/Psalms.json
-rw-r--r-- 1 pgirgis  2.0M Aug  7 14:24 bible/stshenouda.org/data/input/Genesis.json
-rw-r--r-- 1 pgirgis  1.9M Aug  7 14:24 bible/stshenouda.org/data/input/Jeremiah.json
-rw-r--r-- 1 pgirgis  1.8M Aug  7 14:24 bible/stshenouda.org/data/input/Ezekiel.json
-rw-r--r-- 1 pgirgis  1.7M Aug  7 14:24 bible/stshenouda.org/data/input/Isaiah.json
pishoyg commented 1 month ago

Pipelines currently writing tsvs:

Pipelines that will write tsvs in the future:

pishoyg commented 1 month ago

TODO: Your zfill calculation is buggy! (See the sketch below.)

TODO: Use uniform chunk naming conventions for your tsvs format and your Kindle files.
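
Here is what the zfill fix might look like (a sketch against the write_tsvs snippet above): np.array_split(df, boundaries) yields len(boundaries) + 1 chunks, indexed 0 through len(boundaries), so the padding width should come from len(boundaries), not len(boundaries) + 2.

# In write_tsvs: pad to the width of the largest chunk index.
zfill = zfill or len(str(len(boundaries)))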

pishoyg commented 1 month ago

Status:

pishoyg commented 1 month ago

Status:


$ find . -type d -name data | grep --invert '\./archive' | xargs ls
./bible/stshenouda.org/data:
img  input  output  raw

./dictionary/copticsite.com/data:
output  raw

./dictionary/kellia.uni-goettingen.de/data:
output  v1  v1.2

./dictionary/marcion.sourceforge.net/data:
crum  img  img-300  img-sources  marcion-dawoud  marcion-input  marcion-raw  notes  obsolete  output  snd-pishoy

./flashcards/data:
img  output
pishoyg commented 3 weeks ago

Right now, we display some information about file sizes to users. The same numbers should tell developers whether their commits are undesirably large.
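
If we ever want to enforce this rather than rely on eyeballing, here is a minimal sketch of a staged-file size check that could run as a pre-commit hook (the 1 MiB threshold and the hook wiring are assumptions for illustration):

import os
import subprocess

SIZE_THRESHOLD = 1 << 20  # 1 MiB; an arbitrary example threshold

def large_staged_files() -> list[tuple[str, int]]:
    # Ask Git for the paths staged in the current commit.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Keep only files whose on-disk size exceeds the threshold.
    return [(p, os.path.getsize(p)) for p in staged
            if os.path.isfile(p) and os.path.getsize(p) > SIZE_THRESHOLD]

for path, size in large_staged_files():
    print(f"WARNING: {path} is {size / (1 << 20):.1f} MiB")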

pishoyg commented 2 weeks ago

FWIW: This hasn't been a problem for a while.

pishoyg commented 1 week ago

Here are a few thoughts:

All of that being said, growth actually comes from the commits that write a large number of files, rather than the commits that modify large files. For example, generating the flashcards in a new format writes 80 MB worth of data! This should therefore be done sparingly.

One option is to stop writing the flashcard HTMLs. That would save space, but it would deprive us of the ability to git diff the output of our changes. We would like to retain them, but write them sparingly, and perhaps do the bulk of the magic in ~~JavaScript~~ TypeScript instead (#202).

I think the current status is acceptable.

pishoyg commented 3 days ago

Here are some huge files that we can get rid of by having Git ignore them (if the goal is just to be able to diff them, we can do that through the Site repo and keep this repo clean):

Blocker: You won't be able to run the pre-commits against these files if Git is ignoring them.

Besides the above, we have some huge files belonging to projects that are not in active development, so we don't need to worry about them. Namely:

Once again, Git storage is optimized for storing text data, so all of this might actually be unnecessary.

pishoyg commented 3 days ago