pishoyg / coptic

This is a project that aims to make the Coptic language more learnable.
https://remnqymi.com/
GNU General Public License v3.0

Stop Working with so Many Huge Files #139

Open pishoyg opened 1 month ago

pishoyg commented 1 month ago

The size of your repo is growing exponentially. This is undesirable.

Consider chunking large files, such as the TSVs used for the dictionaries. The current pipeline writes a new version of a tens-of-megabytes file even if a single character has changed! (Chunking could also speed up your pipelines.)

pishoyg commented 1 month ago

Note: The nice thing about TSVs is that chunking them and gluing them back together is quite straightforward! And this is the dominant format in your repo.

Besides your TSVs, you have:

Bible:

KELLIA:

Crum:

copticsite.com:

Flashcards:

Kindle:

pishoyg commented 1 month ago

Here is where you use TSVs:

pishoyg commented 1 month ago

Note: The most pressing files are the ones under dictionary/marcion.sourceforge.net/data/output/tsv/. These change quite often, and their size is considerable.

pishoyg commented 1 month ago

TSVs are too large; HTMLs are too many. We should find something in between.

Idea: Define a tsvs format, which is simply a directory containing many TSVs that need to be concatenated.

Here is a nice way to implement write_tsvs:

import os
import numpy as np
import pandas as pd

CHUNK_SIZE = 100

def write_tsvs(df: pd.DataFrame, tsvs: str, zfill: int = 0) -> None:
    # Cut the frame at every CHUNK_SIZE-th row.
    boundaries = list(range(CHUNK_SIZE, len(df.index), CHUNK_SIZE))
    zfill = zfill or len(str(len(boundaries) + 2))  # width of the filename index
    for idx, chunk in enumerate(np.array_split(df, boundaries)):
        chunk.to_csv(os.path.join(tsvs, str(idx).zfill(zfill) + ".tsv"), sep="\t", index=False)
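
For completeness, the gluing side is just as short. A sketch, assuming the zero-padded filenames produced by write_tsvs above (read_tsvs is a hypothetical counterpart, not existing code; it reuses os and pd from the snippet above, plus glob):

import glob

def read_tsvs(tsvs: str) -> pd.DataFrame:
    # Zero-padded names make lexicographic order match chunk order.
    files = sorted(glob.glob(os.path.join(tsvs, "*.tsv")))
    return pd.concat((pd.read_csv(f, sep="\t") for f in files), ignore_index=True)
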
pishoyg commented 1 month ago

TODO:

  1. Go over the large files listed above, run du --apparent-size --human-readable against them, and put the results here.
  2. Highlight the ones that are actively being developed.
  3. Define the logic for the new tsvs format in a new utils package that lives right under your root directory.
  4. Write and read the problematic files in tsvs.
  5. Move to the backlog, and revisit when another problematically large file starts receiving active attention.
pishoyg commented 1 month ago

The tsvs format is a success!

pishoyg commented 1 month ago
$ du --apparent-size --human-readable dictionary/marcion.sourceforge.net/data/marcion-input/*
1014K   dictionary/marcion.sourceforge.net/data/marcion-input/coptdrv.tsv
746K    dictionary/marcion.sourceforge.net/data/marcion-input/coptwrd.tsv

I think it's OK to keep this in TSV format in order to be able to diff it against the raw data and list the typo fixes; the size is not that large. (Though you could probably use a difftool, such as git difftool --dir-diff, to diff two directories; look into this!)
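
If the Git-level diff doesn't pan out, Python's standard library can compare two directories directly. A sketch using filecmp, assuming the raw copy lives in marcion-raw and the typo-fixed copy in marcion-input, per the data layout shown later in this thread:

import filecmp

# Print which files differ between the raw dump and the fixed input.
filecmp.dircmp(
    "dictionary/marcion.sourceforge.net/data/marcion-raw",
    "dictionary/marcion.sourceforge.net/data/marcion-input",
).report()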

pishoyg commented 1 month ago
$ du --apparent-size --human-readable dictionary/kellia.uni-goettingen.de/data/v1.2/*.xml
7.2M    dictionary/kellia.uni-goettingen.de/data/v1.2/BBAW_Lexicon_of_Coptic_Egyptian-v4-2020.xml
5.5M    dictionary/kellia.uni-goettingen.de/data/v1.2/Comprehensive_Coptic_Lexicon-v1.2-2020.xml
4.8M    dictionary/kellia.uni-goettingen.de/data/v1.2/DDGLC_Lexicon_of_Greek_Loanwords_in_Coptic-v2-2020.xml
pishoyg commented 1 month ago
$ ls -Sl bible/stshenouda.org/data/input/*.json | head -n 5
-rw-r--r-- 1 pgirgis  3.5M Aug  7 14:24 bible/stshenouda.org/data/input/Psalms.json
-rw-r--r-- 1 pgirgis  2.0M Aug  7 14:24 bible/stshenouda.org/data/input/Genesis.json
-rw-r--r-- 1 pgirgis  1.9M Aug  7 14:24 bible/stshenouda.org/data/input/Jeremiah.json
-rw-r--r-- 1 pgirgis  1.8M Aug  7 14:24 bible/stshenouda.org/data/input/Ezekiel.json
-rw-r--r-- 1 pgirgis  1.7M Aug  7 14:24 bible/stshenouda.org/data/input/Isaiah.json
pishoyg commented 1 month ago

Pipelines currently writing tsvs:

Pipelines that will write tsvs in the future:

pishoyg commented 1 month ago

TODO: Your zfill calculation is buggy! (See the sketch below.)

TODO: Use uniform chunk naming conventions for your tsvs format and your Kindle files.
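
Here is what the zfill fix might look like (a sketch against the write_tsvs snippet above): np.array_split(df, boundaries) yields len(boundaries) + 1 chunks, indexed 0 through len(boundaries), so the padding width should come from len(boundaries), not len(boundaries) + 2.

# In write_tsvs: pad to the width of the largest chunk index.
zfill = zfill or len(str(len(boundaries)))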

pishoyg commented 1 month ago

Status:

pishoyg commented 1 month ago

Status:


$ find . -type d -name data | grep --invert '\./archive' | xargs ls
./bible/stshenouda.org/data:
img  input  output  raw

./dictionary/copticsite.com/data:
output  raw

./dictionary/kellia.uni-goettingen.de/data:
output  v1  v1.2

./dictionary/marcion.sourceforge.net/data:
crum  img  img-300  img-sources  marcion-dawoud  marcion-input  marcion-raw  notes  obsolete  output  snd-pishoy

./flashcards/data:
img  output
pishoyg commented 3 weeks ago

Right now, we display some information about file sizes to users. The same numbers should tell developers whether their commits are undesirably large.
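
If we ever want to enforce this rather than rely on eyeballing, here is a minimal sketch of a staged-file size check that could run as a pre-commit hook (the 1 MiB threshold and the hook wiring are assumptions for illustration):

import os
import subprocess

SIZE_THRESHOLD = 1 << 20  # 1 MiB; an arbitrary example threshold

def large_staged_files() -> list[tuple[str, int]]:
    # Ask Git for the paths staged in the current commit.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    # Keep only files whose on-disk size exceeds the threshold.
    return [(p, os.path.getsize(p)) for p in staged
            if os.path.isfile(p) and os.path.getsize(p) > SIZE_THRESHOLD]

for path, size in large_staged_files():
    print(f"WARNING: {path} is {size / (1 << 20):.1f} MiB")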

pishoyg commented 2 weeks ago

FWIW: This hasn't been a problem for a while.

pishoyg commented 1 week ago

Here are a few thoughts:

All of that being said, growth actually comes from the commits that write a large number of files, rather than the commits that modify large files. For example, generating the flashcards in a new format writes 80 MB worth of data! This should therefore be done sparingly.

One option is to stop writing the flashcard HTMLs. That would save space, but it would deprive us of the ability to git diff the output of our changes. We would like to retain them, but write them sparingly, and perhaps do the bulk of the magic in ~~JavaScript~~ TypeScript instead (#202).

I think the current status is acceptable.

pishoyg commented 3 days ago

Here are some huge files that we can get rid of by having Git ignore them (if the goal is just to be able to diff them, we can do that through the Site repo and keep this repo clean):

Blocker: You won't be able to run the pre-commits against these files if Git is ignoring them.

Besides the above, we have some huge files belonging to projects that are not in active development, so we don't need to worry about them. Namely:

Once again, Git storage is optimized for storing text data, so all of this might actually be unnecessary.

pishoyg commented 3 days ago