mdfeist / TypeV

Apache License 2.0

Using filename diff per commit could dramatically reduce JSON file size #39

Open eddieantonio opened 8 years ago

eddieantonio commented 8 years ago

Around 80% of the JSON is made up of all_files within each commit — the list of all filenames at that commit. This scales poorly: O(|files| * |commits|).

Instead, for each commit, we can record which files were added and removed relative to the previous commit in some chosen sequence. The sequence we choose is arbitrary, but ideally it would minimize the size of each diff. Iterating by commit date works well until we deal with branches, but that's probably not a big deal.
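As a rough sketch of the idea (not TypeV's actual code), the per-commit diff could be computed as below. It assumes the commits are already in the chosen order (for example, by commit date) and that each one carries an `all_files` list. The field names `files_added` and `files_removed` and the `sha` key are hypothetical, made up for illustration.

```python
from typing import List


def encode_deltas(commits: List[dict]) -> List[dict]:
    """Replace each commit's all_files with a diff against the previous commit."""
    previous = set()  # file set of the previous commit; empty before the first
    encoded = []
    for commit in commits:
        current = set(commit["all_files"])
        # Copy every field except the bulky all_files list.
        slim = {k: v for k, v in commit.items() if k != "all_files"}
        slim["files_added"] = sorted(current - previous)    # hypothetical field
        slim["files_removed"] = sorted(previous - current)  # hypothetical field
        encoded.append(slim)
        previous = current
    return encoded


def decode_deltas(encoded: List[dict]) -> List[dict]:
    """Rebuild all_files for every commit by replaying the diffs in order."""
    files = set()
    decoded = []
    for commit in encoded:
        files = (files - set(commit["files_removed"])) | set(commit["files_added"])
        full = {k: v for k, v in commit.items()
                if k not in ("files_added", "files_removed")}
        full["all_files"] = sorted(files)  # original ordering is not preserved
        decoded.append(full)
    return decoded


if __name__ == "__main__":
    commits = [
        {"sha": "a", "all_files": ["README", "a.py"]},
        {"sha": "b", "all_files": ["README", "a.py", "b.py"]},
        {"sha": "c", "all_files": ["README", "b.py"]},
    ]
    for entry in encode_deltas(commits):
        print(entry)
```

The first commit still lists every file as added, but each later commit only stores what changed. The trade-off is that a client has to replay the diffs from the start to recover the full file list at a given commit.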

Showin' ma work
$ http :57442/projects/antlr4/get_project type==Types | json > parsed.json
$ json dates < parsed.json | wc
  292127  467739 7479044
$ <parsed.json json commits | wc
 1238030 1301928 80724731
$ <parsed.json json commits | json -a all_files | wc
 1194860 1194860 73953759
$ wc parsed.json
 1532796 1772460 91378688 parsed.json
$ dc
100 73953759 * 91378688 / p
80

eddieantonio commented 8 years ago

Yeesh! I redid the calculation using the raw JSON instead of the parsed file, and found that the filenames take up 92% of the entire JSON file! That's like... 70.53 MiB (73953759 bytes)!

$ dc -e "100 $(<raw.json json commits | json -a all_files | wc -c) * $(cat raw.json | wc -c) / p"
92
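
For anyone who wants to repeat the measurement without the `json` CLI, here is a minimal Python sketch. It assumes the response is an object with a top-level `commits` array whose entries each have an `all_files` list, as the commands above suggest. The function name `all_files_share` is made up for illustration. It measures compact serializations on both sides, so the number may differ from the `wc`-based figures above.

```python
import json


def compact_size(value) -> int:
    """Byte length of value serialized as compact UTF-8 JSON."""
    text = json.dumps(value, separators=(",", ":"), ensure_ascii=False)
    return len(text.encode("utf-8"))


def all_files_share(doc: dict) -> int:
    """Integer percentage of the document taken up by the all_files lists."""
    total = compact_size(doc)
    names = sum(compact_size(commit["all_files"]) for commit in doc["commits"])
    return 100 * names // total  # integer division, like dc at its default scale


if __name__ == "__main__":
    import sys

    # Usage: python share.py raw.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(all_files_share(json.load(handle)))
```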