Code repository contains huge installers

mahfiaz commented 2 years ago

This repository contains large unnecessary installer binaries in directories Qiqqa-Software-Installer-Releases and Qiqqa.Build, totaling 1046 MB. Also TestData should possibly be held separately.

883M    Qiqqa-Software-Installer-Releases
402M    TestData
358M    libs
163M    Qiqqa.Build
144M    docs-src
138M    docs
48M     Qiqqa
6.0M    icons
5.8M    Utilities
1.7M    Technology Tests
...
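
For anyone wanting to reproduce a breakdown like the one above without `du`, here is a minimal Python sketch. The function names `dir_size_bytes` and `size_report` are made up for illustration and are not part of the Qiqqa repository. It sums apparent file sizes, so the numbers may differ slightly from the disk usage that `du` reports.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def size_report(top):
    """Return (size_in_bytes, name) pairs for each immediate
    subdirectory of top, largest first."""
    rows = []
    for entry in os.scandir(top):
        if entry.is_dir(follow_symlinks=False):
            rows.append((dir_size_bytes(entry.path), entry.name))
    return sorted(rows, reverse=True)


# Hypothetical usage, run from the root of a checkout:
#   for size, name in size_report("."):
#       print(f"{size / 2**20:8.1f}M  {name}")
```

The 1046 MB figure quoted above is the sum of the two installer directories in the listing (883 MB + 163 MB).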

GerHobbelt commented 2 years ago

The installers are a mirror of the Commercial Qiqqa installers for folks who need to backtrack and recover their old databases, e.g. when recovering or restarting work they did several years ago. qiqqa.com is only alive because Quantisle keeps paying domain and server costs, and that time is limited.

It probably wasn't the smartest move to drop them in there, but it's done and moving them elsewhere isn't going to shrink the git repo. Alas. 🤷
Test data included in the repo: this is one I've given some more thought. The decision was made to keep at least a minimum viable test set with the source code, so you have an 'all-together-now' kind of repo where tests can be run using test data that doesn't need extra actions to fetch.

Again, mix that with historical developments and you get more bytes than there might have been, but this is another one for the 20/20-hindsight Friday afternoon. 😄

BTW: if you're wondering "is that all the test data?", then may I refer you to the Evil Qiqqa Corpus where the production test set is kept. 😉

All in all, I don't worry too much about storage size. It has been a concern, but given my limited control over the original repo and the cost of "redoing it all", including my own historical choices, it is what it is.

Then there's the (to me) important consideration: "who is impacted?" The answer to that one is simple: only Qiqqa developers. (Users download a release setup.exe and start from there.) And when you work on tools like these, it doesn't matter whether the tool itself is large or small: your incoming dataset is huge in variety, as you'll fundamentally be accepting and processing every viable PDF out there, so you'll have to be able to handle a large file set for the test data at least, even if you only run production-level stability tests every so often.

I now run Qiqqa (and the new qiqqa backend tools research) through its paces on a library of ~80K files, and the nastiest problems only surface[d] with literally one (or in some cases very few) PDFs in that collection. Some of that nastiness is still under investigation, and I do expect others to feed me extra PDFs in due time that cause the oddest kinds of problems, but setting up and using a large corpus allows me to at least make sure my core toolbase can be expected to cope rather decently. Qiqqa comes from a place (Commercial Qiqqa) where my actual research database of about 20K articles was causing enough trouble that Qiqqa was unable to get beyond the initial re-import-after-fatal-database-corrupting-crash phase (which was a repeating process, as Qiqqa would crash-and-not-recover without a single day of usage), rendering it utterly unusable and making me angry enough to consider reverse-engineering the whole damn thing back in the day of the v70 series (Commercial Qiqqa).

Qiqqa is still quite fickle in many areas (the UI is a major headache and no fun at all) and still needs a lot of work. Yes, the git clone isn't as fast and lean as it could be, but frankly speaking, it's currently the least of my worries.

I hope this gives some [historical] perspective on the current situation, the decisions made and the route we're traveling on.

TroyDanielFZ commented 2 years ago

It's really sad. I spent a lot of time, only to find that git clone failed to work due to the large size.

jimmejardine / qiqqa-open-source

Code repository contains huge installers #352