ismir / ismir_web

Repository for the ISMIR website
GNU General Public License v2.0

Repository footprint is too large #29

Open ejhumphrey opened 6 years ago

ejhumphrey commented 6 years ago

Bordering on redundancy with #24, but worthy of its own callout: as of #27, the repository is now too large (≈1000 MB). We knew going in that the proceedings were big, but nothing warned of the total size ... whoops. Per GitHub's docs:

We recommend repositories be kept under 1GB each. This limit is easy to stay within if large files are kept out of the repository. If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down.

In addition, we place a strict limit of files exceeding 100 MB in size. For more information, see "Working with large files."

In which case, I'd escalate this to "bug" level. If nothing else, cloning this repository is now bandwidth-prohibitive, which is unfortunate.

Triage, however, is going to be a bit of work.
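A minimal triage sketch, assuming Python 3 and a local clone, run from the repository root: list the largest blobs anywhere in the history, since those are what keep the clone big even after the files are deleted from the working tree. It only shells out to stock git plumbing commands.

```python
import subprocess

# Enumerate every object in the full history; blob lines carry the path
# the blob was stored under, commit/tree lines do not and are skipped.
rev_list = subprocess.run(
    ["git", "rev-list", "--objects", "--all"],
    capture_output=True, text=True, check=True,
).stdout

paths = {}
for line in rev_list.splitlines():
    parts = line.split(maxsplit=1)
    if len(parts) == 2:
        paths[parts[0]] = parts[1]

# Resolve type and size for all objects in one batched call.
batch = subprocess.run(
    ["git", "cat-file",
     "--batch-check=%(objectname) %(objecttype) %(objectsize)"],
    input="\n".join(paths), capture_output=True, text=True, check=True,
).stdout

blobs = []
for line in batch.splitlines():
    sha, otype, size = line.split()
    if otype == "blob":
        blobs.append((int(size), paths[sha]))

# The 20 largest files ever committed, biggest first.
for size, path in sorted(blobs, reverse=True)[:20]:
    print(f"{size / 1e6:8.1f} MB  {path}")
```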

stefan-balke commented 6 years ago

Related question: Who is hosting this?

http://www.ismir.net/proceedings

ejhumphrey commented 6 years ago

oh wow! I think (?) that's still IRCAM? but I don't understand the web well enough to know how that'd even still work. 🤔

DaDaBIK is the technology I had in mind re: shifting business models (no free version, last I looked), and it didn't make sense to port it over to the new host (since Google search + DBLP are what they are).

stefan-balke commented 6 years ago

On the webspace there must be some redirect or a mod_rewrite rule (e.g. in an .htaccess file).
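One quick way to confirm whether a redirect is in play (a sketch, assuming the host still answers plain HTTP) is to follow the old URL and see where it lands:

```python
import urllib.request

# urllib follows HTTP redirects by default; geturl() reports the final URL.
url = "http://www.ismir.net/proceedings"
with urllib.request.urlopen(url) as response:
    print("final URL:", response.geturl())
    print("status:   ", response.status)
```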

Do we have all this data somewhere? It seems to me that the old site was quite complete, so we could simply scrape the whole thing plus the PDFs...
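A scraping sketch along those lines, using only the Python standard library; the URL, the page layout, and the assumption that PDF links appear as plain `<a href>` tags are all guesses:

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Assumed index page for the proceedings archive.
BASE = "http://www.ismir.net/proceedings"

class PDFLinkParser(HTMLParser):
    """Collect absolute URLs of every .pdf link on one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.lower().endswith(".pdf"):
                self.links.append(urllib.parse.urljoin(BASE + "/", href))

with urllib.request.urlopen(BASE) as response:
    parser = PDFLinkParser()
    parser.feed(response.read().decode("utf-8", errors="replace"))

for link in parser.links:
    print(link)  # feed these to a downloader with a polite delay
```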

ejhumphrey commented 6 years ago

yep, this was all scraped / handed over in 2013 during The Great Migration ... however, DBLP keeps what looks to be better records on ISMIR than that proceedings DB, and I've also started writing tools to pull that information.
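For the DBLP side, a sketch against the public publication-search API at dblp.org; the query string below is a rough text filter, not an exact venue match, so treat it as a starting point:

```python
import json
import urllib.parse
import urllib.request

def dblp_search(query, hits=100, first=0):
    """Query the DBLP publication-search API and return parsed JSON."""
    params = urllib.parse.urlencode(
        {"q": query, "format": "json", "h": hits, "f": first}
    )
    url = f"https://dblp.org/search/publ/api?{params}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

result = dblp_search("ISMIR 2013")
for hit in result["result"]["hits"].get("hit", []):
    info = hit["info"]
    print(info.get("year"), "-", info.get("title"))
```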

For a Zenodo bulk import, which would need corresponding metadata, I'm thinking one (good?) approach would be to pull the RDF records from DBLP, use those as the source metadata for the Zenodo upload, and then provide updated RDF records back to DBLP so that the URLs can be updated.
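A minimal sketch of the Zenodo leg, assuming a personal access token and Zenodo's documented two-step deposition flow (create, then attach metadata); every metadata value below is a placeholder to be filled from the DBLP record:

```python
import json
import urllib.request

ZENODO_TOKEN = "..."  # hypothetical placeholder: a Zenodo personal access token
API = "https://zenodo.org/api/deposit/depositions"
HEADERS = {"Content-Type": "application/json"}

def request_json(url, data=None, method="GET"):
    """Small helper around urllib for JSON-in/JSON-out API calls."""
    req = urllib.request.Request(url, data=data, headers=HEADERS, method=method)
    with urllib.request.urlopen(req) as response:
        return json.load(response)

# 1) Create an empty deposition.
deposition = request_json(
    f"{API}?access_token={ZENODO_TOKEN}", data=b"{}", method="POST")

# 2) Attach metadata mapped from the DBLP record (values are placeholders).
metadata = {
    "metadata": {
        "upload_type": "publication",
        "publication_type": "conferencepaper",
        "title": "Example ISMIR paper title",           # from the DBLP record
        "creators": [{"name": "Lastname, Firstname"}],  # from the DBLP record
        "description": "Paper from the ISMIR proceedings archive.",
    }
}
request_json(
    f"{API}/{deposition['id']}?access_token={ZENODO_TOKEN}",
    data=json.dumps(metadata).encode("utf-8"), method="PUT")

print("created deposition:", deposition["id"])
# Remaining steps: upload the PDF, then POST .../actions/publish.
```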

stefan-balke commented 6 years ago

Okay, sounds good to me. Although I like DBLP, it seems redundant then... doesn't it?

ejhumphrey commented 6 years ago

yea, kinda ... but DBLP doesn't host content, and Zenodo records without associated metadata would be unfortunate. So long as DBLP is used as the upstream and information (primarily) flows in one direction, I think it makes sense.

Also redundancy is helpful in case one of the two were to vanish 😄