greenelab / library-access

Collecting data on whether library access to scholarly literature

Merging the downloaded data #11

Closed jglev closed 6 years ago

jglev commented 6 years ago

Following the merge/close of PR #7, I have the downloader running now.

While it runs, @dhimmel, perhaps we could discuss what you'd like done with it when it's ready? (Where it should go, whether any transformations / compression needs to be run on it, etc.)

dhimmel commented 6 years ago

So there are two primary data outputs:

  1. library_coverage_xml_and_fulltext_indicators.db
  2. a TSV which gets extracted from library_coverage_xml_and_fulltext_indicators.db

In general I like having a data directory to host data files, but wherever they'll fit best is fine. We should track an xz-compressed version of library_coverage_xml_and_fulltext_indicators.db using Git LFS, unless we think the xz-compressed db will exceed 2 GB, in which case we should not track it with git at all, because GitHub will reject the file.

The next question is whether to compress the TSV. The benefit to compressing is smaller file sizes. With compression, I'd suggest enabling git LFS for that file. The benefit to not compressing is that we can get line-by-line diffs as the file changes. Being able to git diff the TSV can help catch errors and provide feedback on updates. We should compress the TSV if it will exceed 50 MB.

dhimmel commented 6 years ago

I have the downloader running now.

On how many DOIs? The initial PR will probably be easiest with a few DOIs. Then we can ramp up.

jglev commented 6 years ago

On how many DOIs?

Since a full download will take several days, my plan is to just leave it running while I'm at work, until I have the full dataset (I was starting with the closed DOIs, but also did some downloads just now to make sure that it works if I list all of the categories in the config file). I can always just commit a few records in a PR initially if that makes it easier, though.

Unless we think the xz-compressed db will exceed 2 GB.

I've downloaded 4,345 DOIs so far, and the database is 27.8 MB. Assuming (perhaps naively, but, I expect, conservatively) that storage scales linearly with the number of records, that's 27.8 / 4,345 ≈ 0.0064 MB per record. With ~300,000 records, that's 0.0064 × 300,000 ≈ 1,920 MB of storage for slightly more than the whole dataset (which has closer to 290,000 records). Thus, I expect we won't go over the 2 GB limit you mentioned.
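The same extrapolation as a quick sketch, using the numbers above:

```python
downloaded_dois = 4_345
db_size_mb = 27.8
full_dataset_dois = 300_000  # rounded up from the ~290,000 records in the full dataset

mb_per_doi = db_size_mb / downloaded_dois       # ~0.0064 MB per DOI
projected_mb = mb_per_doi * full_dataset_dois   # ~1,920 MB, under the 2,048 MB limit
print(f"{mb_per_doi:.4f} MB per DOI -> ~{projected_mb:.0f} MB projected")
```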

In the data directory you suggest, would you like just the output tsv in there, or the database, too?

dhimmel commented 6 years ago

0.0064 MB per record

Was this with or without compression? I imagine compression will cut down the size considerably. In the long run, I'd like a solution that can accommodate all ~80 million DOIs.

In the data directory you suggest, would you like just the output tsv in there, or the database, too?

I'd put both in data as they're both datasets that the analysis generates.

jglev commented 6 years ago

Was this with or without compression?

That's without any compression.

I'd put both in data

That sounds good to me. I'll open a PR in a bit with an in-progress script for all this.

jglev commented 6 years ago

To confirm, are you ok with using the tarfile library for the compression aspect, or do you have a preferred alternative?

dhimmel commented 6 years ago

are you ok with using the tarfile library for the compression aspect

What are you tarballing (bundling)? Isn't the .db that we'd want to archive just a single file?

When writing the TSV, you can use the lzma module in Python, or, if going via pandas, just specify compression='xz'.
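For what it's worth, a minimal sketch of both options (the paths and column names here are illustrative placeholders, not the project's actual ones):

```python
import lzma

import pandas as pd

# Toy frame standing in for the table extracted from the database.
df = pd.DataFrame({"doi": ["10.1000/xyz123"], "full_text_indicator": [1]})

# Option 1: pandas compresses on write when compression='xz' is given.
df.to_csv("library_coverage.tsv.xz", sep="\t", index=False, compression="xz")

# Option 2: the standard-library lzma module, writing the TSV through a text-mode handle.
with lzma.open("library_coverage.tsv.xz", "wt") as handle:
    df.to_csv(handle, sep="\t", index=False)
```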

jglev commented 6 years ago

Oh, I thought you wanted a copy of library_coverage_xml_and_fulltext_indicators.db saved in a tarball! It is a single file.

To confirm, then, you want library_coverage_xml_and_fulltext_indicators.db (not compressed), and then an lzma-compressed TSV -- is that correct?

dhimmel commented 6 years ago

To confirm, then, you want library_coverage_xml_and_fulltext_indicators.db (not compressed), and then an lzma-compressed TSV -- is that correct?

No! library_coverage_xml_and_fulltext_indicators.db should certainly be compressed, since it's a binary file. Git diffs are not useful for this file, so we might as well compress it and track it with LFS. Just use XZ... no need for a tarball since it's a single file.
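If the compression ends up happening from within Python rather than by calling the xz command, a sketch of one way to do it (the output path is just the input path plus .xz; an assumption, not a requirement):

```python
import lzma
import shutil

db_path = "library_coverage_xml_and_fulltext_indicators.db"

# Stream the single SQLite file through an LZMA/XZ compressor -- no tarball needed.
with open(db_path, "rb") as source, lzma.open(db_path + ".xz", "wb") as target:
    shutil.copyfileobj(source, target)
```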

The TSV is where I'm debating whether we want to compress. If we compress, we lose the ability to get the informative git diffs, which show us new DOIs being added.

One final note: the TSV should be deterministically sorted (I'd suggest by DOI), so that the TSV is always the same unless the underlying data changes.
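For instance, if the TSV is written via pandas, the sort could happen right before the write, so that re-running the extraction on unchanged data produces a byte-identical file (toy column names again):

```python
import pandas as pd

# Toy rows; the real frame would come from the SQLite extraction.
df = pd.DataFrame({"doi": ["10.2000/b", "10.1000/a"], "full_text_indicator": [0, 1]})

# Sort by DOI and drop the old index so the output is deterministic
# whenever the underlying records are unchanged.
df = df.sort_values("doi").reset_index(drop=True)
df.to_csv("library_coverage.tsv.xz", sep="\t", index=False, compression="xz")
```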

jglev commented 6 years ago

I'm working on pushing my changes back to my fork to create a PR, but it turns out GitHub doesn't allow pushing LFS-tracked files to a public fork (https://github.com/git-lfs/git-lfs/issues/1906).

Is there an approach for this that you've used in the past?

jglev commented 6 years ago

A comment on that issue states:

On GitHub.com, you can't push LFS assets to a public fork unless the original repo already has LFS objects, or you have push access to the original repo. This is a server side rule to prevent abuse. Once the public repository has LFS objects, anyone should be able to push LFS objects to their forks. If you see an issue here, let me know the repository fork and the owner trying to push so we can look into it.

dhimmel commented 6 years ago

Okay, I gave you write access, which should allow you to be the first to push LFS assets to a fork. Continue to work through your fork and only edit the greenelab repo via pull request.

jglev commented 6 years ago

Great; thank you!