Following the merge/close of PR #7, I have the downloader running now. While it runs, @dhimmel, perhaps we could discuss what you'd like done with it when it's ready? (Where it should go, whether any transformations / compression need to be run on it, etc.)
So there are two primary data outputs:
1. library_coverage_xml_and_fulltext_indicators.db
2. the TSV exported from that database
In general I like having a data directory to host data files, but wherever they'll fit best is fine. We should track an xz-compressed version of library_coverage_xml_and_fulltext_indicators.db using Git LFS, unless we think the xz-compressed db will exceed 2 GB, in which case we should not track it with git, because GitHub will reject the file.
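For reference, the tracking rule that `git lfs track` would write into .gitattributes could look something like this (the data/ location and .xz suffix are only assumptions at this point, not decisions):

```
data/library_coverage_xml_and_fulltext_indicators.db.xz filter=lfs diff=lfs merge=lfs -text
```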
The next question is whether to compress the TSV. The benefit of compressing is smaller file sizes; with compression, I'd suggest enabling Git LFS for that file. The benefit of not compressing is that we get line-by-line diffs as the file changes. Being able to git diff the TSV can help catch errors and provide feedback on updates. We should compress the TSV if it will exceed 50 MB.
> I have the downloader running now.
On how many DOIs? The initial PR will probably be easiest with few DOIs. Then we can ramp up.
> On how many DOIs?
Since a full download will take several days, my plan is to just leave it running while I'm at work until I have the full dataset (I was starting with the closed DOIs, but also did some downloads just now to make sure that it works if I list all of the categories in the config file). I can always just commit a few records in a PR initially if that makes it easier, though.
> Unless we think the xz-compressed db will exceed 2 GB.
I've downloaded 4,345 DOIs so far, and the database is 27.8 MB. Assuming (perhaps naively, but also, I expect, conservatively) that storage will scale up linearly with more records, that's 27.8/4345 = 0.0064 MB per record. With ~300,000 records, that's 0.006*300,000 = 1800 MB of storage space for slightly more than the whole dataset (which has closer to 290,000 records). Thus, I expect we won't go over the 2 GB limit you mentioned.
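For what it's worth, here is that extrapolation as a quick script; with the unrounded per-record rate it comes out slightly above 1,800 MB, but still comfortably under 2 GB:

```python
# Back-of-the-envelope storage extrapolation using the figures quoted above.
downloaded_dois = 4_345
db_size_mb = 27.8
target_dois = 300_000  # slightly more than the ~290,000 DOIs in the full dataset

mb_per_record = db_size_mb / downloaded_dois  # ~0.0064 MB per record
projected_mb = mb_per_record * target_dois    # ~1,920 MB, under the 2 GB limit
print(f"{mb_per_record:.4f} MB/record -> ~{projected_mb:,.0f} MB projected")
```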
In the data directory you suggest, would you like just the output TSV in there, or the database, too?
> 0.0064 MB per record
Was this with or without compression? I imagine compression will cut down the size considerably. In the long run, I'd like a solution that can accommodate all ~80 million DOIs.
> In the data directory you suggest, would you like just the output TSV in there, or the database, too?
I'd put both in data, as they're both datasets that the analysis generates.
> Was this with or without compression?
That's without any compression.
> I'd put both in data
That sounds good to me. I'll open a PR in a bit with an in-progress script for all this.
To confirm, are you ok with using the tarfile library for the compression aspect, or do you have a preferred alternative?
> are you ok with using the tarfile library for the compression aspect
What are you tarballing (bundling)? Isn't the .db that we'd want to archive just a single file?
When writing the TSV, you can use the lzma module in Python, or, if going via pandas, just specify compression='xz'.
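For example, a minimal sketch of either route; the file name and columns here are placeholders, not the real schema:

```python
import lzma
import pandas

# Placeholder frame standing in for the real query results.
coverage_df = pandas.DataFrame({"doi": ["10.1000/example.1"], "full_text_indicator": [1]})

# Option 1: let pandas handle the xz compression itself.
coverage_df.to_csv("library_coverage.tsv.xz", sep="\t", index=False, compression="xz")

# Option 2: write through the standard-library lzma module.
with lzma.open("library_coverage.tsv.xz", "wt") as write_file:
    coverage_df.to_csv(write_file, sep="\t", index=False)
```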
Oh, I thought you wanted a copy of library_coverage_xml_and_fulltext_indicators.db saved in a tarball! It is a single file.
To confirm, then, you want library_coverage_xml_and_fulltext_indicators.db (not compressed), and then an lzma-compressed TSV -- is that correct?
> To confirm, then, you want library_coverage_xml_and_fulltext_indicators.db (not compressed), and then an lzma-compressed TSV -- is that correct?
No! library_coverage_xml_and_fulltext_indicators.db should certainly be compressed, since it's a binary file. Git diffs are not useful for this file, so we might as well compress it and track it with LFS. Just use XZ... no need for a tarball since it's a single file.
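A minimal sketch of that, assuming a data/ location for both files (nothing about the paths is settled):

```python
import lzma
import shutil

# xz-compress the SQLite database directly; no tar archive needed for a single file.
with open("data/library_coverage_xml_and_fulltext_indicators.db", "rb") as source_file, \
        lzma.open("data/library_coverage_xml_and_fulltext_indicators.db.xz", "wb") as compressed_file:
    shutil.copyfileobj(source_file, compressed_file)
```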
The TSV is where I'm debating whether we want to compress. If we compress, we lose the ability to get the informative git diffs, which show us new DOIs being added.
One final note is that the TSV should be deterministically sorted... I'd suggest by DOI, so that the TSV is always the same unless the underlying data changes.
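For example, something along these lines (pandas-based, with placeholder column names):

```python
import pandas

# Placeholder rows; the real data would come from the database.
coverage_df = pandas.DataFrame({"doi": ["10.1000/b", "10.1000/a"], "full_text_indicator": [0, 1]})

# Sort by DOI (and drop the old index) so repeated exports are byte-identical
# unless the underlying data actually changes.
coverage_df = coverage_df.sort_values("doi").reset_index(drop=True)
```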
I'm working to push my changes back to my fork to create a PR, but GitHub doesn't allow pushing LFS-tracked files into a public fork, it turns out (https://github.com/git-lfs/git-lfs/issues/1906).
Is there an approach for this that you've used in the past?
A comment on that issue states:
> On GitHub.com, you can't push LFS assets to a public fork unless the original repo already has LFS objects, or you have push access to the original repo. This is a server side rule to prevent abuse. Once the public repository has LFS objects, anyone should be able to push LFS objects to their forks. If you see an issue here, let me know the repository fork and the owner trying to push so we can look into it.
Okay, I gave you write access, which should allow you to be the first to push LFS assets to a fork. Continue to work through your fork and only edit the greenelab repo via pull request.
Great; thank you!