Closed pdurbin closed 1 year ago
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:
Thanks for filing the issue. It seems like a good idea. I think users would be surprised to find the format of the data in their binder to be different from what they uploaded. If you have time to create a PR that would be great.
@betatim great, thanks! I just created this issue to track the work on our side:
I'll work on getting it into a sprint.
I just created a pull request to fix this issue:
I tried to follow https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html#guidelines-to-getting-a-pull-request-merged best I could!
Oh, shoot, I forgot to add [MRG] to the title. Will fix. I did try to explain the "why" in the commit message, at least.
Locally on their laptops, they're probably writing scripts to operate on their original file (calendarmonth_fires_SPstate.dta) and are surprised that Binder or Whole Tale are presenting the derived, archival version instead (calendarmonth_fires_SPstate.tab).
Now that @minrk merged my pull request...
... this has been fixed! Thank you!!
Here's how that dataset looks in Dataverse. Note that the .tab versions (the plain text archival versions) are shown in the UI. We click the Binder button...
... wait a bit...
... and in Binder we can now see and operate on the original Stata (.dta) files!
Proposed change
Apologies for not considering this back when ❤️ @Xarthisius ❤️ added the Dataverse content provider back in...
739
... but now that we've enabled the Binder button in Harvard Dataverse ...
... I'm thinking we should teach repo2docker to always download the ORIGINAL format of files rather than what we're doing now.
In short, rather than downloading a file using this URL:
https://dataverse.harvard.edu/api/access/datafile/3323458
We'd like to download the file with this URL instead:
https://dataverse.harvard.edu/api/access/datafile/3323458?format=original
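For illustration, here is a minimal sketch of that URL change in Python. It is not repo2docker's actual code: the helper name `original_format_url` is hypothetical, and it only assumes that the access URL accepts `format=original` as a query parameter, as in the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def original_format_url(url):
    """Return `url` with `format=original` set in its query string.

    Hypothetical helper for illustration only; not part of repo2docker.
    """
    parts = urlsplit(url)
    # Keep any existing query parameters, dropping an earlier "format" value
    # so the parameter is never duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", "original"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(original_format_url("https://dataverse.harvard.edu/api/access/datafile/3323458"))
```

Going through `urllib.parse` instead of plain string concatenation keeps the result valid even if the URL already carries a query string.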
Using the Stata file above as an example (which I found in a comment in dataverse.py), the second URL will instruct Dataverse to download the original calendarmonth_fires_SPstate.dta version of the file rather than the archive-friendly, tab-delimited version calendarmonth_fires_SPstate.tab that Dataverse creates automatically.
Alternative options
This is really in the Dataverse weeds but we could work on this to solve perf problems (but not UX problems):
Who would use this feature?
The main users of this feature would be people who are trying to write scripts to reproduce results. Locally on their laptops, they're probably writing scripts to operate on their original file (calendarmonth_fires_SPstate.dta) and are surprised that Binder or Whole Tale are presenting the derived, archival version instead (calendarmonth_fires_SPstate.tab).
The other users are sysadmins who are noticing perf problems related to that 8524 issue above (again, details you probably don't need to worry about).
How much effort will adding it take?
?format=original is only adding 16 characters but let's say a day for testing, etc.
Who can do this work?
Oh, I could probably do it. Or someone else on the Dataverse team. Or @Xarthisius if he's up for it.
Should one of us go ahead and make a pull request? Please let us know! Thanks.