jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.62k stars, 362 forks

Dataverse content provider: download files in original format #1242

Closed: pdurbin closed this issue 1 year ago

pdurbin commented 1 year ago

Proposed change

Apologies for not considering this back when ❤️ @Xarthisius ❤️ added the Dataverse content provider back in...

... but now that we've enabled the Binder button in Harvard Dataverse 🎉 🎉 🎉 ...

... I'm thinking we should teach repo2docker to always download the ORIGINAL format of files rather than what we're doing now.

In short, rather than downloading a file using this URL:

https://dataverse.harvard.edu/api/access/datafile/3323458

We'd like to download the file with this URL instead:

https://dataverse.harvard.edu/api/access/datafile/3323458?format=original

Using the Stata file above as an example (which I found in a comment in dataverse.py), this will instruct Dataverse to download the original calendarmonth_fires_SPstate.dta version of the file rather than the archive-friendly tab delimited version calendarmonth_fires_SPstate.tab that Dataverse creates automatically.
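As a rough illustration of the proposed change (a minimal sketch, not the actual code in dataverse.py), the download URL could be built like this. The helper names `datafile_url` and `original_format_url` are hypothetical and exist only for this example; the host, file ID, and `format=original` parameter are the ones from the URLs above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def datafile_url(host: str, file_id: int) -> str:
    """Build the plain Data Access URL for a file (hypothetical helper)."""
    return f"{host.rstrip('/')}/api/access/datafile/{file_id}"


def original_format_url(download_url: str) -> str:
    """Return download_url with format=original set in its query string.

    Hypothetical helper: any existing query parameters are kept, and a
    previous "format" value is replaced rather than duplicated.
    """
    parts = urlsplit(download_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", "original"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# The Stata example file from this issue.
print(original_format_url(datafile_url("https://dataverse.harvard.edu", 3323458)))
```

Going through the query string instead of concatenating "?format=original" means the helper gives the same result if it is applied to a URL that already carries the parameter.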

Alternative options

This is really in the Dataverse weeds but we could work on this to solve perf problems (but not UX problems):

Who would use this feature?

The main users of this feature would be people who are trying to write scripts to reproduce results. Locally on their laptops, they're probably writing scripts to operate on their original file (calendarmonth_fires_SPstate.dta) and are surprised that Binder or Whole Tale are presenting the derived, archival version instead (calendarmonth_fires_SPstate.tab).

The other users are sysadmins who are noticing perf problems related to that 8524 issue above (again, details you probably don't need to worry about).

How much effort will adding it take?

?format=original is only adding 16 characters but let's say a day for testing, etc.

Who can do this work?

Oh, I could probably do it. Or someone else on the Dataverse team. Or @Xarthisius if he's up for it.

Should one of us go ahead and make a pull request? Please let us know! Thanks.

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

betatim commented 1 year ago

Thanks for filing the issue. It seems like a good idea; I think users would be surprised to find the format of the data in their Binder to be different from what they uploaded. If you have time to create a PR, that would be great.

pdurbin commented 1 year ago

@betatim great, thanks! I just created this issue to track the work on our side:

I'll work on getting it into a sprint.

pdurbin commented 1 year ago

I just created a pull request to fix this issue:

I tried to follow https://repo2docker.readthedocs.io/en/latest/contributing/contributing.html#guidelines-to-getting-a-pull-request-merged best I could! 😅

Oh, shoot, I forgot to add [MRG] to the title. Will fix. I did try to explain the "why" in the commit message, at least. 😄

pdurbin commented 1 year ago

Locally on their laptops, they're probably writing scripts to operate on their original file (calendarmonth_fires_SPstate.dta) and are surprised that Binder or Whole Tale are presenting the derived, archival version instead (calendarmonth_fires_SPstate.tab).

Now that @minrk merged my pull request...

... this has been fixed! Thank you!!

Here's how that dataset looks in Dataverse. Note that the .tab versions (the plain text archival versions) are shown in the UI. We click the Binder button...

[Screenshot taken 2023-03-29 at 10:01:31 AM]

... wait a bit...

[Screenshot taken 2023-03-29 at 10:02:49 AM]

... and in Binder we can now see and operate on the original Stata (.dta) files!

[Screenshot taken 2023-03-29 at 10:03:59 AM]