ropensci / onekp

Access sequences from the 1000 Plant Initiative (1KP)
https://docs.ropensci.org/onekp

New Maintainer Wanted :-) #9

Open maelle opened 1 year ago

maelle commented 1 year ago

Or new maintainer team. :smile_cat:

:warning: Ideally the new maintainer would look for a better way to access OneKP data than through Google Drive :warning:

If you're interested, please comment in the issue. For more info, see

Cc @ropensci/admin @arendsee

RijanHasTwoEars7 commented 1 year ago

Hi, @maelle

I am interested in learning more about contributing to the package.

I used OneKP data for research in my master's thesis and would love to help if possible. (To be pedantic, I used the OneKP public release data and not the onekp R package.)

In terms of the codebase, my experience with R has been simple scripts here and there so far, and as such, I may or may not be qualified to hit the ground running on day one. If this is not an issue, then I am happy to try.

I tried to use this R package a few months back and had some issues documented here. I could not get the R package working and ended up writing this Python script that automates the data scraping process from the OneKP public release data. So, I am familiar with the data and with automating its retrieval from the web. (Please note: as the data is hosted on Google Drive, I did end up running into Google Drive API/access limitations.)

A few questions I have are:

1. Is data migration the primary goal?

a. Is the data being moved to an in-house data hosting setup?

b. If not, do you have a short list of potential candidates in mind? BackBlaze, Wasabi, NextCloud, an FTP setup?

c. Regardless of the service provider of choice, egress would be an issue with maintaining a project like this, right? So, hosting the data on the web in an accessible manner might come with some recurring costs from the data hosting service. Does rOpenSci sponsor this? If so, do you have documentation on how to set up recurring charges?

2. Is there any etiquette I should make myself familiar with before trying to commit to the project? (For example, is there a minimum number of hours a maintainer should be available each week? Or are there deadlines on submitted issues?)

Sorry if I missed anything in the documentation you posted.

maelle commented 1 year ago

:wave: @RijanDhakal1010! Thank you for volunteering!

> 1. Is data migration the primary goal?
> a. Is the data being moved to an in-house data hosting setup?
> b. If not, do you have a short list of potential candidates in mind? BackBlaze, Wasabi, NextCloud, an FTP setup?
> c. Regardless of the service provider of choice, egress would be an issue with maintaining a project like this, right? So, hosting the data on the web in an accessible manner might come with some recurring costs from the data hosting service. Does rOpenSci sponsor this? If so, do you have documentation on how to set up recurring charges?

The goal would not be to migrate the data, but to contact the OneKP maintainers to find out what the current best way to access their data is: is it Google Drive, or something else?

> Is there any etiquette I should make myself familiar with before trying to commit to the project? (For example, is there a minimum number of hours a maintainer should be available each week? Or are there deadlines on submitted issues?)

Thanks for asking. There's no such guideline with numbers.

I'm happy to answer more questions!

RijanHasTwoEars7 commented 1 year ago

Hello @maelle,

Yes, that cleared up my confusion.

I used the package recently to retrieve some data and the two biggest issues were:

  1. A warning about a deprecated dependency
  2. Noticeably slow data retrieval (700 MB of data took 3.5 hours on a 1000 Mbps home internet connection)
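To put the second point in numbers, here is a minimal sketch of the arithmetic. It assumes that "700 mbs" means 700 megabytes and that the 3.5 hours was spent entirely on the transfer. The function name is illustrative only.

```python
def effective_throughput_mbps(size_megabytes: float, hours: float) -> float:
    """Average transfer rate in megabits per second."""
    seconds = hours * 3600
    megabits = size_megabytes * 8  # 1 byte = 8 bits
    return megabits / seconds


observed = effective_throughput_mbps(700, 3.5)
link_speed_mbps = 1000
fraction_used = observed / link_speed_mbps

print(f"{observed:.2f} Mbps, {fraction_used:.4%} of the link")
# prints "0.44 Mbps, 0.0444% of the link"
```

Under those assumptions the download averaged well under 1 Mbps, so the bottleneck was almost certainly the retrieval path rather than the home connection.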

I think I will try to get in touch with the previous maintainer and/or the OneKP team to explore online storage options.

This seems doable. Happy to get started.

Sincerely, Rijan

maelle commented 1 year ago

:wave: @RijanDhakal1010!

I think it makes more sense to contact the onekp team rather than the previous package maintainer.

Please post again when you know more / when you're more sure you want to become the maintainer so that I might give you access to this repository.

Thanks so much!

maelle commented 1 year ago

:wave: @RijanDhakal1010 were you able to contact the onekp team? No worries if not.

RijanHasTwoEars7 commented 1 year ago

Hi @maelle, I did email the OneKP team but have not heard back from them. I am not sure if I ended up in their spam folder, if it's a defunct email address, or if they have not had the time to get back to me.

I will send a new round of emails and see if I can hear back from them. I am more closely connected with folks who did auxiliary work on the OneKP project. If I do not hear back in some time, I can try to get in touch with the OneKP team through them.

What do you think?

maelle commented 1 year ago

Sounds like a great strategy, thanks so much for your efforts and for the update!

RijanHasTwoEars7 commented 1 year ago

Hi @maelle,

Some updates.

I got a chance to talk to the principal scientist behind the OneKP project, and he is happy to help maintain the accessibility of the data for this project.

Until I talked to him, I did not know that the OneKP project and the rOpenSci onekp R package are two completely separate projects that have not had much "direct" interaction. This is not an issue, but it has left me with a few questions that can only be answered from the rOpenSci side.

The OneKP data is a few hundred gigabytes, and as of now this R package is talking to a copy of the data in a Google Drive account/folder. How familiar are you with this setup? Do you know who has been footing the bill for hosting this data on Google Drive? If rOpenSci has been funding this, can the funding be diverted to an alternative data-hosting resource more suited to this package's goals? Or has the data been sitting in a free Google account for non-profits? If no funding has been allocated so far, can it be allocated now? (Funding does not necessarily have to be capital and could be rOpenSci's in-house computing resources, if available.)

I also found out that the biologists from OneKP had a database server set up for the data, but their academic institution had the servers shut down for cybersecurity reasons pertaining to university policy. So, on-campus FTP/SFTP servers from the original OneKP scientists are most likely not an option.

The original genomic data for the OneKP project does sit on Cyverse, which could be a makeshift alternative for hosting this data but would most likely require some significant changes in the code-base to switch from something like google drive. I say makeshift because this is a back-end dependency which may or may not be as reliable as google drive (the current problems from google drive notwithstanding).

What do you think?

Sincerely, Rijan

maelle commented 1 year ago

Thanks a ton @RijanDhakal1010!! I don't know anything about the original Google Drive setup, I'd recommend contacting @arendsee directly. Sorry to not be of more help!

RijanHasTwoEars7 commented 1 year ago

@maelle No worries! Will do!

maelle commented 1 year ago

@RijanDhakal1010 do you need an invitation to the rOpenSci Slack workspace? If so, to which email address? Cc @yabellini

RijanHasTwoEars7 commented 1 year ago

@maelle I do need an invitation. Please send it to rijan_dhakal@outlook.com. Thank you!

maelle commented 1 year ago

Thank you! Note that invites are sent more or less weekly.

arendsee commented 1 year ago

Hi @RijanDhakal1010, so a bit of history. When I first implemented onekp it used the old FTP server and everything was awesome. I talked briefly to the database manager on the cyverse side. Eric Carpenter, I believe his name was. But the FTP site was working fine, so I was not motivated to change.

Then the transition to Google Drive happened and the onekp package blew up. You can check out the comments in #2 for a bit of context. joelnitta found a workaround. That was back in 2019.

A year later the package blew up again, see #3. And I hacked a solution in the suspiciously named commit "Fix #3 - possibly lose portability to windows". There I used a system call to curl. That was definitely not a good idea.

It might be a good idea to ditch Google Drive entirely and try to interface with cyverse. This might be a lot of work. You could ask the cyverse people if they have an API. In addition to all the complexity and bugs it causes, Google Drive is blocked in several countries.

arendsee commented 1 year ago

Oh, and Rijan you are an awesome person! I think you are on the right path and it is great that you have been in contact with OneKP team. It is easy to make packages like onekp, but it is much harder to inherit and maintain them.

RijanHasTwoEars7 commented 1 year ago

Hi @arendsee ,

Thank you for reaching out!

I can absolutely respect and understand why you had to use Google Drive. The issues notwithstanding, Google's scale is definitely a plus.

I think Cyverse will have to be the route to go for the back-end. I believe their data store is built on iRODS, which comes with command-line tools (iCommands) for interactive and automated data retrieval (could be wrong).

I will get back to this thread once I get a chance to search/read a bit more into Cyverse's APIs.

Sincerely, Rijan

RijanHasTwoEars7 commented 1 year ago

Hi @arendsee,

When you said you initially had an FTP server as the backend, was it the Cyverse SFTP API by any chance? If yes, then I might have accidentally re-invented the wheel. If not, it seems Cyverse does provide a currently stable SFTP interface to their public data folders. I have been using curl to interact with the data, and so far the data transfer has been fast and convenient.

I have not run into any rate limits so far interacting with Cyverse via SFTP, but I have emailed them to see if they have any rate limit policies that might hinder this idea (hopefully not!). If not, then I think this is a viable solution to move forward with. The SFTP interface provides an easy way for people to anonymously access the public folder, and R seems to have a functional SFTP interaction package. So, we could use SFTP within R and Cyverse's public folders to replace Google Drive.
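As a rough illustration of the curl-over-SFTP approach described above, the sketch below builds (but does not run) a curl command for a single anonymous download. The host name, remote path, and the `anonymous` user with an empty password are all placeholders assumed for illustration; the thread does not state the real values. It also assumes a curl build with SFTP support.

```python
import shlex
from typing import List

# Hypothetical values for illustration only; the thread does not give the
# actual host name or folder layout.
EXAMPLE_HOST = "sftp.example.org"
EXAMPLE_REMOTE_PATH = "/public/onekp/example-file.fa.gz"


def build_sftp_curl_command(host: str, remote_path: str, local_path: str,
                            user: str = "anonymous") -> List[str]:
    """Return a curl argument list that downloads one file over SFTP."""
    if not remote_path.startswith("/"):
        raise ValueError("remote_path must be absolute")
    url = f"sftp://{host}{remote_path}"
    return [
        "curl",
        "--silent", "--show-error",  # no progress meter, but still print errors
        "--user", f"{user}:",        # assumed: anonymous login, empty password
        "--output", local_path,      # write the file here instead of stdout
        url,
    ]


command = build_sftp_curl_command(
    EXAMPLE_HOST, EXAMPLE_REMOTE_PATH, "example-file.fa.gz"
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would perform the transfer. Building the arguments as a list rather than a shell string avoids quoting problems with unusual file names.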

Sincerely, Rijan

arendsee commented 1 year ago

Hi @RijanDhakal1010, no, I never worked with the Cyverse API. Looks like you are onto a good solution!

RijanHasTwoEars7 commented 1 year ago

Hi @arendsee ,

Awesome! I also just heard back from Cyverse. They said they do not throttle egress so long as the number of concurrent connections is kept reasonable. So this works out as a viable solution to replace google drive.
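One way to keep the number of concurrent connections "reasonable" is to route every download through a fixed-size worker pool. The sketch below is an assumption about how that could look, not the package's implementation: the cap of 4, the `fetch_all` helper, and the `fake_fetch` stand-in are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable

MAX_CONNECTIONS = 4  # assumed value; the thread only says "reasonable"


def fetch_all(paths: Iterable[str], fetch_one: Callable[[str], bytes],
              max_connections: int = MAX_CONNECTIONS) -> Dict[str, bytes]:
    """Fetch every path with at most `max_connections` transfers in flight."""
    path_list = list(paths)
    # The pool size is the cap: no more than `max_connections` worker threads
    # ever run `fetch_one` at the same time.
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        results = list(pool.map(fetch_one, path_list))
    return dict(zip(path_list, results))


def fake_fetch(path: str) -> bytes:
    """Stand-in for a real SFTP download, so the sketch runs offline."""
    return f"contents of {path}".encode()


downloaded = fetch_all([f"/public/file{i}.fa" for i in range(10)], fake_fetch)
print(len(downloaded))  # prints 10
```

In real use, `fake_fetch` would be replaced by a function that performs one SFTP transfer. The cap then holds however many files are requested.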

Thank you!

Sincerely, Rijan

RijanHasTwoEars7 commented 1 year ago

Hi @maelle,

I think I have everything I need to start changing the back-end of the package from Google Drive to Cyverse. Do you want me to take over the repo, or work on it on my end and then make a pull request?

Sincerely, Rijan

maelle commented 1 year ago

@RijanDhakal1010 I've now invited you to the rOpenSci GitHub organization, and to a team with admin access to this repository! Sorry I hadn't done it earlier. Thanks so much for all your work on this!

For info we recently created a cheatsheet for maintainers of rOpenSci packages: https://devdevguide.netlify.app/maintenance_cheatsheet.html

RijanHasTwoEars7 commented 1 year ago

@maelle , Got it and accepted. Thank you!

maelle commented 1 year ago

@RijanDhakal1010 could you please update DESCRIPTION to change the maintainer? (removing the "cre" role from the previous maintainer, adding yourself with roles "aut" and "cre"). Thanks a lot!

RijanHasTwoEars7 commented 1 year ago

Hi @maelle, apologies for the delay! I was able to clone the original repo to my machine (so I have read rights), but I am not able to push my changes. The specific error is: github permission denied class=Ssh (23); code=Eof (-20). Is this something on rOpenSci's end or mine?

Sorry this was not brought to your attention earlier. I have been working with the old code on a local fork and only just noticed the problem.

maelle commented 1 year ago

No worries, I've sent you a new invitation to the GitHub organization and to a GitHub team with admin access to this repo. Note that you will need to have 2FA enabled; see https://docs.github.com/en/authentication/securing-your-account-with-two-factor-authentication-2fa/configuring-two-factor-authentication

maelle commented 7 months ago

@RijanDhakal1010 did you end up getting access?

RijanHasTwoEars7 commented 7 months ago

Hi @maelle ,

I do have access to the repo and the data. I apologize but ever since I signed up here my professional responsibilities expanded somewhat unexpectedly and I was unable to debug the local branch for the repo that I have as much as I wanted to. But I have a rudimentary framework for how I want to apply the changes required by the package.

Right now, the biggest issue is less the code than the backend data. This package is a way to access the data published for the OneKP project, which itself was run by a consortium of scientists. The Google Drive backend as implemented right now hosts data that differs significantly from the data that is in the public domain. The great thing about the public domain data is that it is hosted by an accredited research organization with reasonably generous access/egress. Switching to it will have great benefits in the long run, but it comes at the cost of making any new changes to the package backwards-incompatible.

I am of the mind that losing the 30-ish species that are missing from the public domain data is a cost worth paying for switching to a better-backed source. If that does not go against rOpenSci policy, then I am happy to get the ball rolling in that direction.

Once again, I apologize for my tardiness here!

Sincerely, Rijan

maelle commented 7 months ago

Hi @RijanDhakal1010! Congrats on the job expansion!

> I am of the mind that losing the 30-ish species that are missing from the public domain data is a cost worth paying for switching to a better-backed source. If that does not go against rOpenSci policy, then I am happy to get the ball rolling in that direction.

It's your package so you are the one to decide! For what it's worth, to me your arguments sound perfectly good!

Cheers