numfocus / gsoc

NumFOCUS Google Summer of Code Materials
http://numfocus.org/programs/gsoc.html

Add taxonomic name resolution to the EcoData Retriever to facilitate data science approaches to ecology - doubts #105

Closed ghoshbishakh closed 8 years ago

ghoshbishakh commented 8 years ago

@ethanwhite @henrykironde Hi, I am Bishakh Ghosh, and I am really interested in participating in GSoC 2016 with the EcoData Retriever. I have cloned the source code, installed it, and was able to download a couple of datasets. I have made one PR, https://github.com/weecology/retriever/pull/442, although I don't know if it is any good.

I have knowledge of Python and some web frameworks like Django and Flask, and I know how to use REST APIs. I also have some knowledge of SQL and ORMs like SQLAlchemy. Here is a list of my work: http://ghoshbishakh.github.io/blog/about/

What should be my next step? Should I try to make a sample app that fetches data from iPlant's Taxonomic Name Resolution Service?

Also, is it possible to apply for both projects, as both seem interesting to me? Since I know web APIs a bit, my first preference will be adding taxonomic name resolution to the EcoData Retriever.

ethanwhite commented 8 years ago

Hi @ghoshbishakh - I'm glad to hear that you're interested in working with us on the EcoData Retriever. Thanks for already contributing. The first PR was definitely a nice addition!

Yes, I think that doing some experimenting with either iPlant or another TNRS is a good next step. One good way to do that would be using pytaxize, which is a Python module for interacting with a number of different TNRSs.

It's written by @sckott who might be involved in this project as a mentor. Speaking of which - @sckott, would you be willing to serve on this with @henrykironde and me if this project ends up getting supported?

ethanwhite commented 8 years ago

@ghoshbishakh - I can't remember the exact GSoC rules. I know you can apply to multiple projects, but I'm not sure whether or not you can apply to multiple projects within the same organization. My general suggestion would be to chat with us about the different ideas/options and then write a single EcoData Retriever proposal, to allow you to focus fully on making that the best possible proposal. That said, as long as it's within GSoC's rules, I don't have any personal objection to you submitting two proposals.

sckott commented 8 years ago

hi @ghoshbishakh ! Yes, willing to serve with @henrykironde and you.

rgaiacs commented 8 years ago

> I'm not sure whether or not you can apply to multiple projects within the same organization.

Google changed the system that handles the applications. We can test this next Monday.

@sckott Could you email me (raniere@rgaiacs.com) so that I can invite you as a mentor? Thanks.

ghoshbishakh commented 8 years ago

@ethanwhite

I am concentrating on taxonomic name resolution then!

> Yes, I think that doing some experimenting with either iPlant or another TNRS is a good next step.

I was experimenting with iPlant and will study pytaxize. But I am not completely clear about the process of reconciliation of different species names.

> One of the challenges of ecological (and evolutionary) data is that the names of species are constantly being redefined. This makes it difficult to combine datasets to do interesting science. By automating reconciliation of different species names as part of the process of accessing the data..

Can you please provide the names of some datasets where this problem is very apparent? And maybe some examples of how some of the names can be reconciled? Or is there any general literature that would be helpful in understanding the taxonomic name resolution process?

Another doubt I have: will this be a separate option that we can apply to a specific database after it has already been fetched, or will it act automatically when the data is downloaded and inserted into the database?

Lastly, would it be reasonable to make the correction of names in a non-destructive way, like creating a new database with corrected data or adding a new column in the tables that will contain the corrected data?

ethanwhite commented 8 years ago

> Can you please provide the names of some datasets where this problem is very apparent?

Plants typically have the worst issues in this regard, so the Gentry dataset would probably be a good example of data that would have a lot of species names being replaced.

> And maybe some examples of how some of the names can be reconciled? Or is there any general literature that would be helpful in understanding the taxonomic name resolution process?

@sckott - can you take this one?

> Another doubt I have: will this be a separate option that we can apply to a specific database after it has already been fetched, or will it act automatically when the data is downloaded and inserted into the database?

This would happen automatically as the data is being downloaded and inserted into the database.

> Lastly, would it be reasonable to make the correction of names in a non-destructive way, like creating a new database with corrected data or adding a new column in the tables that will contain the corrected data?

I think doing this non-destructively is definitely the way to go, either by adding a new table with just that information, or by attaching a new column(s) to an existing table containing species names.
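The separate-table option could look roughly like the sketch below, which uses Python's built-in `sqlite3` module. It is only an illustration, not the retriever's actual code: the table and column names (`species`, `resolved_names`, `original_name`, `resolved_name`) and the placeholder species names are all assumptions made for the example.

```python
import sqlite3


def store_resolved_names(conn, replacements):
    """Record resolved names in a separate table, leaving the original data untouched.

    `replacements` maps an original name to its resolved name.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resolved_names ("
        "original_name TEXT PRIMARY KEY, resolved_name TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO resolved_names VALUES (?, ?)",
        list(replacements.items()),
    )
    conn.commit()


def names_with_resolution(conn):
    """Return (original, resolved) pairs; resolved falls back to the original name."""
    query = (
        "SELECT s.name, COALESCE(r.resolved_name, s.name) "
        "FROM species AS s "
        "LEFT JOIN resolved_names AS r ON r.original_name = s.name "
        "ORDER BY s.name"
    )
    return conn.execute(query).fetchall()


# Hypothetical in-memory example with placeholder names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (name TEXT)")
conn.executemany(
    "INSERT INTO species VALUES (?)",
    [("Genus oldname",), ("Genus stable",)],
)
store_resolved_names(conn, {"Genus oldname": "Genus newname"})
rows = names_with_resolution(conn)
```

Because the replacements live in their own table, the original `species` rows are never modified, and the resolved names can be dropped or regenerated at any time.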

ethanwhite commented 8 years ago

@ghoshbishakh - it may also be worth having a look at this issue from last year, #1, for some more discussion. We didn't end up having a slot for this last year, which is why the project is still available, but there's some useful conversation about general approaches to this problem.

ghoshbishakh commented 8 years ago

@ethanwhite Thanks :smile: I am looking at pytaxize and will try to use it in the EcoData Retriever. However, I have found some bugs in pytaxize and I am trying to resolve them first.

sckott commented 8 years ago

sorry about the bugs @ghoshbishakh - haven't spent much time on it lately :)

> And maybe some examples of how some of the names can be reconciled?

this depends on how you want to do it. pytaxize connects to a variety of web APIs for taxonomic databases, and some that are specifically meant for name resolution. e.g., you could query NCBI (a service that doesn't do name resolution, but rather just gives you taxonomic data) for names then do any name comparison/resolution client side in Python - OR you could use one of the APIs specifically for resolution and then just swap in the fixed names in Python
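As a rough sketch of the second option (use a resolution service, then swap in the fixed names in Python), something like the following could work. The resolver is passed in as a plain callable so that no particular pytaxize function or signature is assumed; `fake_resolver` and the placeholder names are hypothetical stand-ins for a real web-service call.

```python
def swap_in_resolved_names(names, resolve):
    """Return a new list with each name replaced by its resolved form.

    `resolve` is any callable that takes a name and returns the accepted
    name, or None when the service has no match.
    """
    resolved = []
    for name in names:
        accepted = resolve(name)
        # Keep the original name when no match is found.
        resolved.append(accepted if accepted else name)
    return resolved


# Hypothetical stand-in for a call to a name-resolution web service.
def fake_resolver(name):
    lookup = {"Genus oldname": "Genus newname"}
    return lookup.get(name)


result = swap_in_resolved_names(["Genus oldname", "Genus unknown"], fake_resolver)
```

For the first option, the same function could be reused with a `resolve` callable that does its own client-side comparison against names fetched from a taxonomic database.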

> Or is there any general literature that would be helpful in understanding the taxonomic name resolution process?

There is some literature on this topic. e.g.,

ghoshbishakh commented 8 years ago

@sckott bugs are not a problem. I am trying to understand how it works, so I will be glad to fix any bugs if I can!

ghoshbishakh commented 8 years ago

For the Gentry database, querying tnrs for each row with pytaxize.tnrs_resolve() takes a really long time, as the dataset itself is quite huge. It is also completely dependent on the network speed, and I really do not have a responsive internet connection. So I was wondering if this process can be moved to the cloud. The dataset would first be downloaded to the server, acted on by the tnrs, and returned to the client?

ethanwhite commented 8 years ago

> For the Gentry database, querying tnrs for each row with pytaxize.tnrs_resolve() takes a really long time, as the dataset itself is quite huge.

The good news is that Gentry is pretty much the worst-case scenario, with about 7500 species values needing to be checked. We only need to check things that have been identified all the way to species (i.e., full_id == 1). We'll only want to check each unique species_id once, for a total of 7354 checks, which means we will want to cache the species name replacements. If we do this in a pickled dictionary we could keep those checks semi-permanently, so that if the internet connection fails all of the lookups that have already been done will still be there after a restart.
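A minimal sketch of that caching idea is below. It assumes the records arrive as `(species_name, full_id)` pairs and that `resolve` is some callable wrapping a TNRS lookup; both are assumptions for illustration (the real data is keyed by species_id), and `fake_resolver` is a hypothetical stand-in for the network call.

```python
import os
import pickle
import tempfile


def load_cache(path):
    """Load the name-replacement dictionary, or start empty if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return {}


def save_cache(cache, path):
    """Write the cache to disk so finished lookups survive a restart."""
    with open(path, "wb") as handle:
        pickle.dump(cache, handle)


def resolve_unique(records, resolve, cache_path):
    """Resolve each unique, fully identified species name once.

    `records` is an iterable of (species_name, full_id) pairs.
    """
    cache = load_cache(cache_path)
    unique_names = {name for name, full_id in records if full_id == 1}
    for name in sorted(unique_names):
        if name in cache:
            continue  # already looked up in an earlier run
        cache[name] = resolve(name)
        save_cache(cache, cache_path)  # save after every lookup
    return cache


# Hypothetical stand-in for a network call to a TNRS.
calls = []


def fake_resolver(name):
    calls.append(name)
    return name.replace("oldname", "newname")


records = [("Genus oldname", 1), ("Genus oldname", 1), ("Genus sp.", 0)]
cache_path = os.path.join(tempfile.mkdtemp(), "name_cache.pkl")
first = resolve_unique(records, fake_resolver, cache_path)
second = resolve_unique(records, fake_resolver, cache_path)
```

Saving after every lookup is slower than saving once at the end, but it means a dropped connection partway through loses at most the lookup that was in flight.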

For thinking about the problem and exploring the design you should definitely feel free to just check a small sample of the species in Gentry.

> I was wondering if this process can be moved to the cloud. The dataset would first be downloaded to the server, acted on by the tnrs, and returned to the client?

Unfortunately not. The retriever is designed to work locally like this because most of the datasets it handles either lack a formal license or have explicit "do not redistribute" clauses in their licenses. This means that we can't clean up the data and then provide it to the end users.

ethanwhite commented 8 years ago

@ghoshbishakh - another possible option would be to grab a dump of the full database when necessary and use it locally. @sckott has just announced a new R package for this: https://discuss.ropensci.org/t/taxonomic-databases-from-r/337 Adding these databases to the retriever and then using them for TNRS could be added to the proposal if you thought this was a good way to go.

ghoshbishakh commented 8 years ago

@ethanwhite Do all tnrs provide dumps?

Also, I guess these databases provide only the taxonomic data dumps, so we would have to implement the matching and resolving algorithm on top of them ourselves, whereas iPlant gives a ready-to-use TNRS. Am I understanding this correctly?

ethanwhite commented 8 years ago

I'm not sure. @sckott?

On Thursday, March 17, 2016, Bishakh Ghosh notifications@github.com wrote:

> @ethanwhite Do all tnrs provide dumps?


sckott commented 8 years ago

> Do all tnrs provide dumps?

No. And not all taxonomic databases do TNRS-like things per se (they may have some fuzzy search built in, but are not meant to specifically do name resolution).

The thing most web services use is some implementation of taxamatch. The reference implementation is some weirdness implemented in Oracle PL/SQL

not sure if there's a python port of taxamatch
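To get a feel for client-side fuzzy matching against a local list of names, a simple stand-in can be built on the standard library's `difflib`. To be clear, this is not taxamatch and makes no attempt to reproduce its algorithm; it is plain string similarity, and the `cutoff` value and the short reference list are arbitrary choices for illustration.

```python
import difflib


def fuzzy_match(name, reference_names, cutoff=0.8):
    """Return the closest reference name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A tiny reference list; a real one would come from a taxonomic database dump.
reference = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
best = fuzzy_match("Quercus albaa", reference)
missing = fuzzy_match("Zzzz yyyy", reference)
```

Here the misspelled `"Quercus albaa"` matches `"Quercus alba"`, while the nonsense name returns `None`. Generic string similarity knows nothing about taxonomy, so it would likely need extra rules before being trusted on real data.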

sckott commented 8 years ago

regarding the R pkg above for downloading taxonomic databases, there's just a thin layer there to help people that aren't familiar with databases, so it's easy enough to just re-write that for Python, and there are others to include: e.g., NCBI is a big one - I know I've seen Python scripts for downloading NCBI's weird dump format

ethanwhite commented 8 years ago

@ghoshbishakh - Just wanted to drop you a quick note to recommend that you post a draft proposal prior to the Friday deadline if you'd like us to comment and provide recommendations on it. That can either be done as a PR to this repo or using Google Docs through the GSoC website. If you post to the GSoC website please ping us in this issue so that we know to go check it out.

ethanwhite commented 8 years ago

Just a last reminder that final proposals are due on https://summerofcode.withgoogle.com/ today. I don't currently see your proposal there so just wanted to make sure you have the chance to submit one if you want to.

ghoshbishakh commented 8 years ago

@ethanwhite really sorry for the late reply. I am really interested in this project, but I was also looking at another proposal and had already submitted the application for that one. I am reluctant to submit multiple proposals, so I was improving the previous one.

ethanwhite commented 8 years ago

@ghoshbishakh - no problem at all. I think that's a good call. I just do my best to make sure that folks don't accidentally let a deadline slip by. Best of luck with your other proposal!

ghoshbishakh commented 8 years ago

@ethanwhite thanks!