ropensci / traits

R package for accessing species trait data from multiple databases

Extending ncbi_byid #101

Closed boopsboops closed 6 years ago

boopsboops commented 6 years ago

Hi @sckott et al (and @dwinter too),

I just stumbled on the function ncbi_byid after it cropped up on the r-sig-phylo list.

I'm working on assembling reference libraries for eDNA metabarcoding, and being able to curate, update and review the quality of those reference libraries is really important. Having this data in a table, rather than in fasta format, is therefore essential.

Your function seems super quick, and gives back a lovely table. However, it's lacking a lot of the fields I would require for better filtering and quality control of reference libraries, such as lat_lon, country, specimen_voucher, publication status, and so on.

All this kind of info is usually in the GenBank metadata, and I've already implemented a function to "tabulize" it (gb2df), but I think you will agree it's an absolute abomination. I used EBI rather than NCBI, as it's much faster for downloading large numbers of sequences to a local tempfile, and then did a lot of multithreaded XML scraping (which is very inefficient):

https://github.com/boopsboops/SeaDNA/blob/master/scripts/gb2df_example.R

https://github.com/boopsboops/SeaDNA/blob/master/scripts/gb2df.R

So, if you guys think that these would be important or appropriate additions to ncbi_byid, I'll be more than happy to help (although I confess to not really getting my head around how the function works yet).

Cheers,

Rupert

sckott commented 6 years ago

thanks for the issue @boopsboops

There's a few things going on here, so separating them out:

also:

@dwinter is there anything else to mention here?

boopsboops commented 6 years ago

Hi @sckott

So, I had a much better look at your code, and it seems fairly trivial to insert the things I need; I have successfully done so. What threw me was that you were using the "GBSeq" XML style, which I had never seen before and did not know was available.
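For anyone else unfamiliar with the GBSeq layout, here is a minimal sketch of the general idea of pulling source qualifiers into table rows. It is in Python with only the standard library, and it is not the code from ncbi_byid or from my changes. The sample record, its accession `XX000001`, and the helper name `gbseq_to_rows` are made up for illustration. The element names (`GBSeq`, `GBFeature`, `GBQualifier_name`, `GBQualifier_value`) follow the GBSeq XML format as I understand it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written GBSeq records (not real GenBank entries).
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>XX000001</GBSeq_primary-accession>
    <GBSeq_organism>Gadus morhua</GBSeq_organism>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>country</GBQualifier_name>
            <GBQualifier_value>United Kingdom</GBQualifier_value>
          </GBQualifier>
          <GBQualifier>
            <GBQualifier_name>specimen_voucher</GBQualifier_name>
            <GBQualifier_value>VOUCHER-001</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

# Qualifiers to lift out of the "source" feature; extend as needed.
WANTED = ("lat_lon", "country", "specimen_voucher")


def gbseq_to_rows(xml_text, wanted=WANTED):
    """Return one dict per GBSeq record; missing qualifiers are None."""
    root = ET.fromstring(xml_text)
    rows = []
    for seq in root.iter("GBSeq"):
        row = {
            "accession": seq.findtext("GBSeq_primary-accession"),
            "organism": seq.findtext("GBSeq_organism"),
        }
        # Pre-fill so every row has the same columns.
        row.update({name: None for name in wanted})
        for feature in seq.iter("GBFeature"):
            # Only the "source" feature carries the specimen metadata.
            if feature.findtext("GBFeature_key") != "source":
                continue
            for qual in feature.iter("GBQualifier"):
                name = qual.findtext("GBQualifier_name")
                if name in wanted:
                    row[name] = qual.findtext("GBQualifier_value")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in gbseq_to_rows(SAMPLE_XML):
        print(row)
```

Pre-filling the wanted columns with `None` keeps every row the same shape, so records that lack a qualifier (here, `lat_lon`) still fit in a table.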

Anyway, if you think it will be valuable to offer these extra fields in your function, I can post the code here or do a pull request. If not, I'm happy to just fork and use it locally for my own work.

Cheers!

sckott commented 6 years ago

Thanks for the follow-up. A PR with your changes sounds good.

sckott commented 6 years ago

@boopsboops does your merged PR solve all issues you had?

boopsboops commented 6 years ago

Sorry for the delay. Yes, it did. Thanks for the help :)