mozillascience / software-discovery-dashboard

MIT License

How do we get *code* metadata though. #71

Open lukecoy opened 8 years ago

lukecoy commented 8 years ago

I love our repository harvesting. It's so great. I am a huge fan. :100: :100:

But, as per the synopsis and, well, the title of this project... we need to be able to collect software. Harvesting repositories from sites like DataCite, Figshare, etc. is nice, as these sites have the capacity to host research code. But how can we specifically filter these repositories so we are only showing research software?

lukecoy commented 8 years ago

I learned a lot about the metadata fields at CodeMeta. And I think that to solve this problem (at least for Figshare), it may be possible to use DataCite entirely.

I wasn't too familiar with DataCite's Solr query syntax, nor did I write our DataCite client (which is cool), but I figured out how to do a few things I don't think we knew we could do. For starters, say we wanted to search for biology on DataCite, BUT filter the search to only show GitHub-related repositories, to get more perspective on software in DataCite. We can do that with this query: http://search.datacite.org/api?wt=json&q=biology&fq=relatedIdentifier:*github*
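To make the filter query above easier to reuse, here is a minimal Python sketch that builds the same URL from a search term and a list of Solr `fq` filters. The function name `build_datacite_url` is hypothetical (it is not part of our DataCite client), and the endpoint is copied verbatim from the comment, so it may no longer be live. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Endpoint copied from the comment above; assumed, not verified to be live.
DATACITE_SEARCH = "http://search.datacite.org/api"


def build_datacite_url(query, filters=()):
    """Build a search URL with one Solr `fq` parameter per filter.

    `build_datacite_url` is a hypothetical helper for illustration only.
    """
    params = [("wt", "json"), ("q", query)]
    params.extend(("fq", f) for f in filters)
    # Keep ':' and '*' unescaped so the URL reads like the hand-written one.
    return DATACITE_SEARCH + "?" + urlencode(params, safe=":*")


url = build_datacite_url("biology", ["relatedIdentifier:*github*"])
print(url)
```

Passing more than one filter adds more `fq` parameters, which Solr combines so that a result has to match all of them.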

What I noticed is that resourceType often seems to go unused. BUT resourceTypeGeneral, a field that isn't in the CodeMeta crosswalk (and that I think should maybe be there?), is always filled. And it is ALWAYS filled with Software in the example I just gave, and under many other search criteria.
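One way to check that observation on a larger result set is to tally the resourceTypeGeneral values in a response. The sketch below assumes the standard Solr JSON layout (`response.docs`); the helper name `count_resource_types` and the sample records (including their DOIs) are made up for illustration and are not real DataCite data.

```python
from collections import Counter


def count_resource_types(solr_response):
    """Tally resourceTypeGeneral across the docs of a Solr JSON response.

    Hypothetical helper; assumes docs live under response -> docs.
    """
    docs = solr_response.get("response", {}).get("docs", [])
    return Counter(doc.get("resourceTypeGeneral", "<missing>") for doc in docs)


# Made-up sample shaped like a Solr response, NOT real DataCite results.
sample = {
    "response": {
        "numFound": 3,
        "docs": [
            {"doi": "10.0000/example-1", "resourceTypeGeneral": "Software"},
            {"doi": "10.0000/example-2", "resourceTypeGeneral": "Software"},
            {"doi": "10.0000/example-3"},  # field left empty on purpose
        ],
    }
}

counts = count_resource_types(sample)
print(counts)
```

Any non-zero `<missing>` count on real results would mean the field is not as reliably filled as it looks.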

lukecoy commented 8 years ago

So, that being said, how does this apply to a research repository that might /not/ contain code, like Figshare? GitHub only contains code so that's easy.

Well I'm glad you asked. What I'm thinking is we can combine the publisher field with the resourceTypeGeneral field to literally solve this problem COMPLETELY.

So basically, we need a query that ensures that resourceTypeGeneral:Software AND publisher:Figshare (well, at least in this one use case).

So, here's that query, which returns only results that are published by Figshare and typed as software. Boom.

http://search.datacite.org/api?wt=json&q=science&fq=publisher:Figshare&fq=resourceTypeGeneral:Software&start=0
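As a sketch of how the combined filter could be generated for any publisher, the Python below rebuilds the URL above from its parts. The function name `software_search_url` is hypothetical, and the endpoint and field names are taken from the query in this comment, not from a verified API reference. Nothing is fetched.

```python
from urllib.parse import urlencode

# Endpoint copied from the query above; assumed, not verified to be live.
DATACITE_SEARCH = "http://search.datacite.org/api"


def software_search_url(query, publisher, start=0):
    """Build a search URL restricted to one publisher's software records.

    Hypothetical helper: each entry in the `fq` list becomes its own
    `fq` parameter, so both conditions have to hold.
    """
    params = {
        "wt": "json",
        "q": query,
        "fq": [f"publisher:{publisher}", "resourceTypeGeneral:Software"],
        "start": start,
    }
    # doseq=True expands the list into repeated fq parameters;
    # safe=":" keeps the field:value separator readable.
    return DATACITE_SEARCH + "?" + urlencode(params, doseq=True, safe=":")


url = software_search_url("science", "Figshare")
print(url)
```

Swapping the `publisher` argument would be the obvious way to try the same filter on another host, assuming that host's records fill in the same two fields.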

lukecoy commented 8 years ago

So... since any Figshare repos that have a citable DOI get a DataCite DOI (see here & here), that solves that problem.

I'm not sure if Zenodo works in the same way. From a really quick look at their API, the "Metadata Formats" section specifically seems promising. Super promising.

Guys... I think we just solved the big problem. CC literally the entire universe @mok4ry @amb8805 @Lettuceman44 @acabunoc

amb8805 commented 8 years ago

That's awesome!! Let's see what we can do with it!