howisonlab / softcite-dataset

A gold-standard dataset of software mentions in research publications.

websites #611

Closed by kermitt2 5 years ago

kermitt2 commented 5 years ago

In these two documents, for instance, we can see a large number of website platforms annotated as software (airbnb, facebook, blablacar, neighborgoods, justpark, Craigslist, ebay, etc.):

10.1007%2Fs00191-017-0548-y 10.1111%2Fjems.12230_TL02

More generally, names such as facebook or yahoo! are sometimes annotated as software.

To help identify them, I have disambiguated all software names against Wikidata. As websites are common and well covered by Wikidata, the website names get disambiguated, and the Wikidata statements then contain an explicit categorization as website.

See https://raw.githubusercontent.com/Impactstory/software-mentions/master/doc/software-term-vector-disambiguated.json

caifand commented 5 years ago

Thanks! The disambiguated software names are really useful! I selected the software items whose "category" field contains any of the substrings ['website', 'platform', 'company', 'service'], to maximize the results. I then manually checked, by web search and by reading each mention's context, whether the filtered items are indeed web platforms. I ended up with a list of 39 software names: web_platform_disambiguated.txt (I just learned that GitHub does not allow attached files in csv format ;()

The mention_type of these 39 software_name labels has already been changed to "web platform" in our dataset.

Some web platforms, such as neighborgoods and justpark, which @kermitt2 mentioned, do not have detailed information in Wikidata. I just modified their mention_type in the dataset. In the future, such cases need to be manually coded (our coding guideline has already been updated accordingly).

Other things that Wikidata misses: some software name labels in our dataset, such as DINO, DTS, Nutritionist, Scion, TM4, and VISTA, are actual research software names, but in Wikidata they are homonyms referring to entities other than software (~8% of this non-representative sample :)

The CSV outputs in the repo have already been updated!
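For anyone who wants to repeat the category filter described above, here is a minimal sketch in Python. It only illustrates the substring matching. The record layout is an assumption: the keys `software_name` and `category`, and the idea that `category` may be a string or a list of strings, are guesses rather than the confirmed structure of software-term-vector-disambiguated.json. The sample records are made up for illustration.

```python
# Substrings taken from the comment above; everything else is an assumption.
KEYWORDS = ("website", "platform", "company", "service")


def is_web_platform_candidate(record, keywords=KEYWORDS):
    """Return True if the record's "category" text contains any keyword."""
    category = record.get("category") or ""
    # Assumed: the field may be one string or a list of strings.
    if isinstance(category, (list, tuple)):
        category = " ".join(str(c) for c in category)
    category = str(category).lower()
    return any(k in category for k in keywords)


def select_candidates(records, keywords=KEYWORDS):
    """Return the names of matching records, preserving input order."""
    return [
        r["software_name"]
        for r in records
        if is_web_platform_candidate(r, keywords)
    ]


# Hypothetical sample records, not taken from the real file.
sample = [
    {"software_name": "facebook", "category": "social networking service"},
    {"software_name": "ebay", "category": ["online marketplace", "website"]},
    {"software_name": "SPSS", "category": "statistical software"},
    {"software_name": "neighborgoods"},  # no category: missed by the filter
]

candidates = select_candidates(sample)
print(candidates)
```

The output is only a candidate list. As noted above, the filtered names still need a manual check, and names with no detailed Wikidata information (the last sample record) are not caught by the filter at all.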