howisonlab / screenit-softcite

collecting mentions after processing #4

Closed jameshowison closed 1 year ago

jameshowison commented 1 year ago

I successfully processed the 1500 PMCIDs ... so now I have .software.json files inside the ./data directory hierarchy.

That's great. My goal here is to get output as a CSV with one line per mention. @kermitt2 any suggestions on how to go about that? I can think of two options:

  1. Write new code to recurse through the directories, collecting the JSON and unnesting each entry in the ./mentions list into a pandas dataframe (a rough sketch follows this list)
  2. Use the 'write to mongo' option mentioned here: https://github.com/softcite/software_mentions_client#processing-a-collection-of-pdf-harvested-by-biblio-glutton-harvester-or-article-dataset-builder (which I'm assuming is the route that the softcite-db takes)
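
For option 1, a minimal sketch might look something like this (assuming each .software.json file has a top-level "mentions" list; the nested field names are my guess at the schema, so adjust to what the files actually contain):

```python
import json
from pathlib import Path

import pandas as pd

DATA_DIR = Path("./data")  # root of the processed PMCID hierarchy

rows = []
for json_path in DATA_DIR.rglob("*.software.json"):
    with json_path.open() as f:
        doc = json.load(f)
    # one output row per entry in the document-level "mentions" list
    for mention in doc.get("mentions", []):
        rows.append(
            {
                "file": str(json_path),
                # nested field names below are assumptions; adapt to the real schema
                "software_raw": mention.get("software-name", {}).get("rawForm"),
                "software_norm": mention.get("software-name", {}).get("normalizedForm"),
                "version": (mention.get("version") or {}).get("rawForm"),
                "context": mention.get("context"),
            }
        )

mentions_df = pd.DataFrame(rows)
mentions_df.to_csv("mentions.csv", index=False)
```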

With approach 2 the export then becomes a query against the mongo db, right? That seems like the right way to go.

Hmmm, but https://github.com/softcite/softcite_kb uses ArangoDB; I'm thinking that was a later approach (and the MongoDB step during mention extraction is deprecated?). So it looks like importing the extracted mentions into softcite_kb happened like this: https://github.com/softcite/softcite_kb#import-software-mentions. Ah, wait, that approach uses the export from MongoDB, so extraction should create a MongoDB collection (which I guess can then be queried directly, or exported via this script for adding into the ArangoDB for softcite_kb).

Given that disambiguation etc. happened during the creation of softcite_kb, it looks like the right approach for ScreenIT comparisons is to do the full workflow through to softcite_kb, right? Then use the API to create the export to CSV?

kermitt2 commented 1 year ago

Yes, approach 2 is how I managed the annotations and then loaded them into softcite_kb: via a MongoDB export, which softcite_kb loads into ArangoDB. The advantage of storing in MongoDB is that it can provide statistics with some basic mongo queries (the query language is simple) and it's faster than going through lots of JSON files. But approach 1 is simpler?
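
To give a flavour of those basic mongo queries, something like the pymongo sketch below would count mentions per normalized software name. The database/collection names and the nested field path are assumptions (they depend on the client's config and the annotation schema), so treat it as illustrative only:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
# "softcite" / "annotations" are placeholder names; take the real ones
# from the software_mentions_client configuration
coll = client["softcite"]["annotations"]

# count mentions per normalized software name across all documents,
# assuming each document stores one article's annotations with a "mentions" array
pipeline = [
    {"$unwind": "$mentions"},
    {"$group": {"_id": "$mentions.software-name.normalizedForm",
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in coll.aggregate(pipeline):
    print(row["_id"], row["count"])
```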

The CSV export can be done after loading the annotations into MongoDB, I think; there is no need to go through softcite_kb, which adds quite some complexity (in particular, an additional install of ArangoDB) and is relatively time consuming due to the ArangoDB loading (although 1500 documents is very little!). The disambiguation done by softcite_kb is at corpus level. If you need to compare/benchmark the software mentions in the 1500 PMC full texts, there is no particular need to create the "corpus" level representation of the software entities, I guess.
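
A one-line-per-mention CSV export straight from MongoDB could then look roughly like this (same caveats as above: the database, collection, document identifier, and field names are placeholders to adapt):

```python
import csv

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
coll = client["softcite"]["annotations"]  # placeholder names, see client config

with open("mentions.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["document_id", "software_raw", "software_norm", "version"])
    # "id" as the document identifier is an assumption about the stored records
    for doc in coll.find({}, {"id": 1, "mentions": 1}):
        for m in doc.get("mentions", []):
            writer.writerow([
                doc.get("id"),
                m.get("software-name", {}).get("rawForm"),
                m.get("software-name", {}).get("normalizedForm"),
                (m.get("version") or {}).get("rawForm"),
            ])
```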

jameshowison commented 1 year ago

Hmmm, the question of disambiguation for reasonable comparisons between other tools and ours on smaller datasets is an interesting one.

Perhaps rather than an isolated run of the 1,500 PDFs I should be querying the softcite_kb for mentions found in those 1,500 articles. That way the 'corpus level' steps have already occurred, though within the wider scope of the corpus. But that makes extractions on the focal set dependent on the extent of other data in the softcite_kb.

This is relevant for anyone that might be processing non-open articles (i.e., articles that can't be added to the public softcite_kb, but might still want 'corpus level' disambiguation etc.).

I shall think more on it, but proceed without the softcite_kb step for the ScreenIT 1,500 articles :). Even if we decide to submit extractions taken from the larger softcite_kb, it's good for me to be more familiar with the various steps!

jameshowison commented 1 year ago

Ok, I did this by collecting the .software.json files and munging them with pandas. See https://github.com/howisonlab/screenit-softcite/commit/f5fc05553941a4c1f1703c16cf0221763e49fb7a