pybliometrics-dev / pybliometrics

Python-based API-Wrapper to access Scopus
https://pybliometrics.readthedocs.io/en/stable/
Other
410 stars 128 forks source link

Add funding id to Funding class #211

Closed astrochun closed 2 years ago

astrochun commented 2 years ago

Closes #210

This feature includes a funding_id (a.k.a. and award/grant number) to the Funding nametuple from Scopus xcors:funding-id data.

Note that more than one record is available - PR handles a single award (str) and multiples (list of dict). To ensure a consistent object type, I adopted the Funding funding_id field as a list of strings.

The example in the docs is of a record that does not have a funding-id data. Should we use another example then to reveal the full Funding data, @Michael-E-Rose?

astrochun commented 2 years ago

@Michael-E-Rose is this acceptable?

Michael-E-Rose commented 2 years ago

Hi! Fourmore things:

  1. Could you please rename attribute id to agency_id?
  2. As you suggested, could you please change the example in the docs, and also update them (i.e., show the new attribute)?
  3. Could you use the same example in the tests and update the tests?
  4. In lines 275-278, could you please use the .get() statement only once instead of four times, and instead of the if-else statement a try-except of some form? For example, try iterating like in line 278, and if that doesn't work, do the except statement that works on a single ID?
astrochun commented 2 years ago
  1. Could you please rename attribute id to agency_id?

I debated on doing this because it could break backward compatibility, right?

I'll work on 2, 3, and 4. I think I understand what you prefer (variable definition) and I can try try/except.

Michael-E-Rose commented 2 years ago

Re 1: Yes, that's true, but this is a minor thing: It doesn't affect a property, but instead a field name of a namedtuple which I think is not widely used. Additionally, the change is very small. We can capture that with a minor release.

Thanks for the new changes! You could simplify the new auxiliary function further:

def _funding_id(f_dict: dict) -> list:
    funding_get = f_dict.get('xocs:funding-id', [])
    try:
        return [v['$'] for v in funding_get]  # multiple or no data
    except TypeError:
        return [funding_get]  # single
astrochun commented 2 years ago

Re 1: Yes, that's true, but this is a minor thing: It doesn't affect a property, but instead a field name of a namedtuple which I think is not widely used. Additionally, the change is very small. We can capture that with a minor release.

OK.

Thanks for the new changes! You could simplify the new auxiliary function further:

def _funding_id(f_dict: dict) -> list:
    funding_get = f_dict.get('xocs:funding-id', [])
    try:
        return [v['$'] for v in funding_get]  # multiple or no data
    except TypeError:
        return [funding_get]  # single

Wouldn't it be better to have an empty list instead of a list with None when there's no xocs:funding-id?

Michael-E-Rose commented 2 years ago

If there is no 'xocs:funding-id' key in the json, then funding_get is an empty list. Iterating over the empty list with return [v['$'] for v in funding_get] returns an empty list as well.

astrochun commented 2 years ago

If there is no 'xocs:funding-id' key in the json, then funding_get is an empty list. Iterating over the empty list with return [v['$'] for v in funding_get] returns an empty list as well.

I don't think that is the case. Your example in the docs for eid='2-s2.0-85040230676' will return [None] when xocs:funding-id is not available (dict.get() always return None when a key is unavailable). Thus, this is the outcome with your suggestion:

In [2]: ab = abstract_retrieval.AbstractRetrieval('2-s2.0-85040230676', view='FULL')
In [3]: ab.funding
Out[3]: 
[Funding(agency=None, string='Astbury Biostructure Laboratory', id=None,
         funding_id=[None], acronym=None, country=None),
 Funding(agency=None, string='Netherlands Organization for Scientific Research', id=None,
         funding_id=['NWO-VICI-91812628'], acronym=None, country=None),
 Funding(agency='Wellcome Trust', string='Wellcome Trust', id='http://data.elsevier.com/vocabulary/SciValFunders/100010269',
         funding_id=['102572/B/13/Z'], acronym='WT', country='http://sws.geonames.org/2635167/'),
 Funding(agency='University of Leeds', string='University of Leeds', id='http://data.elsevier.com/vocabulary/SciValFunders/501100000777',
         funding_id=['108466/Z/15/Z'], acronym=None, country='http://sws.geonames.org/2635167/'),
 Funding(agency='Deutsche Forschungsgemeinschaft', string='German Research Foundation', id='http://data.elsevier.com/vocabulary/SciValFunders/501100001659',
         funding_id=['SFB685'], acronym='DFG', country='http://sws.geonames.org/2921044/')]
astrochun commented 2 years ago
  1. As you suggested, could you please change the example in the docs, and also update them (i.e., show the new attribute)?
  2. Could you use the same example in the tests and update the tests?

@Michael-E-Rose, I looked into this and it's a bit more complicated. First, the example in the docs is different from the one in the tests. The one used in tests for funding ("2-s2.0-85040230676") is also being used for chemicals and sequencebank. As such, a change to the test to match the docs requires updating other tests functions. I therefore left the same eid in the test and simply updated the test assertions to ensure that it passes.

As far as the docs, the one used in test would provide a very long example as it has many Funding object. This would also require changing the ab_funding.chemicals and ab_funding.sequencebank, which are longer. IMHO, the example you had should be sufficient to expose the AbstractRetrieval.funding property. It will show:


>>> ab_fund = AbstractRetrieval("2-s2.0-85053478849", view="FULL")
>>> ab_fund.funding
[Funding(agency=None, string='CNRT “Nickel et son Environnement',
 agency_id=None, funding_id=[], acronym=None, country=None)]
>>> ab_fund.funding_text
'The authors gratefully acknowledge CNRT “Nickel et son Environnement” for
providing the financial support. The results reported in this publication
are gathered from the CNRT report “Ecomine BioTop”.'
>>> ab_fund.chemicals
[Chemical(source='esbd', chemical_name='calcium', cas_registry_number='7440-70-2;14092-94-5'),
 Chemical(source='esbd', chemical_name='magnesium', cas_registry_number='7439-95-4'),
 Chemical(source='nlm', chemical_name='Fertilizers', cas_registry_number=None),
 Chemical(source='nlm', chemical_name='Sewage', cas_registry_number=None),
 Chemical(source='nlm', chemical_name='Soil', cas_registry_number=None)]
>>> ab_fund.sequencebank
[Sequencebank(name='GENBANK', sequence_number='MH150839:MH150870', type='submitted')]
astrochun commented 2 years ago

@Michael-E-Rose, I've tested this dev version and it works as intended for several hundred records. Could this be merged? See above for additional comments. I've updated the code, docs, and tests.

Michael-E-Rose commented 2 years ago

Hi again! Sorry, usually I respond faster, but you caught me during holidays. In the example document that you use, is the 'xocs:funding-id missing or is its value None? Because if it's former, then one can do as I suggested (which would be marginally faster and requires less code). The trick is to change the default value of .get(): f_dict.get('xocs:funding-id', []) - instead of None, it returns []. However, one should change the next line to return [v['$'] for v in funding_get] or None - because I want to avoid empty lists.

About the examples for the docs: No problem, I will take care of that.

astrochun commented 2 years ago

Hi again! Sorry, usually I respond faster, but you caught me during holidays. In the example document that you use, is the 'xocs:funding-id missing or is its value None? Because if it's former, then one can do as I suggested (which would be marginally faster and requires less code). The trick is to change the default value of .get(): f_dict.get('xocs:funding-id', []) - instead of None, it returns []. However, one should change the next line to return [v['$'] for v in funding_get] or None - because I want to avoid empty lists.

Nice idea. Done and Done.

Any interest in participating in Hacktoberfest?

Michael-E-Rose commented 2 years ago

Thanks for the quick commit! One last think I just stumbled upon while looking through the changes: Does it happen that there are multiple funding IDs for one agency ID? In other words, is this a 1:n relationship or 1:1? Because if it's the latter, then putting the funding_id into a list is not the optimal choice...

Michael-E-Rose commented 2 years ago

PS: I'll have a look at the Hacktoberfest, although the real one would have ended last weekend ;)

astrochun commented 2 years ago

Thanks for the quick commit! One last think I just stumbled upon while looking through the changes: Does it happen that there are multiple funding IDs for one agency ID? In other words, is this a 1:n relationship or 1:1? Because if it's the latter, then putting the funding_id into a list is not the optimal choice...

It can be 1:n, which is why I was pushing for a list (even empty) that way it was consistent throughout.

Here's an example with single, multiple, and no record:

In [1]: from pybliometrics.scopus import AbstractRetrieval
In [2]: ab = AbstractRetrieval('10.1016/j.physleta.2020.126262', view='FULL')
In [3]: ab.funding
Out[3]: 
[Funding(agency=None, string='NSF PHY-1506122', agency_id=None,
         funding_id=['DOE DE-FG02-97ER25308'], acronym=None, country=None),
 Funding(agency='National Science Foundation', string='NSF', agency_id='http://data.elsevier.com/vocabulary/SciValFunders/100000001',
         funding_id=['PHY-1506122'], acronym='NSF', country='http://sws.geonames.org/6252001/'),
 Funding(agency='U.S. Department of Energy', string='DOE', agency_id='http://data.elsevier.com/vocabulary/SciValFunders/100000015',
         funding_id=['DE-FG02-97ER25308', 'DE-NA0003764', 'NNSA 83228-10966'], acronym='USDOE', country='http://sws.geonames.org/6252001/'),
 Funding(agency='National Nuclear Security Administration', string='NNSA', agency_id='http://data.elsevier.com/vocabulary/SciValFunders/100006168',
         funding_id=None, acronym='NNSA', country='http://sws.geonames.org/6252001/')]
astrochun commented 2 years ago

PS: I'll have a look at the Hacktoberfest, although the real one would have ended last weekend ;)

@Michael-E-Rose, the Hacktoberfest page says that a PR with the hacktoberfest-accepted label should suffice. If you wish to participate as as maintainer I do think you need to update the topics to include hacktoberfest

Michael-E-Rose commented 2 years ago

And what would be the main benefit of participating?

astrochun commented 2 years ago

And what would be the main benefit of participating?

You can identify some issues that you feel are good first issues for other OSS developers to help improve your codebase. Also, for maintainers, you can get a free T-shirt: https://hacktoberfest.digitalocean.com/resources/maintainers

Michael-E-Rose commented 2 years ago

I keep this in mind for next year! Right now I can't think of any issue; pybliometrics todate is mostly reacting to changes of Scopus.