sckott / habanero

client for Crossref search API
https://habanero.readthedocs.io
MIT License
207 stars 30 forks

Get DOI from query? Convert to dataframe? #107

Closed robtlx closed 2 years ago

robtlx commented 2 years ago

Hello!

How would I go about extracting the DOI from a query result? I tried a variant from here, but I get a KeyError on 'DOI' in [ z['DOI'] for z in x['message']['items'] ] and I don't really know how to proceed.

I tried converting the query results to a dataframe, but that puts most of the results under a single column instead of splitting them tidily.

I'm still a beginner in Python so please keep in mind some terms might be confusing.

My endgame is to get a column of DOIs which I can then compare to another column I've already generated - seeing what relevant journals I haven't collected already.
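A minimal sketch of that comparison, assuming both columns are available as plain Python lists of DOI strings. The helper name `missing_dois` is hypothetical and not part of habanero. DOIs are case-insensitive, so the sketch lowercases them before comparing.

```python
def missing_dois(found, already_have):
    """Return DOIs from `found` that are not in `already_have`.

    Comparison ignores case and surrounding whitespace; the original
    spelling and order of `found` are preserved, duplicates are dropped.
    """
    have = {d.strip().lower() for d in already_have}
    seen = set()
    missing = []
    for doi in found:
        key = doi.strip().lower()
        if key not in have and key not in seen:
            seen.add(key)
            missing.append(doi)
    return missing


# Made-up DOIs for illustration only
from_query = ["10.1000/A", "10.1000/b", "10.1000/b"]
collected = ["10.1000/a"]
print(missing_dois(from_query, collected))
```

With pandas, the same lists could come from something like `df["doi"].tolist()`, where the column name is an assumption.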

Thank you!

robtlx commented 2 years ago

Managed to solve this by rewriting things like this: for i in crossref_results['message']['items']: doi = i['DOI']

But now I'm running into a different issue. If I stick with the maximum of 1000 results, everything is fine, but obviously I want more than 1000. If I set cursor="*", it runs for quite a while, but then I get a "TypeError: list indices must be integers or slices, not str" for the first line (for i in crossref_results).

I tried printing the iterated element ("i" or "doi" in my case), but nothing is printed; I just get this error.

Is anything possible?

sckott commented 2 years ago

thanks for your question

What Python version are you using? And what habanero version? You can get the habanero version like this:

import habanero
habanero.__version__

So if you run the below example, you get a key error?

from habanero import Crossref
cr = Crossref()
x = cr.works(filter = {'has_full_text': True})
[z['DOI'] for z in x['message']['items']]

If the above works for you, please share the full example so I can see why you are getting the error.

Yes, using cursor pagination will take a while if you are not filtering the query in any way since there are a lot of records to page through.

The docs you linked to have an example of how to work through the results when using cursor; see the example under the heading "# Deep paging, using the cursor parameter".
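A rough sketch of the idea behind that docs example: with cursor set, works() typically returns a list of response dicts (one per page) rather than a single dict, so indexing the result with ['message'] raises the TypeError quoted above. The helper name `extract_dois` is hypothetical, and skipping items without a 'DOI' key is a defensive assumption rather than documented behavior.

```python
def extract_dois(results):
    """Collect DOIs from one response dict or from a list of them."""
    # A non-cursor call gives one dict; a cursor call usually gives a list.
    pages = [results] if isinstance(results, dict) else results
    dois = []
    for page in pages:
        for item in page["message"]["items"]:
            doi = item.get("DOI")  # assumption: tolerate items with no DOI
            if doi is not None:
                dois.append(doi)
    return dois


# Hand-made stand-in data shaped like the responses, not real Crossref output
single = {"message": {"items": [{"DOI": "10.1000/a"}, {"title": ["no doi"]}]}}
paged = [single, {"message": {"items": [{"DOI": "10.1000/b"}]}}]
print(extract_dois(single))
print(extract_dois(paged))
```

With habanero this could be called as extract_dois(cr.works(query="widget", cursor="*", cursor_max=2000)), where the query and the cursor_max value are placeholders.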

robtlx commented 2 years ago

Thank you for the reply and sorry to bother!

I'm on Habanero 1.2.2 and tried it on two different machines running Python 3.6 and 3.8.

I managed to work around the first problem, but I tried running the code snippet you posted and it now gives no errors; PyCharm CE just flags the second statement as having no effect, and nothing is returned either. Also, I am not interested in browsing through all full-text publications, but rather in searching for DOIs I already have. I worked around that by figuring out a different approach to the checking, and am now working at a more general level, specifically ISSNs: I got it working by looping through my ISSNs and querying cr.journals(ids='') for each one.
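A sketch of what such a loop might look like. The helper `journals_by_issn` and its `lookup` argument are hypothetical; passing the lookup in as a function keeps the loop testable without network access. It assumes each response is a dict with a 'message' key, and that a failed lookup should be recorded as None rather than stopping the loop.

```python
def journals_by_issn(issns, lookup):
    """Call `lookup(issn)` for each ISSN and keep the 'message' part.

    ISSNs whose lookup raises (unknown ISSN, network error, ...) map to None.
    """
    results = {}
    for issn in issns:
        try:
            results[issn] = lookup(issn)["message"]
        except Exception:  # broad on purpose: this is only a sketch
            results[issn] = None
    return results


# Fake lookup standing in for the real API call; the data is made up
def fake_lookup(issn):
    known = {"1234-5678": {"message": {"title": "Example Journal"}}}
    return known[issn]  # raises KeyError for unknown ISSNs


found = journals_by_issn(["1234-5678", "0000-0000"], fake_lookup)
print(found)
```

With habanero, the lookup could be lambda issn: cr.journals(ids=issn), with cr = Crossref() as in the earlier snippet.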

Thank you again for the help!

sckott commented 2 years ago

Great, nice work figuring it out